Physical motivation
Running molecular dynamics simulations at elevated temperatures or pressures is one of the most demanding practical uses of an MLIP. A model that looks good on static benchmarks can fail catastrophically in MD by producing unbounded forces, violating energy conservation, or crashing within picoseconds of simulation time. This benchmark quantifies the simulation survival rate, that is, the fraction of MD runs a model completes without failing. It also measures inference speed (steps per second as a function of system size), which determines the practical cost of using a model for long-timescale simulations.
Two protocols are tested:
- Heating (NVT): isochoric-isothermal molecular dynamics with a temperature ramp from 300 K to 3000 K over 10 ps
- Compression (NPT): isothermal-isobaric molecular dynamics with simultaneous temperature (300 K → 3000 K) and pressure (0 GPa → 500 GPa) ramps over 10 ps
Structures tested
Simulations are run on structures from the RM24 dataset, which contains a diverse set of inorganic crystal structures spanning multiple chemical families.
Temperature and pressure ranges
| Protocol | Ensemble | Temperature range | Pressure range | Duration |
|---|---|---|---|---|
| Heating | NVT (Nosé-Hoover) | 300 K → 3000 K | N/A | 10 ps |
| Compression | NPT (Nosé-Hoover) | 300 K → 3000 K | 0 → 500 GPa | 10 ps |
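
To make the ramp endpoints in the table concrete, here is a minimal sketch of the per-step target values. It assumes a linear ramp and a 1 fs timestep; neither is stated on this page, and the names `linear_ramp`, `TIMESTEP_FS`, and `DURATION_PS` are hypothetical and not part of the benchmark code.

```python
import numpy as np


def linear_ramp(start, end, n_steps):
    """Return n_steps + 1 evenly spaced target values from start to end."""
    return np.linspace(start, end, n_steps + 1)


# Assumptions for illustration only: a 1 fs timestep and a linear schedule.
TIMESTEP_FS = 1.0
DURATION_PS = 10.0
n_steps = int(round(DURATION_PS * 1000.0 / TIMESTEP_FS))

# Targets shared by both protocols (temperature) and NPT only (pressure).
temperature_K = linear_ramp(300.0, 3000.0, n_steps)
pressure_GPa = linear_ramp(0.0, 500.0, n_steps)
```

Under these assumptions the thermostat target rises by 0.27 K per step and the barostat target by 0.05 GPa per step, which gives a sense of how aggressive both protocols are.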
Metrics
| Metric | Description |
|---|---|
| Valid runs (%) | Percentage of simulations that complete the full duration without crashing |
| Normalized final step | Final simulation step divided by total target steps — a continuous survival proxy |
| Steps per second | MD throughput as a function of number of atoms (log-log scale) |
| Power-law scaling exponent | Fitted exponent n where steps/s ∝ N⁻ⁿ — measures how throughput degrades with system size |
Inference speed is measured and plotted on a log-log scale with power-law fits. The exponent n reflects the model’s scaling with system size — lower n means better scalability to large systems.
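
The scaling exponent can be estimated with an ordinary least-squares line fit in log-log space. The sketch below assumes that approach; the benchmark's own fitting code is not shown here and may differ. The function name `fit_power_law` and the data are synthetic and purely illustrative.

```python
import numpy as np


def fit_power_law(n_atoms, steps_per_second):
    """Fit steps/s = A * N**(-n) by linear regression in log-log space.

    Returns the pair (n, A).
    """
    log_n = np.log(np.asarray(n_atoms, dtype=float))
    log_s = np.log(np.asarray(steps_per_second, dtype=float))
    # In log space the model is: log(s) = log(A) - n * log(N).
    slope, intercept = np.polyfit(log_n, log_s, 1)
    return -slope, np.exp(intercept)


# Synthetic data (not benchmark results): throughput falling as 1/N.
atoms = np.array([16.0, 64.0, 256.0, 1024.0])
speed = 2000.0 / atoms

n, prefactor = fit_power_law(atoms, speed)
```

With this synthetic input the fit recovers n close to 1, meaning throughput halves when the atom count doubles. A smaller n would indicate better scaling.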
Model support
The following models have results for the stability benchmark. Support requires the gpu-tasks: stability entry and, for NPT, npt: true in the model registry.
| Model | NVT support | NPT support | Training data |
|---|---|---|---|
| MACE-MP(M) | Yes | Yes | MPTrj |
| MACE-MPA | Yes | Yes | MPTrj, Alexandria |
| CHGNet | Yes | Yes | MPTrj |
| M3GNet | Yes | Yes | MPF |
| MatterSim | Yes | Yes | MPTrj, Alexandria |
| ORBv2 | Yes | Yes | MPTrj, Alexandria |
| ORB | Yes | Yes | MPTrj, Alexandria |
| SevenNet | Yes | Yes | MPTrj |
How to run
Two Jupyter notebooks orchestrate the benchmark runs:
- benchmarks/stability/temperature.ipynb: NVT heating runs
- benchmarks/stability/pressure.ipynb: NPT compression runs
Configure SLURM
Edit the cluster settings in benchmarks/stability/run.py. The default allocates 4 GPUs per node with a 4-hour wall time.
Run heating simulations
Open and run benchmarks/stability/temperature.ipynb. Results are saved as <model>-heating.parquet files in benchmarks/stability/<family>/.
Run compression simulations
Open and run benchmarks/stability/pressure.ipynb. Results are saved as <model>-compression.parquet files.
Interpreting results
Survival rate is the primary metric. Completing 100% of NVT heating runs is a prerequisite for using a model in production MD simulations at elevated temperatures. Models with survival rates below 50% should not be used for dynamics without careful per-system validation.
Inference speed determines practical usability. The log-log plot of speed versus atom count reveals:
- The absolute throughput at a given system size
- How throughput degrades as system size grows (the power-law exponent)
- Which models have favorable message-passing architectures that scale sub-quadratically with atom count
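
For the survival metrics, a minimal sketch of how the two numbers could be derived from the last step each run reached. It assumes a run counts as valid when it reaches the target step count and that the normalized final step is averaged across runs. The function name `survival_metrics` and the input values are hypothetical, and the layout of the saved parquet files is not assumed here.

```python
def survival_metrics(final_steps, target_steps):
    """Return (valid_runs_percent, mean_normalized_final_step).

    final_steps: last step reached by each run.
    target_steps: number of steps a complete run should reach.
    """
    if not final_steps:
        raise ValueError("final_steps must contain at least one run")
    # Normalized final step per run, capped at 1.0 for completed runs.
    normalized = [min(step, target_steps) / target_steps for step in final_steps]
    # A run is counted as valid only if it reached the full target.
    valid = sum(1 for step in final_steps if step >= target_steps)
    percent_valid = 100.0 * valid / len(final_steps)
    return percent_valid, sum(normalized) / len(normalized)


# Illustrative input: four runs, one of which stopped a quarter of the way in.
percent_valid, mean_normalized = survival_metrics(
    [10000, 10000, 2500, 10000], target_steps=10000
)
# percent_valid == 75.0, mean_normalized == 0.8125
```

The normalized value adds detail that the percentage hides: two models with the same share of valid runs can differ in how far their failed runs got before stopping.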