
Physical motivation

Running molecular dynamics simulations at elevated temperatures or pressures is one of the most demanding practical uses of an MLIP. A model that looks good on static benchmarks can fail catastrophically in MD — producing unbounded forces, violating energy conservation, or crashing within picoseconds of simulation time. This benchmark quantifies simulation survival rate: what fraction of MD runs does a model complete without failing? It also measures inference speed (steps per second as a function of system size), which determines the practical cost of using a model for long-timescale simulations. Two protocols are tested:
  • Heating (NVT) — isochoric-isothermal molecular dynamics with a temperature ramp from 300 K to 3000 K over 10 ps
  • Compression (NPT) — isothermal-isobaric molecular dynamics with simultaneous temperature (300 K → 3000 K) and pressure (0 GPa → 500 GPa) ramps over 10 ps
The compression protocol is particularly demanding because it requires the barostat to adjust the cell volume while the thermostat ramps the temperature. Models that lack robust NPT integration or produce unphysical stresses under high pressure fail quickly.
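As a concrete sketch of the ramp schedules, both protocols can be viewed as linear interpolation of the target temperature (and, for compression, pressure) over the 10 ps window. This is an illustrative assumption; the actual ramp implementation in the benchmark code may differ:

```python
def ramp(t_ps, start, end, duration_ps=10.0):
    """Linearly interpolate a ramped target value at time t_ps (clamped to the window)."""
    frac = min(max(t_ps / duration_ps, 0.0), 1.0)
    return start + (end - start) * frac

# Heating (NVT): temperature target halfway through the ramp
temp_mid = ramp(5.0, 300.0, 3000.0)    # 1650.0 K

# Compression (NPT): pressure is ramped simultaneously
press_mid = ramp(5.0, 0.0, 500.0)      # 250.0 GPa
```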

Structures tested

Simulations are run on structures from the RM24 dataset, which contains a diverse set of inorganic crystal structures spanning multiple chemical families.

Temperature and pressure ranges

| Protocol | Ensemble | Temperature range | Pressure range | Duration |
| --- | --- | --- | --- | --- |
| Heating | NVT (Nosé-Hoover) | 300 K → 3000 K | N/A | 10 ps |
| Compression | NPT (Nosé-Hoover) | 300 K → 3000 K | 0 → 500 GPa | 10 ps |
Not all models support NPT dynamics. eqV2(OMat), EquiformerV2(OC22), EquiformerV2(OC20), and eSCN(OC20) have npt: false in the registry due to known issues with stress tensor computation.

Metrics

| Metric | Description |
| --- | --- |
| Valid runs (%) | Percentage of simulations that complete the full duration without crashing |
| Normalized final step | Final simulation step divided by total target steps — a continuous survival proxy |
| Steps per second | MD throughput as a function of number of atoms (log-log scale) |
| Power-law scaling exponent | Fitted exponent n where steps/s ∝ N⁻ⁿ — measures how throughput degrades with system size |
The survival plot on the leaderboard shows cumulative valid runs as a function of normalized simulation time. A model that crashes early shows a steep drop in the cumulative curve; a robust model stays near 100% throughout.
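The survival curve described above can be computed directly from each run's normalized final step. A minimal sketch (the leaderboard's actual plotting code may differ):

```python
import numpy as np

def survival_curve(normalized_final_steps, grid=None):
    """Fraction of runs still alive at each normalized time t.

    A run whose normalized final step is s counts as alive for all t <= s,
    so completed runs (s = 1.0) stay in the curve until the end.
    """
    s = np.asarray(normalized_final_steps, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    # For each grid time t, count the fraction of runs whose final step reaches t
    alive = (s[None, :] >= grid[:, None]).mean(axis=1)
    return grid, alive

# Two completed runs, one early crash (40%), one late crash (90%)
grid, alive = survival_curve([1.0, 1.0, 0.4, 0.9])
# alive starts at 1.0 and ends at 0.5 (only the completed runs remain)
```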
Inference speed is measured and plotted on a log-log scale with power-law fits. The exponent n reflects the model’s scaling with system size — lower n means better scalability to large systems.
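The exponent can be recovered with an ordinary least-squares fit in log-log space, since steps/s ∝ N⁻ⁿ becomes linear after taking logarithms. A minimal sketch, assuming a clean power law (the benchmark's own fitting routine may differ):

```python
import numpy as np

def fit_power_law_exponent(n_atoms, steps_per_s):
    """Fit steps/s = C * N**(-n) by linear regression in log-log space."""
    slope, _intercept = np.polyfit(np.log(n_atoms), np.log(steps_per_s), 1)
    return -slope  # steps/s ~ N^(-n), so the log-log slope is -n

# Synthetic throughput data generated with n = 1.5
N = np.array([10, 100, 1000, 10000])
speed = 5000.0 * N ** -1.5
n_fit = fit_power_law_exponent(N, speed)  # recovers 1.5
```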

Model support

The following models have results for the stability benchmark. Support requires the gpu-tasks: stability entry and, for NPT, npt: true in the model registry.
| Model | NVT support | NPT support | Training data |
| --- | --- | --- | --- |
| MACE-MP(M) | Yes | Yes | MPTrj |
| MACE-MPA | Yes | Yes | MPTrj, Alexandria |
| CHGNet | Yes | Yes | MPTrj |
| M3GNet | Yes | Yes | MPF |
| MatterSim | Yes | Yes | MPTrj, Alexandria |
| ORBv2 | Yes | Yes | MPTrj, Alexandria |
| ORB | Yes | Yes | MPTrj, Alexandria |
| SevenNet | Yes | Yes | MPTrj |

How to run

Two Jupyter notebooks orchestrate the benchmark runs:
  • benchmarks/stability/temperature.ipynb — NVT heating runs
  • benchmarks/stability/pressure.ipynb — NPT compression runs
Both notebooks use Prefect flows backed by Dask-JobQueue for parallel dispatch to SLURM.
1. Configure SLURM

Edit the cluster settings in benchmarks/stability/run.py. The default allocates 4 GPUs per node with a 4-hour wall time.
from dask_jobqueue import SLURMCluster

cluster_kwargs = dict(
    cores=1,
    memory="64 GB",
    account="your-account",
    walltime="04:00:00",
    # GPUs are requested via a SLURM directive; adjust to your cluster, e.g.:
    # job_extra_directives=["--gres=gpu:4"],
)
cluster = SLURMCluster(**cluster_kwargs)
# Autoscale between 10 and 50 SLURM jobs depending on queued work
cluster.adapt(minimum_jobs=10, maximum_jobs=50)
2. Run heating simulations

Open and run benchmarks/stability/temperature.ipynb. Results are saved as <model>-heating.parquet files in benchmarks/stability/<family>/.
3. Run compression simulations

Open and run benchmarks/stability/pressure.ipynb. Results are saved as <model>-compression.parquet files.
4. Analyze results

Open benchmarks/stability/plot.ipynb to generate the survival and speed-scaling figures. Alternatively, run the analysis script directly:
python benchmarks/stability/analysis.py
The temperature.ipynb notebook is the recommended starting point. It runs NVT simulations, which are supported by all models in the registry.

Interpreting results

Survival rate is the primary metric. Surviving 100% of NVT heating runs is a prerequisite for production MD simulations at elevated temperatures; models with survival rates below 50% should not be used for dynamics without careful per-system validation. Inference speed determines practical usability. The log-log speed vs. atoms plot reveals:
  • The absolute throughput at a given system size
  • How throughput degrades as system size grows (the power-law exponent)
  • Which models have message-passing architectures that scale sub-quadratically with atom count
NPT vs. NVT survival comparison reveals whether a model’s stress tensor implementation is reliable. A large drop in survival rate from NVT to NPT indicates stress-related failures under compression.
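The two survival metrics discussed above can be computed from per-run records. A minimal sketch — the record layout (final step, target steps) is an assumption for illustration, not the benchmark's actual parquet schema:

```python
def summarize_runs(records):
    """Compute valid-run % and mean normalized final step.

    records: list of (final_step, target_steps) tuples, one per MD run.
    """
    norm = [final / target for final, target in records]
    valid_pct = 100.0 * sum(n >= 1.0 for n in norm) / len(norm)
    mean_norm = sum(norm) / len(norm)
    return valid_pct, mean_norm

# Two completed runs and one that crashed at step 3200 of 10000
runs = [(10000, 10000), (10000, 10000), (3200, 10000)]
valid_pct, mean_norm = summarize_runs(runs)  # ~66.7% valid, mean ~0.77
```

Comparing these numbers between the heating and compression results for the same model isolates stress-related failures: a model whose valid-run percentage holds up under NVT but collapses under NPT is failing in the barostat, not the thermostat.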