Why go beyond DFT error metrics?
Conventional MLIP benchmarks measure how closely a model reproduces DFT-computed energies and forces on a held-out test set. This approach has three well-known failure modes:

- Data leakage — test structures are often drawn from the same distribution as the training data, inflating apparent accuracy.
- Limited transferability — low force errors on a dataset do not guarantee correct behavior for out-of-distribution structures or physical conditions.
- DFT reference dependence — models trained on different functionals (PBE, r2SCAN, HSE) are hard to compare fairly against a single reference.
MLIP Arena was accepted as a NeurIPS 2025 Spotlight and an ICLR 2025 AI4Mat Spotlight.
Benchmark categories
Benchmarks are grouped into two categories that reflect different aspects of model quality.

Fundamentals
Fundamentals benchmarks test whether a model has learned the basic rules of interatomic interactions. They use static calculations (energy and force evaluations at fixed geometries) and do not require running molecular dynamics. They are therefore fast to run and easy to interpret.

| Benchmark | What it tests | Status |
|---|---|---|
| Homonuclear diatomics | Dissociation energy curves for elemental pairs A₂ | Active |
| Equation of state | Energy–volume curves and bulk moduli for WBM bulk crystals | Active |
| Energy-volume scans | Energy–volume profiles across the full WBM dataset | Active |
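
The equation-of-state benchmark fits a third-order Birch-Murnaghan curve to fixed-volume energy evaluations. A self-contained sketch of that fit, using synthetic E-V data in place of real MLIP predictions (all numerical values are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def birch_murnaghan(V, E0, V0, B0, B0p):
    """Third-order Birch-Murnaghan energy-volume relation
    (E0 in eV, V0 in A^3, B0 in eV/A^3, B0p dimensionless)."""
    eta = (V0 / V) ** (2.0 / 3.0)
    return E0 + 9.0 * V0 * B0 / 16.0 * (
        (eta - 1.0) ** 3 * B0p + (eta - 1.0) ** 2 * (6.0 - 4.0 * eta)
    )

# Synthetic E-V points standing in for fixed-volume MLIP evaluations
rng = np.random.default_rng(0)
V = np.linspace(14.0, 22.0, 15)
E = birch_murnaghan(V, -5.0, 17.8, 0.62, 4.2) + rng.normal(0.0, 1e-4, V.size)

popt, _ = curve_fit(birch_murnaghan, V, E, p0=(E.min(), V[np.argmin(E)], 0.5, 4.0))
E0, V0, B0, B0p = popt
print(f"V0 = {V0:.2f} A^3, B0 = {B0 * 160.2177:.1f} GPa")
```

The equilibrium volume and bulk modulus come directly from the fitted parameters; the factor 160.2177 converts eV/Å³ to GPa.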
Molecular Dynamics
Molecular Dynamics benchmarks test whether a model can sustain a simulation without diverging, crashing, or producing unphysical trajectories. They reveal failure modes that static benchmarks cannot detect.

| Benchmark | What it tests | Status |
|---|---|---|
| MD Stability | NVT/NPT survival rate under heating and compression ramps | Active |
| Combustion | Hydrogen combustion yield, reaction enthalpy, and COM drift | Active |
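
A survival rate scores how far a trajectory gets before it diverges. A minimal sketch of one such criterion on a synthetic energy trace (the thresholds and checks here are illustrative; the actual benchmark's criterion may differ):

```python
import numpy as np

def survival_fraction(energies, max_drift=10.0):
    """Fraction of the trajectory completed before the total energy
    becomes non-finite or drifts more than `max_drift` (eV) from its
    initial value."""
    energies = np.asarray(energies, dtype=float)
    bad = ~np.isfinite(energies) | (np.abs(energies - energies[0]) > max_drift)
    first_bad = np.argmax(bad) if bad.any() else energies.size
    return first_bad / energies.size

# A well-behaved trace vs. one that blows up after step 600
stable = np.full(1000, -5.0) + 0.01 * np.sin(np.arange(1000))
diverging = stable.copy()
diverging[600:] += np.exp(np.linspace(0.0, 20.0, 400))
print(survival_fraction(stable), survival_fraction(diverging))
```

A model that keeps the first trace for the full run scores 1.0; the diverging trace is cut off where the energy drift exceeds the threshold.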
The modular task system
Every benchmark is built from composable tasks in `mlip_arena.tasks`. Individual calculations are therefore reusable across benchmarks and can be orchestrated in parallel with Prefect.
The core tasks are:
| Task | Import | Description |
|---|---|---|
| OPT | `mlip_arena.tasks.optimize` | Structure optimization (BFGS, FrechetCellFilter) |
| EOS | `mlip_arena.tasks.eos` | Equation of state via Birch-Murnaghan fit |
| MD | `mlip_arena.tasks.md` | Molecular dynamics (NVE, NVT, NPT, temperature scheduling) |
| PHONON | `mlip_arena.tasks.phonon` | Phonon calculation via phonopy |
| NEB | `mlip_arena.tasks.neb` | Nudged elastic band |
| ELASTICITY | `mlip_arena.tasks.elasticity` | Elastic tensor calculation |
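
Because each task is an ordinary Python callable, a benchmark can fan independent calculations out in parallel; MLIP Arena uses Prefect for this orchestration. The same fan-out pattern, sketched with the standard library as a dependency-free stand-in (`eos_point` is a toy placeholder, not a real task from the package):

```python
from concurrent.futures import ThreadPoolExecutor

def eos_point(volume):
    """Toy stand-in for one fixed-volume energy evaluation;
    a real task would invoke an MLIP calculator here."""
    v0 = 17.8  # illustrative equilibrium volume
    return 0.5 * (volume - v0) ** 2

volumes = [16.0, 17.0, 18.0, 19.0]
with ThreadPoolExecutor() as pool:  # Prefect's task.submit() plays this role
    energies = list(pool.map(eos_point, volumes))
print(energies)
```

Each volume is evaluated independently, so the points of an E-V scan (or any other embarrassingly parallel sweep) can run concurrently and be collected for the downstream fit.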
Live leaderboard
All benchmark results are visualized in real time on the MLIP Arena Hugging Face Space: huggingface.co/spaces/atomind/mlip-arena

The leaderboard is a Streamlit application. For each benchmark you can:

- Select which models to display
- Toggle between individual curves and aggregate metrics
- Download raw result data
The leaderboard reflects the results stored in `benchmarks/` in the main repository. Results are updated when new benchmark runs are merged.

Running benchmarks locally

Install the package first, then run each benchmark from its `benchmarks/<name>/` folder. Most benchmarks provide either a Python script (`run.py`) or a Jupyter notebook (`run.ipynb`).
Benchmark pages
Homonuclear diatomics
Dissociation energy curves for all elemental homonuclear pairs.
Equation of state
Bulk moduli and E-V curves for WBM bulk crystals.
Energy-volume scans
Energy-volume profiles across the WBM dataset without structure relaxation.
MD Stability
NVT/NPT simulation survival rate under heating and compression.
Combustion
Hydrogen combustion yield and reaction enthalpy from reactive MD.