
Why go beyond DFT error metrics?

Conventional MLIP benchmarks measure how closely a model reproduces DFT-computed energies and forces on a held-out test set. This approach has three well-known failure modes:
  • Data leakage — test structures are often drawn from the same distribution as training data, inflating apparent accuracy.
  • Limited transferability — low force errors on a dataset do not guarantee correct behavior for out-of-distribution structures or physical conditions.
  • DFT reference dependence — models trained on different functionals (PBE, r2SCAN, HSE) are hard to compare fairly against a single reference.
MLIP Arena moves the evaluation target from reproducing DFT numbers to producing physically consistent behavior. Each benchmark probes a concrete physical question: Does the model predict the correct dissociation limit? Does it keep a bulk crystal stable under compression? Does it reproduce the reaction enthalpy for hydrogen combustion? These questions have answers that do not depend on a specific DFT functional.
MLIP Arena was accepted as a NeurIPS 2025 Spotlight and an ICLR 2025 AI4Mat Spotlight. Read the paper.

Benchmark categories

Benchmarks are grouped into two categories that reflect different aspects of model quality.

Fundamentals

Fundamentals benchmarks test whether a model has learned the basic rules of interatomic interactions. They use static calculations (energy and force evaluations at fixed geometries) and do not require running molecular dynamics. They are therefore fast to run and easy to interpret.
| Benchmark | What it tests | Status |
| --- | --- | --- |
| Homonuclear diatomics | Dissociation energy curves for elemental pairs A₂ | Active |
| Equation of state | Energy–volume curves and bulk moduli for WBM bulk crystals | Active |
| Energy-volume scans | Energy–volume profiles across the full WBM dataset | Active |

Molecular Dynamics

Molecular Dynamics benchmarks test whether a model can sustain a simulation without diverging, crashing, or producing unphysical trajectories. They reveal failure modes that static benchmarks cannot detect.
| Benchmark | What it tests | Status |
| --- | --- | --- |
| MD Stability | NVT/NPT survival rate under heating and compression ramps | Active |
| Combustion | Hydrogen combustion yield, reaction enthalpy, and COM drift | Active |

The modular task system

Every benchmark is built from composable tasks in mlip_arena.tasks. This means individual calculations are reusable across benchmarks and can be orchestrated in parallel with Prefect. The core tasks are:
| Task | Import | Description |
| --- | --- | --- |
| OPT | mlip_arena.tasks.optimize | Structure optimization (BFGS, FrechetCell filter) |
| EOS | mlip_arena.tasks.eos | Equation of state via Birch-Murnaghan fit |
| MD | mlip_arena.tasks.md | Molecular dynamics (NVE, NVT, NPT, temperature scheduling) |
| PHONON | mlip_arena.tasks.phonon | Phonon calculation via phonopy |
| NEB | mlip_arena.tasks.neb | Nudged elastic band |
| ELASTICITY | mlip_arena.tasks.elasticity | Elastic tensor calculation |
Benchmarks combine these tasks and wrap them in Prefect flows for parallel dispatch across HPC clusters.
from prefect import flow
from mlip_arena.models import MLIPEnum
from mlip_arena.tasks import MD
from mlip_arena.tasks.utils import get_calculator
from ase.build import bulk

# One Cu fcc supercell shared by every model run
atoms = bulk("Cu", "fcc", a=3.6) * (5, 5, 5)

@flow
def run_all_models():
    futures = []
    for model in MLIPEnum:
        # submit() dispatches each MD task asynchronously
        future = MD.submit(
            atoms=atoms,
            calculator=get_calculator(model),
            ensemble="nvt",
            total_time=1000,
            time_step=2,
        )
        futures.append(future)
    # Collect results without aborting on individual model failures
    return [f.result(raise_on_failure=False) for f in futures]
Prefect caches task results automatically. Re-running a benchmark after adding a new model only executes the new model’s tasks.
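The caching keys each stored result on a hash of the task's inputs, so a call with inputs seen before short-circuits to the cached value. A minimal stand-in for the idea (this is an illustration of content-hash memoization, not Prefect's actual implementation):

```python
import functools
import hashlib
import json

def input_hash_cache(fn):
    """Memoize on a content hash of the arguments
    (conceptually like Prefect's input-based cache keys)."""
    cache = {}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        key = hashlib.sha256(
            json.dumps([args, kwargs], sort_keys=True, default=str).encode()
        ).hexdigest()
        if key not in cache:
            cache[key] = fn(*args, **kwargs)  # miss: run and store
        return cache[key]                     # hit: skip execution

    return wrapper

calls = []

@input_hash_cache
def expensive_md(model, steps):
    calls.append(model)  # record real executions
    return f"trajectory:{model}:{steps}"

expensive_md("mace", 500)
expensive_md("mace", 500)  # identical inputs -> cache hit, body not re-run
expensive_md("orb", 500)   # new model -> executed
```

This is why adding one model to a completed benchmark run costs only that model's tasks.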

Live leaderboard

All benchmark results are visualized in real time on the MLIP Arena Hugging Face Space: huggingface.co/spaces/atomind/mlip-arena. The leaderboard is a Streamlit application. For each benchmark you can:
  • Select which models to display
  • Toggle between individual curves and aggregate metrics
  • Download raw result data
The leaderboard reflects the results stored in benchmarks/ in the main repository. Results are updated when new benchmark runs are merged.

Running benchmarks locally

Install the package first:
pip install mlip-arena
For benchmarks that require pretrained model weights, install from source:
git clone https://github.com/atomind-ai/mlip-arena.git
cd mlip-arena
bash scripts/install.sh
Then follow the per-benchmark instructions in each benchmarks/<name>/ folder. Most benchmarks provide either a Python script (run.py) or a Jupyter notebook (run.ipynb).
GPU-based benchmarks (all tasks listed under gpu-tasks in the model registry) require a CUDA-capable device. The Stability and EOS benchmarks are designed for HPC clusters using SLURM and Dask-JobQueue.

Benchmark pages

Homonuclear diatomics

Dissociation energy curves for all elemental homonuclear pairs.

Equation of state

Bulk moduli and E-V curves for WBM bulk crystals.

Energy-volume scans

Energy-volume profiles across the WBM dataset without structure relaxation.

MD Stability

NVT/NPT simulation survival rate under heating and compression.

Combustion

Hydrogen combustion yield and reaction enthalpy from reactive MD.