NeurIPS 2025 Spotlight — MLIP Arena has been accepted as a Spotlight at the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025).
Foundation machine learning interatomic potentials (MLIPs), trained on extensive databases containing millions of density functional theory (DFT) calculations, have revolutionized molecular and materials modeling. However, existing benchmarks suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific DFT references. MLIP Arena is a unified benchmark platform that evaluates foundation MLIP performance beyond conventional error metrics. It focuses on revealing the physical soundness learned by MLIPs and assessing their utilitarian performance — agnostic to underlying model architecture and training dataset.

Why MLIP Arena?

Beyond Error Metrics

Move past static DFT reference comparisons. MLIP Arena reveals failure modes in real-world physical tasks like MD stability and combustion.

Fair Benchmarking

Reproducible, leakage-free benchmarks designed to be agnostic to model architecture and training dataset.

15+ Foundation Models

Unified interface for MACE-MP, CHGNet, M3GNet, SevenNet, ORBv2, eqV2, eSEN, MatterSim, ALIGNN, ANI2x, and more.

HPC-Scale Workflows

Prefect-powered orchestration for parallel benchmark execution on high-throughput computing clusters.
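
The fan-out pattern that Prefect orchestrates here — one benchmark run per model, executed in parallel — can be sketched with the standard library alone. The model names and the `run_benchmark` function below are placeholders for illustration, not MLIP Arena's actual API; the real platform submits Prefect tasks to HPC task runners.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical benchmark runner; in MLIP Arena this role is played by
# Prefect tasks submitted to a task runner on the cluster.
def run_benchmark(model_name: str) -> dict:
    # ... load the model, execute the task, collect metrics ...
    return {"model": model_name, "status": "completed"}

models = ["MACE-MP(M)", "CHGNet", "SevenNet"]  # placeholder subset

# Fan out one benchmark per model and gather the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_benchmark, models))

for r in results:
    print(r["model"], r["status"])
```

The same fan-out shape scales from a laptop thread pool to a Prefect deployment on a cluster; only the executor changes, not the benchmark logic.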

Key Capabilities

Modular Tasks

OPT, EOS, MD, PHONON, NEB, ELASTICITY — composable and reusable across benchmarks.
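
The composability of these tasks — e.g., a relaxation (OPT) feeding an equation-of-state scan (EOS) — can be sketched as plain functions whose outputs chain together. The names mirror the task labels above, but the signatures and toy energies here are illustrative, not MLIP Arena's real API.

```python
# Placeholder task stubs illustrating OPT -> EOS chaining;
# signatures are hypothetical, not MLIP Arena's actual interface.
def OPT(structure):
    """Relax a structure; here we just tag it as relaxed."""
    return {**structure, "relaxed": True}

def EOS(structure, strains):
    """Scan volumes around a relaxed structure and record energies."""
    assert structure["relaxed"], "EOS expects a relaxed structure"
    v0 = structure["volume"]
    # toy parabolic energy-volume curve with its minimum at v0
    return [(v0 * (1 + s), (s * 10) ** 2) for s in strains]

relaxed = OPT({"formula": "Cu", "volume": 11.8, "relaxed": False})
curve = EOS(relaxed, strains=[-0.02, -0.01, 0.0, 0.01, 0.02])
print(min(curve, key=lambda p: p[1]))  # minimum sits at the relaxed volume
```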

Physical Soundness Tests

Homonuclear diatomics, energy conservation, force equivariance, equation of state.
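
A force-equivariance test checks that rotating the input structure rotates the predicted forces identically: F(Rx) = R F(x). The sketch below runs that check on a toy harmonic pair potential standing in for an MLIP calculator; in the real benchmark the forces would come from the model under test.

```python
import math

def forces(positions, k=1.0, r0=1.0):
    """Forces from a toy harmonic pair potential V = 0.5*k*(d - r0)^2.
    A stand-in for an MLIP calculator, which is what the real test probes."""
    n = len(positions)
    f = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rij = [positions[i][a] - positions[j][a] for a in range(3)]
            d = math.sqrt(sum(c * c for c in rij))
            mag = -k * (d - r0) / d
            for a in range(3):
                f[i][a] += mag * rij[a]
    return f

def rot_z90(v):
    """Rotate a 3-vector by 90 degrees about the z axis."""
    x, y, z = v
    return [-y, x, z]

pos = [[0.0, 0.0, 0.0], [1.3, 0.2, -0.1], [0.4, 1.1, 0.7]]

f_then_rot = [rot_z90(f) for f in forces(pos)]    # R @ F(x)
rot_then_f = forces([rot_z90(p) for p in pos])    # F(R @ x)

max_err = max(abs(a - b) for fa, fb in zip(f_then_rot, rot_then_f)
              for a, b in zip(fa, fb))
print(f"max equivariance error: {max_err:.2e}")
```

For an exactly equivariant model the error is numerically zero; models that only learn approximate symmetry show a measurable residual here.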

Live Leaderboard

Real-time benchmark results visualized on Hugging Face Spaces with interactive Streamlit dashboards.

Benchmark Suite

MLIP Arena evaluates models across two main categories:

Fundamentals — tests of basic physical correctness (e.g., homonuclear diatomics, equation of state).

Molecular Dynamics — tests of dynamics stability and chemistry:
  • MD Stability — long-timescale NVT/NPT simulation stability
  • Combustion — reactive molecular dynamics for combustion reactions
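
A toy version of the kind of criterion an MD stability benchmark might apply: flag an NVT trajectory as unstable once its temperature leaves a tolerance band around the thermostat target. The threshold and the sample trajectories below are illustrative assumptions, not MLIP Arena's actual settings.

```python
def is_stable(temperatures, target=300.0, rel_tol=0.5):
    """Flag an NVT run as unstable if any recorded temperature deviates
    from the thermostat target by more than rel_tol (fractional).
    Illustrative criterion only; not MLIP Arena's actual metric."""
    return all(abs(t - target) <= rel_tol * target for t in temperatures)

stable_run   = [290.0, 305.0, 312.0, 298.0]   # stays near 300 K
unstable_run = [295.0, 340.0, 520.0, 900.0]   # simulation blowing up

print(is_stable(stable_run))    # True
print(is_stable(unstable_run))  # False
```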

Supported Models

| Model       | Family    | Training Data           | Predictions |
|-------------|-----------|-------------------------|-------------|
| MACE-MP(M)  | MACE      | MPTrj                   | EFS         |
| MACE-MPA    | MACE      | MPTrj, Alexandria       | EFS         |
| CHGNet      | CHGNet    | MPTrj                   | EFSM        |
| M3GNet      | MatGL     | MPF                     | EFS         |
| MatterSim   | MatterSim | MPTrj, Alexandria       | EFS         |
| ORBv2       | ORB       | MPTrj, Alexandria       | EFS         |
| SevenNet    | SevenNet  | MPTrj                   | EFS         |
| eqV2(OMat)  | FairChem  | OMat, MPTrj, Alexandria | EFS         |
| eSEN        | FairChem  | OMat, MPTrj, Alexandria | EFS         |
| ANI2x       | ANI       | COMP6                   | EFS         |
| ALIGNN      | ALIGNN    | MP22                    | EFS         |
| DeepMD      | DeepMD    | MPTrj                   | EFS         |

Quick Start

pip install mlip-arena

from mlip_arena.models import MLIPEnum
from mlip_arena.tasks import MD
from mlip_arena.tasks.utils import get_calculator
from ase.build import bulk

atoms = bulk("Cu", "fcc", a=3.6) * (3, 3, 3)

result = MD(
    atoms=atoms,
    calculator=get_calculator(MLIPEnum["MACE-MP(M)"]),
    ensemble="nvt",
    total_time=1000,  # 1 ps
    time_step=2,      # fs
)

Installation

Install from PyPI or build from source with all model dependencies.

Quickstart

Run your first benchmark in minutes.

Citation

If you use MLIP Arena in your research, please cite:
@inproceedings{
    chiang2025mlip,
    title={{MLIP} Arena: Advancing Fairness and Transparency in Machine Learning Interatomic Potentials via an Open, Accessible Benchmark Platform},
    author={Yuan Chiang and Tobias Kreiman and Christine Zhang and Matthew C. Kuner and Elizabeth Jin Weaver and Ishan Amin and Hyunsoo Park and Yunsung Lim and Jihan Kim and Daryl Chrzan and Aron Walsh and Samuel M Blau and Aditi S. Krishnapriyan and Mark Asta},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2025},
    url={https://openreview.net/forum?id=SAT0KPA5UO}
}