Philosophy
MLIP Arena benchmarks are built by reusing and chaining existing tasks. A benchmark is a Prefect flow that loops over models (or structures), calls one or more modular tasks for each, and aggregates the results. The existing modular tasks you can compose are:

| Import name | Module | Description |
|---|---|---|
| OPT | mlip_arena.tasks.optimize | Structure optimization |
| EOS | mlip_arena.tasks.eos | Equation of state (energy-volume scan) |
| MD | mlip_arena.tasks.md | Molecular dynamics (NVE, NVT, NPT) |
| PHONON | mlip_arena.tasks.phonon | Phonon calculation via phonopy |
| NEB | mlip_arena.tasks.neb | Nudged elastic band |
| NEB_FROM_ENDPOINTS | mlip_arena.tasks.neb | NEB with automatic image interpolation |
| ELASTICITY | mlip_arena.tasks.elasticity | Elastic tensor calculation |
Please reuse, extend, or chain the general tasks listed above rather than reimplementing common operations. This keeps benchmarks consistent and results comparable across models.
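As an illustration of chaining, an optimization can feed a phonon calculation. This is a sketch only: the keyword arguments and the relaxed-structure result key are assumptions, not the exact task signatures.

```python
# Illustrative chaining of two modular tasks; signatures are assumptions.
from mlip_arena.tasks import OPT, PHONON


def relax_then_phonon(atoms, calculator):
    # Relax the structure first, then compute phonons on the relaxed
    # geometry. The "atoms" result key is an assumed convention.
    relaxed = OPT(atoms=atoms, calculator=calculator)
    return PHONON(atoms=relaxed["atoms"], calculator=calculator)
```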
Benchmark structure
Each benchmark lives in its own folder under `benchmarks/`:
`README.md` should describe: what physical property is being tested, what dataset or set of structures is used, how to run the benchmark, and what the expected outputs are.
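A typical layout might look like this (`my-benchmark` is a placeholder name):

```
benchmarks/
└── my-benchmark/      # placeholder benchmark name
    ├── README.md      # property, dataset, how to run, expected outputs
    └── benchmark.py   # Prefect flow entry point
```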
Creating a Prefect flow
Wrap your benchmark logic in a `@flow`-decorated function. Use `.submit()` on task calls to dispatch them concurrently to Prefect workers:
benchmark.py
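A minimal sketch of such a flow follows; the choice of `EOS` and the `calculator` keyword are assumptions about the exact task signatures, so check the task modules for the real API.

```python
# benchmark.py -- illustrative sketch; the EOS keyword arguments are
# assumptions about the task signature.
from prefect import flow

from mlip_arena.models import MLIPEnum
from mlip_arena.tasks import EOS


@flow
def run_benchmark(atoms):
    futures = {}
    for model in MLIPEnum:
        # .submit() dispatches each task run to a Prefect worker so that
        # models are evaluated concurrently.
        futures[model.name] = EOS.submit(atoms=atoms, calculator=model.name)
    # Resolve futures into concrete results before returning.
    return {name: future.result() for name, future in futures.items()}
```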
Step-by-step guide
Identify the physical property or failure mode
Define clearly what your benchmark measures and why existing benchmarks do not cover it. Open a GitHub Discussion to get early feedback before investing significant implementation effort.
Create the benchmark folder

Add a folder under `benchmarks/` with a `README.md` that describes the benchmark, the dataset, and how to run it.

Select and chain existing tasks

Import from `mlip_arena.tasks` rather than re-implementing operations; for example, a bulk modulus benchmark can reuse `EOS` directly. If your benchmark requires a new primitive operation not covered by existing tasks, implement it as a `@task`-decorated function in `mlip_arena/tasks/` and submit it for review separately.
Write the Prefect flow

Create `benchmarks/my-benchmark/benchmark.py` with a `@flow`-decorated entry point that loops over `MLIPEnum`, calls `.submit()` for each model, and returns a dictionary of results. See the example above. For HPC-scale runs with Prefect infrastructure, refer to the MD stability benchmark notebook as a practical example.

Store results on Hugging Face Dataset Hub
Benchmark results are stored in the `atomind/mlip-arena` Hugging Face dataset repository. Create a subfolder named after your benchmark task. Results uploaded to the Hub are automatically detected by the HF webhook and reflected on the live leaderboard.
Add to task registry
Register your benchmark in `mlip_arena/tasks/registry.yaml` so it appears in the leaderboard and the Streamlit app. The `category` field groups your benchmark with related tasks in the sidebar; current categories are Fundamentals and Molecular Dynamics.
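As an illustration, a registry entry might look like the following. Only `category` is documented here, so treat the rest as placeholders and copy the actual schema from existing entries in `registry.yaml`:

```yaml
# Hypothetical entry -- mirror an existing entry for the real schema;
# only `category` is described in this guide.
My Benchmark:
  category: Fundamentals   # or Molecular Dynamics
```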
Add visualization to the Streamlit app

The live leaderboard is a Streamlit app under `serve/`. Add a page for your benchmark in `serve/tasks/` following the pattern of existing pages (e.g., `serve/tasks/eos_bulk.py`). The page should:

- Load results from the HF dataset.
- Render a leaderboard table sorted by the primary metric.
- Include at least one informative plot (e.g., scatter, violin, or line chart).

Test the page locally with `streamlit run serve/app.py`, and include the new page file in your PR.
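A page skeleton satisfying that checklist might look like this sketch; the dataset path, file format, and column names are placeholders, not the real schema:

```python
# serve/tasks/my_benchmark.py -- illustrative sketch; the parquet path
# and column names below are placeholders, not the actual schema.
import pandas as pd
import streamlit as st

st.title("My Benchmark")

# Load results from the HF dataset (path is a placeholder).
df = pd.read_parquet(
    "hf://datasets/atomind/mlip-arena/my-benchmark/results.parquet"
)

# Leaderboard table sorted by the primary metric.
st.dataframe(df.sort_values("metric", ascending=False))

# At least one informative plot.
st.scatter_chart(df, x="model", y="metric")
```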
Write a test

Add a test file in `tests/` that verifies your benchmark runs end-to-end on a small subset of models and structures:

tests/test_my_benchmark.py
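The test might be sketched as follows; the flow name, its import path, and the choice of structure are assumptions to adapt to your benchmark:

```python
# tests/test_my_benchmark.py -- illustrative sketch. The flow import
# path is hypothetical; adapt it to how the benchmark module is exposed.
from ase.build import bulk

from benchmark import run_benchmark  # hypothetical import of the flow


def test_benchmark_runs_end_to_end():
    # One small, cheap structure keeps the end-to-end test fast.
    atoms = bulk("Cu", "fcc", a=3.6)
    results = run_benchmark(atoms)
    # Every model should produce some result entry.
    assert results
    assert all(value is not None for value in results.values())
```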
Example: a minimal benchmark
The following is a complete, minimal example that benchmarks bulk modulus via EOS for all models:

benchmarks/bulk-modulus/benchmark.py
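The listing is reconstructed here as a hedged sketch; the `EOS` call signature and the result key under which the fitted bulk modulus is returned are assumptions about the task API:

```python
# benchmarks/bulk-modulus/benchmark.py -- minimal sketch. The EOS task
# signature and the bulk-modulus result key are assumptions.
from ase.build import bulk
from prefect import flow

from mlip_arena.models import MLIPEnum
from mlip_arena.tasks import EOS


@flow
def bulk_modulus_benchmark():
    atoms = bulk("Cu", "fcc", a=3.6)  # one reference structure for brevity
    futures = {
        model.name: EOS.submit(
            atoms=atoms,
            calculator=model.name,  # assumed: tasks resolve a model by name
        )
        for model in MLIPEnum
    }
    # Resolve futures; "b0" (fitted bulk modulus) is an assumed result key.
    return {name: f.result().get("b0") for name, f in futures.items()}


if __name__ == "__main__":
    print(bulk_modulus_benchmark())
```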