Rules & Evaluation

How submissions are scored

Three complementary metrics combine into a single composite score, with a separate cross-machine generalization award and a harmonization quality gate.

The metrics

What we measure

R²ψ

Flux accuracy

Coefficient of determination of the predicted flux map over all grid points, timesteps, and test shots. The primary signal of reconstruction quality.
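
As a rough illustration of the flux metric, the sketch below computes a single pooled R² over every flux sample. Pooling all grid points, timesteps, and shots into one sum is an assumption about how "over all" is meant; the array layout and the names `r2_flux`, `psi_true`, and `psi_noisy` are hypothetical and not part of the official scorer.

```python
import numpy as np

def r2_flux(psi_true, psi_pred):
    """Pooled coefficient of determination over every flux sample."""
    t = np.asarray(psi_true, dtype=float).ravel()
    p = np.asarray(psi_pred, dtype=float).ravel()
    ss_res = np.sum((t - p) ** 2)          # residual sum of squares
    ss_tot = np.sum((t - t.mean()) ** 2)   # spread of the truth around its global mean
    return 1.0 - ss_res / ss_tot

# Hypothetical layout: (shots, timesteps, Z grid, R grid), filled with synthetic data
rng = np.random.default_rng(0)
psi_true = rng.normal(size=(3, 5, 8, 8))
psi_noisy = psi_true + 0.1 * rng.normal(size=psi_true.shape)

print(r2_flux(psi_true, psi_true))    # a perfect prediction scores 1.0
print(r2_flux(psi_true, psi_noisy))   # small noise lowers the score slightly
```

Predicting the global mean everywhere gives a score of 0, and worse predictions go negative, which is why the composite score clips R² at 0.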

R²ₛ

Scalar fidelity

Mean R² across the five scalar equilibrium parameters — βN, li, q95, Raxis, Zaxis.
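
A minimal sketch of the scalar metric, assuming each of the five parameters gets its own R² and the five values are averaged with equal weight. The dictionary keys in `SCALARS` and the helper names are hypothetical; the official scorer may name or order the parameters differently.

```python
import numpy as np

# Hypothetical key names for the five scalar equilibrium parameters
SCALARS = ("betaN", "li", "q95", "Raxis", "Zaxis")

def r2(y_true, y_pred):
    """Coefficient of determination for one scalar across all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_scalars(true, pred):
    """Unweighted mean of the per-parameter R² values."""
    return float(np.mean([r2(true[k], pred[k]) for k in SCALARS]))

# Synthetic example: four scalars predicted perfectly, q95 replaced by its mean
rng = np.random.default_rng(1)
true = {k: rng.normal(size=200) for k in SCALARS}
pred = dict(true)
pred["q95"] = np.full(200, true["q95"].mean())

score = r2_scalars(true, pred)
print(score)  # four perfect scalars and one with R² = 0 average to about 0.8
```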

D_LCFS

Boundary alignment

Symmetric Hausdorff distance between predicted and true last-closed-flux-surface contours, normalized by the true LCFS major radius. Smaller is better.
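
The boundary metric can be sketched for contours given as discrete (R, Z) points. Two assumptions are made here: the contours are compared point-to-point rather than as continuous curves, and the "true LCFS major radius" is taken as the midpoint of the true contour's inner and outer R extent. The function names are hypothetical.

```python
import numpy as np

def symmetric_hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets of shape (n, 2)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b))
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Largest nearest-neighbour distance, taken in both directions
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def d_lcfs(true_rz, pred_rz):
    """Hausdorff distance normalized by an assumed major radius of the true LCFS."""
    true_rz = np.asarray(true_rz, dtype=float)
    # Assumption: major radius = midpoint of the true contour's R extent
    r_major = 0.5 * (true_rz[:, 0].max() + true_rz[:, 0].min())
    return symmetric_hausdorff(true_rz, pred_rz) / r_major

# Synthetic circular boundary, with the prediction shifted outward by 0.17 m in R
theta = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
true_rz = np.column_stack([1.7 + 0.6 * np.cos(theta), 0.6 * np.sin(theta)])
pred_rz = true_rz + np.array([0.17, 0.0])

print(d_lcfs(true_rz, pred_rz))  # close to 0.17 / 1.7 = 0.1
```

The pairwise distance matrix is quadratic in the number of contour points, which is fine for a few hundred points per contour.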

The scores that win awards

Composite & generalization

Award #1 · intra-machine

Composite score S_model

S_model = 0.6·R²ψ + 0.25·R²ₛ + 0.15·(1 − D_LCFS)

A weighted blend of the three metrics. The R² terms are clipped below at 0 and D_LCFS is clipped above at 1, so S_model always lies in [0, 1]. The highest S_model on the hidden DIII-D test set wins Award #1.
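
The weighting and clipping can be written out directly. This is a sketch of the stated formula only; the function name `composite_score` is hypothetical.

```python
def composite_score(r2_psi, r2_s, d_lcfs):
    """Composite score: 0.6*R2_psi + 0.25*R2_s + 0.15*(1 - D_LCFS), with clipping."""
    r2_psi = max(r2_psi, 0.0)    # R² terms are clipped at 0
    r2_s = max(r2_s, 0.0)
    d_lcfs = min(d_lcfs, 1.0)    # boundary distance is clipped at 1
    return 0.6 * r2_psi + 0.25 * r2_s + 0.15 * (1.0 - d_lcfs)

print(composite_score(0.9, 0.8, 0.05))   # 0.54 + 0.20 + 0.1425, about 0.8825
print(composite_score(-3.0, -1.0, 5.0))  # every term clips, giving 0.0
```

Because of the clipping, a very poor prediction bottoms out at 0 rather than dragging the score negative.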

Award #2 · cross-machine

Generalization ratio G_ratio

G_ratio = S_model(MAST) / S_model(DIII-D)

The fraction of DIII-D performance retained under zero-shot transfer to MAST. Values near 1 indicate near-complete transfer. Admissibility gate: R²ψ > 0.6 on DIII-D.
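
A small sketch of the ratio with its admissibility gate. Returning `None` for inadmissible entries is an illustrative choice, and the function and argument names are hypothetical.

```python
def generalization_ratio(s_mast, s_d3d, r2_psi_d3d, gate=0.6):
    """Return S_model(MAST) / S_model(DIII-D), or None if the gate is not met."""
    # The gate is strict: R2_psi on DIII-D must exceed 0.6
    if r2_psi_d3d <= gate:
        return None
    # An admissible entry has S_model(DIII-D) > 0.6 * 0.6 = 0.36, so no division by zero
    return s_mast / s_d3d

print(generalization_ratio(0.66, 0.88, r2_psi_d3d=0.90))  # about 0.75
print(generalization_ratio(0.50, 0.50, r2_psi_d3d=0.60))  # None: gate not exceeded
```

The gate keeps a model that performs poorly on both machines from posting a ratio near 1.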

Figure: the three metrics (flux accuracy, scalar fidelity, and LCFS alignment) feed the composite score; cross-machine transfer is scored separately.

Before you can win

Quality gate & submission format

Harmonization quality gate

Submissions are graded on completeness and reproducibility against a pass threshold of τ_H = 0.90: required variables must be present, time-aligned, and unit-consistent, and artifacts must regenerate identically under a clean rebuild (SHA-256 match). Failing entries stay on the leaderboard but are ineligible for awards.
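
One way to check the SHA-256 requirement locally before submitting is to hash every artifact from two independent builds and compare the digests. This is a sketch under assumptions: the directory layout, the file name `layer.bin`, and the helper names are hypothetical, and the organizers' own check may differ.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def digests(root):
    """Map each file's path relative to root onto its SHA-256 digest."""
    root = Path(root)
    return {p.relative_to(root): sha256_of_file(p)
            for p in root.rglob("*") if p.is_file()}

def rebuild_matches(original_dir, rebuilt_dir):
    """True only if both builds contain the same files with identical bytes."""
    return digests(original_dir) == digests(rebuilt_dir)

# Demo with two throwaway directories standing in for two clean builds
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for d in (a, b):
        Path(d, "layer.bin").write_bytes(b"abc")
    identical = rebuild_matches(a, b)            # same bytes in both builds
    digest = sha256_of_file(Path(a, "layer.bin"))
    Path(b, "layer.bin").write_bytes(b"abd")     # a single changed byte
    changed = rebuild_matches(a, b)

print(identical, changed)
```

Common sources of mismatch include embedded timestamps, unordered iteration, and unseeded random number generators in the build.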

Submission format

Predictions as .npz or NetCDF4 indexed by record ID and EFIT timestamp, plus a manifest naming the harmonization layer. Scoring runs CPU-only in under five minutes per pass.
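
For orientation only, here is one way an .npz prediction file and a JSON manifest could be written with NumPy. Every field name (`record_id`, `efit_time_ms`, `psi`, `harmonization_layer`), the array shapes, and the file names are hypothetical placeholders; the authoritative schema is whatever the organizers publish.

```python
import json
import os
import tempfile
import numpy as np

n_samples, nz, nr = 4, 8, 8   # hypothetical sample count and grid size

arrays = {
    "record_id": np.array(["rec-0001", "rec-0001", "rec-0002", "rec-0002"]),
    "efit_time_ms": np.array([1000.0, 1020.0, 1000.0, 1020.0]),
    "psi": np.zeros((n_samples, nz, nr), dtype=np.float32),  # predicted flux maps
}
# Hypothetical manifest that names the harmonization layer used
manifest = {"harmonization_layer": "example-layer-v1", "predictions": "predictions.npz"}

with tempfile.TemporaryDirectory() as d:
    npz_path = os.path.join(d, "predictions.npz")
    np.savez_compressed(npz_path, **arrays)
    with open(os.path.join(d, "manifest.json"), "w") as f:
        json.dump(manifest, f, indent=2)

    # Read the file back to confirm the keys and shapes round-trip
    with np.load(npz_path) as loaded:
        keys = sorted(loaded.files)
        psi_shape = loaded["psi"].shape

print(keys, psi_shape)
```
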

Two phases

Development, then a blind final

Phase 1 — development

~3 months
A public leaderboard scores 50% of the held-out shots with bootstrap uncertainty estimates, so you can iterate.
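
A percentile bootstrap over shots is one common way to attach an uncertainty estimate to a leaderboard score. The sketch below assumes a per-shot score that is averaged, which may not match the leaderboard's exact procedure; `bootstrap_ci` and the synthetic scores are hypothetical.

```python
import numpy as np

def bootstrap_ci(per_shot_scores, n_boot=1000, alpha=0.05, seed=0):
    """Mean score with a percentile bootstrap interval, resampling whole shots."""
    x = np.asarray(per_shot_scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one resample of shot indices drawn with replacement
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    means = x[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(x.mean()), float(lo), float(hi)

# Synthetic per-shot scores for 40 hypothetical test shots
rng = np.random.default_rng(2)
per_shot = rng.uniform(0.6, 0.95, size=40)
mean, lo, hi = bootstrap_ci(per_shot)
print(f"{mean:.3f} [{lo:.3f}, {hi:.3f}]")
```

Resampling whole shots rather than individual timesteps keeps correlated samples from the same shot together, consistent with the shot-level split rule.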

Phase 2 — final

Blind
The remaining, blind half of the test set is scored privately to determine the winners.

The fine print

Rules & eligibility

  1. Each team registers on Codabench with a single valid contact email; aliases that already exist as solo participants are deactivated.
  2. External public datasets (other tokamak archives, OMFIT-produced equilibrium tables) and publicly available pre-trained vision or scientific foundation models are permitted, with explicit disclosure in the methods report.
  3. No restriction is placed on programming language or framework.
  4. All train/test splits must respect shot-level boundaries.
  5. Harmonization layers must regenerate deterministically under a clean rebuild (SHA-256 hash match).
  6. Top three teams in each award category must release source code under an OSI-approved licence and submit a 1–2 page methods report before prizes are paid.
  7. Organizing-team members with access to the hidden ground truth are excluded from prize eligibility.
  8. All participants follow the NeurIPS Code of Conduct.

Integrity

Preventing overfitting & leakage

Submission caps

Limited to 5 per day and 100 in total per team to discourage leaderboard probing.

Shot-level splits

All train/test splits are strictly shot-level — never by timestep.
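
A shot-level split can be sketched by assigning whole shots, rather than individual timesteps, to one side or the other. The shot numbers, the 20% test fraction, and the function name below are hypothetical.

```python
import numpy as np

def shot_level_split(shot_ids, test_fraction=0.2, seed=0):
    """Return boolean (train_mask, test_mask) so that no shot appears in both."""
    shot_ids = np.asarray(shot_ids)
    rng = np.random.default_rng(seed)
    shots = rng.permutation(np.unique(shot_ids))   # shuffle shots, not samples
    n_test = max(1, int(round(test_fraction * shots.size)))
    test_mask = np.isin(shot_ids, shots[:n_test])
    return ~test_mask, test_mask

# Five hypothetical shots with ten timesteps each
shot_ids = np.repeat([170001, 170002, 170003, 170004, 170005], 10)
train_mask, test_mask = shot_level_split(shot_ids)

overlap = set(shot_ids[train_mask]) & set(shot_ids[test_mask])
print(len(overlap), int(test_mask.sum()))  # 0 shared shots, 10 test samples
```

Splitting by timestep instead would put near-identical neighbouring equilibria from the same shot on both sides and inflate the scores.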

Deterministic rebuilds

Harmonization layers must regenerate identically (SHA-256 match).

Hidden ground truth

Private-fold EFIT reconstructions are never released; leakage or memorization is disqualifying.