Data

The dataset

The first open-access, harmonized multi-machine benchmark for fusion equilibrium — curated for Thomson-diagnostic availability, feature completeness, and EFIT-reconstruction quality.

The challenge releases 9,113 DIII-D shots and 2,416 MAST shots (~11,500 total, ~35 GB). Each shot is filtered so that the inputs and EFIT ground truth a model needs are present and of reconstruction quality. The full corpus is hosted on Hugging Face, and the datasets library supports streaming it.

Every shot is one row of a Parquet file. Each time series and profile is a nested array inside that single row — so df['efit_psirz'].iloc[0] is a list of 2D flux grids, not a column of scalars. Always take .iloc[0], then index into the nested array.
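As a minimal sketch of that access pattern, the snippet below builds a stand-in one-row DataFrame instead of reading a real shot file. Only the efit_psirz column name comes from the schema described above; the frame count and the to_dense helper are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def to_dense(nested):
    """Recursively stack nested lists or object arrays into one float ndarray."""
    if isinstance(nested, np.ndarray) and nested.dtype != object:
        return nested.astype(float)
    if isinstance(nested, (list, tuple, np.ndarray)):
        return np.stack([to_dense(item) for item in nested])
    return np.float64(nested)  # leaf scalar


# Stand-in for a shot loaded from Parquet: one row, one nested cell
# holding 3 made-up frames of 65 x 65.
toy_frames = np.zeros((3, 65, 65)).tolist()
df = pd.DataFrame({"efit_psirz": [toy_frames]})

cell = df["efit_psirz"].iloc[0]   # the single row's nested value
psi = to_dense(cell)              # shape: (n_frames, 65, 65)
first_frame = psi[0]              # one 2D flux grid
```

The recursive helper is there because nested Parquet columns may come back as lists or as object arrays of arrays, depending on the reader.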

Six sample shots (3 DIII-D, 3 MAST) ship in the starter kit so you can explore the schema immediately, before the full release.

Data pipeline: raw multi-rate diagnostics harmonized into one-row-per-shot Parquet records.
Two machines

What differs between DIII-D and MAST

The differences below are load-bearing for any code touching the data.

Property | DIII-D (conventional) | MAST (spherical)
Type | Conventional tokamak (D-shaped) | Spherical tokamak (low aspect ratio)
Facility | DIII-D · General Atomics, San Diego | MAST · UKAEA Culham, UK
Shots (challenge set) | 9,113 | 2,416
Target flux grid | 65 × 65 | 65 × 129 (~50% NaN, central column)
Shaping coils | 18 F-coils (F1A–F9B) + ECOILA, bcoil | 10 P-coils (P2L–P6U) + sol, tf, efps
Magnetics time base | Per-signal (~49k samples) | Single shared (~15k samples)
Typical flux value | ≈ −0.25 V·s/rad | ≈ +0.05 V·s/rad
What’s inside a shot

Signals & targets

Target — EFIT ψ(R,Z)

Predict this
efit_psirz: a sequence of poloidal flux maps (one per efit_times slice) — the ground truth your model reconstructs, kept exactly as EFIT produced it.

Magnetics — coil currents

Input
F-coils / P-coils, ohmic solenoid, toroidal-field coil, plasma current, and the dsep X-point gap. Sampled at tens of kHz on per-signal (DIII-D) or shared (MAST) time bases.

Thomson scattering

Input
Core (vertical) and edge (horizontal) electron temperature Te (eV) and density ne (m⁻³) profiles, on their own time bases.
Read this first

Modeling rules the data treats as load-bearing

Align time bases — but only the inputs

Magnetics (~49k samples on DIII-D, ~15k on MAST), EFIT targets (~300 and ~80 frames, respectively), and Thomson all live on different clocks. Resample inputs onto EFIT times; never interpolate the targets.
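One way to follow this rule is plain linear interpolation of each input onto the EFIT clock. The sketch below uses np.interp on made-up signals; the function name, the toy sample counts, and the choice of linear interpolation are assumptions, not the challenge's prescribed method.

```python
import numpy as np


def resample_to_efit(signal_times, signal_values, efit_times):
    """Linearly interpolate one input signal onto the EFIT time base.

    np.interp holds the edge value for EFIT times outside the signal's range.
    """
    signal_times = np.asarray(signal_times, dtype=float)
    signal_values = np.asarray(signal_values, dtype=float)
    order = np.argsort(signal_times)  # np.interp needs increasing sample times
    return np.interp(efit_times, signal_times[order], signal_values[order])


# Toy data: a fast input clock and a slow EFIT clock (values are made up).
mag_t = np.linspace(0.0, 5.0, 49_000)
mag_v = 2.0 * mag_t + 1.0
efit_t = np.linspace(0.5, 4.5, 300)

mag_on_efit = resample_to_efit(mag_t, mag_v, efit_t)  # one value per EFIT frame
```

The targets are never touched: only the input arrays pass through the resampler, and the output has exactly one value per EFIT frame.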

Split by shot, never by timestep

Timesteps within a shot are highly correlated. Splitting by timestep leaks the test set; always split at the shot level.
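A shot-level split can be sketched as follows. The helper name, the test fraction, and the placeholder shot IDs are assumptions for illustration.

```python
import numpy as np


def split_shots(shot_ids, test_fraction=0.2, seed=0):
    """Shuffle unique shot IDs and return (train_ids, test_ids) with no overlap."""
    unique_ids = np.array(sorted(set(shot_ids)))
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, int(round(test_fraction * len(unique_ids))))
    return set(unique_ids[n_test:].tolist()), set(unique_ids[:n_test].tolist())


# Each training sample remembers the shot it came from (placeholder IDs).
sample_shot_ids = (
    ["DIII-D_1"] * 4 + ["DIII-D_2"] * 3
    + ["MAST_1"] * 5 + ["MAST_2"] * 2 + ["MAST_3"] * 6
)
train_ids, test_ids = split_shots(sample_shot_ids, test_fraction=0.4)

# Row indices follow the shot assignment, so no shot straddles the split.
train_rows = [i for i, s in enumerate(sample_shot_ids) if s in train_ids]
test_rows = [i for i, s in enumerate(sample_shot_ids) if s in test_ids]
```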

Normalize your inputs

Signal scales span ~10⁴ A coil currents to ~10⁶ A plasma current. Normalize before training.
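A common choice here is per-channel standardization with statistics taken from the training shots only. The sketch below assumes that approach; the function names and the toy magnitudes are illustrative.

```python
import numpy as np


def fit_scaler(train_inputs):
    """Per-channel mean and standard deviation from training rows only."""
    mean = train_inputs.mean(axis=0)
    std = train_inputs.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # avoid dividing by zero on constant channels
    return mean, std


def apply_scaler(inputs, mean, std):
    """Standardize inputs with previously fitted statistics."""
    return (inputs - mean) / std


# Toy inputs: column 0 ~ coil current (1e4 A), column 1 ~ plasma current (1e6 A).
rng = np.random.default_rng(0)
train = np.column_stack([rng.normal(1e4, 2e3, 500), rng.normal(1e6, 1e5, 500)])

mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)  # reuse mean/std for validation and test
```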

Compress the target

The 4,225-pixel (65 × 65) DIII-D flux map compresses to ~20–50 PCA coefficients, with component 1 alone carrying ≈ 92% of the variance. This is the recommended baseline for strong generalization.
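The compression step can be sketched with an SVD-based PCA in NumPy. The frames below are synthetic rank-2 maps, not EFIT output, and the function names are assumptions; on real data the number of components would be chosen from the explained-variance curve.

```python
import numpy as np


def fit_pca(frames, n_components):
    """Fit PCA on flattened flux maps; returns (mean, components, variance ratios)."""
    flat = frames.reshape(len(frames), -1)
    mean = flat.mean(axis=0)
    _, singular, vt = np.linalg.svd(flat - mean, full_matrices=False)
    variance_ratio = singular**2 / np.sum(singular**2)
    return mean, vt[:n_components], variance_ratio[:n_components]


def encode(frames, mean, components):
    """Project flux maps onto the principal components."""
    return (frames.reshape(len(frames), -1) - mean) @ components.T


def decode(coeffs, mean, components, grid_shape):
    """Rebuild flux maps from PCA coefficients."""
    return (coeffs @ components + mean).reshape(-1, *grid_shape)


# Synthetic rank-2 "flux maps" on a 65 x 65 grid (not real EFIT data).
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 65, 65))
weights = rng.normal(size=(40, 2))
frames = np.tensordot(weights, basis, axes=1) - 0.25

mean, components, ratio = fit_pca(frames, n_components=2)
coeffs = encode(frames, mean, components)              # (40, 2) instead of (40, 4225)
restored = decode(coeffs, mean, components, (65, 65))
```

A model then predicts the coefficient vector, and decode maps its output back to a full grid. Fitting the PCA on training shots only keeps the held-out shots out of the basis.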
MAST’s NaN region is expected. A spherical tokamak has no plasma in its narrow central column, so roughly half of the 65×129 MAST flux grid is NaN by design — not missing data.
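Because NaN pixels break most losses and any PCA fit, one option is to read a valid-pixel mask from the data and work only on those pixels. The sketch below assumes a (frames, 65, 129) array layout and a toy NaN pattern; the real orientation and NaN footprint should be taken from the shots themselves.

```python
import numpy as np


def valid_pixel_mask(frames):
    """True where a pixel is finite in every frame of the stack."""
    return np.isfinite(frames).all(axis=0)


def flatten_valid(frames, mask):
    """Keep only valid pixels: (n_frames, H, W) -> (n_frames, n_valid)."""
    return frames[:, mask]


def restore_grid(flat, mask):
    """Scatter valid pixels back onto the grid, refilling the rest with NaN."""
    grid = np.full((flat.shape[0], *mask.shape), np.nan)
    grid[:, mask] = flat
    return grid


# Toy stack: 4 frames with roughly half the columns set to NaN.
# The mask is derived from the data, not hard-coded.
frames = np.full((4, 65, 129), 0.05)
frames[:, :, :64] = np.nan

mask = valid_pixel_mask(frames)
flat = flatten_valid(frames, mask)      # NaN-free, ready for a loss or PCA
round_trip = restore_grid(flat, mask)
```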
Access & license

Open by design

License

Dataset released under CC BY 4.0 for open research.

Format

One-row-per-shot Parquet with nested arrays; streamable via Hugging Face datasets.
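Streaming a few shots might look like the sketch below. The repository ID is deliberately left as a placeholder because it is not given here, and the split name "train" is an assumption; check the dataset page for the real values.

```python
from itertools import islice


def stream_shots(repo_id, n_shots=3, split="train"):
    """Yield the first n_shots records without downloading the full corpus."""
    from datasets import load_dataset  # pip install datasets

    shots = load_dataset(repo_id, split=split, streaming=True)
    yield from islice(shots, n_shots)


# Usage (needs network access and the real repository ID):
# for shot in stream_shots("<org>/<dataset-name>"):
#     print(sorted(shot.keys()))
```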

Identifiers

Source-prefixed record IDs — DIII-D_182494, MAST_25607.
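Since the machine name itself contains a hyphen, splitting on the last underscore is a safe way to recover both parts. The helper below is a hypothetical convenience, based only on the two ID examples above.

```python
def parse_record_id(record_id):
    """Split 'MACHINE_SHOT' on the last underscore into (machine, shot_number)."""
    machine, _, shot = record_id.rpartition("_")
    if not machine or not shot.isdigit():
        raise ValueError(f"unexpected record ID: {record_id!r}")
    return machine, int(shot)


diii_d = parse_record_id("DIII-D_182494")
mast = parse_record_id("MAST_25607")
```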