Experiment Engine#

ExperimentEngine is a two-phase orchestrator for structured rolling-origin ablation experiments. A single reusable entry point covers backbone HPO, evaluation of every ablation condition across datasets and CV folds, result aggregation, and MLflow tracking. Experiment scripts define a declarative ExperimentSpec and pass it to ExperimentEngine to run.

Architecture#

graph TD
    A[ExperimentSpec] --> B[ExperimentEngine.run]
    B --> C{Phase 1: Backbone HPO}
    C --> D[run_backbone_hpo per dataset]
    D --> E[Optuna study - SQLite]
    E --> F[save best_params.json]
    B --> G{Phase 2: Ablation}
    G --> H[For each condition]
    H --> I[For each dataset]
    I --> J[For each CV fold]
    J --> K[load_backbone_params]
    K --> L[Apply condition overrides]
    L --> M[fit + evaluate]
    M --> N[log to MLflow]
    G --> O[aggregate per group]
    O --> P[CSV full + summary]

Phase 1 - Backbone HPO searches the backbone model’s hyperparameter space using Optuna and saves the best parameters to disk. This phase runs once per dataset and is skipped automatically on subsequent runs when the cached result exists.

Phase 2 - Ablation evaluates each condition in ExperimentSpec.conditions across all datasets and CV folds using the backbone parameters from Phase 1 as the starting point. Each condition can selectively override any config field. Results are aggregated across folds and saved as CSV files.
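The two phases can be pictured as a cached search followed by three nested loops. The sketch below is a toy illustration of that control flow, not the engine's code: `tune_backbone`, `fit_and_evaluate`, and the in-memory `cache` dict are hypothetical stand-ins for the Optuna search, model training, and the on-disk cache.

```python
from itertools import product


def tune_backbone(dataset: str) -> dict:
    """Hypothetical stand-in for the Phase 1 Optuna search."""
    return {"hidden_size": 64, "use_gate": True}


def fit_and_evaluate(params: dict, dataset: str, fold: int) -> float:
    """Hypothetical stand-in for fit + evaluate; returns a dummy score."""
    penalty = 0.0 if params.get("use_gate", True) else 0.05
    return round(0.1 * (fold + 1) + penalty, 3)


def run_two_phase(conditions: dict, datasets: list, n_folds: int, cache: dict) -> list:
    # Phase 1: tune once per dataset; datasets already in the cache are skipped.
    for dataset in datasets:
        if dataset not in cache:
            cache[dataset] = tune_backbone(dataset)

    # Phase 2: every condition x dataset x fold starts from the cached params.
    rows = []
    for (name, overrides), dataset, fold in product(
        conditions.items(), datasets, range(n_folds)
    ):
        params = {**cache[dataset], **overrides}  # condition overrides win
        score = fit_and_evaluate(params, dataset, fold)
        rows.append({"condition": name, "dataset": dataset, "fold": fold, "score": score})
    return rows


rows = run_two_phase(
    conditions={"baseline": {}, "no_gate": {"use_gate": False}},
    datasets=["ETTh1", "ETTm1"],
    n_folds=3,
    cache={},
)
print(len(rows))  # 2 conditions x 2 datasets x 3 folds = 12
```

Passing a pre-filled `cache` on a second call skips the tuning step, which mirrors how Phase 1 is skipped when a cached result exists.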

Defining an Experiment#

ExperimentSpec#

ExperimentSpec declares the full experiment: which backbone to tune, which datasets to run on, and which conditions to compare.

from twiga.core.config import DataPipelineConfig
from twiga.experiment import ExperimentSpec, Condition

spec = ExperimentSpec(
    name="ganf_ablation",
    backbone_cls=MLPGAFConfig,
    datasets={"ETTh1": (train_df, val_df, test_df), "ETTm1": (...)},
    pipeline_cfg=DataPipelineConfig(...),
    conditions=[
        Condition(name="baseline",    overrides={}),
        Condition(name="no_gate",     overrides={"use_gate": False}),
        Condition(name="3group",      overrides={"group_scheme": "3group"}),
        Condition(name="input_gate",  overrides={"gate_position": "input"}),
    ],
    results_dir="results/ganf_ablation",
    hpo_trials=30,
)

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Experiment name. Used as the MLflow experiment name and file prefix |
| `backbone_cls` | `type` | Config class for the backbone model (e.g. `MLPGAFConfig`) |
| `datasets` | `dict[str, tuple]` | Maps dataset name to `(train_df, val_df, test_df)` |
| `pipeline_cfg` | `DataPipelineConfig` | Shared data pipeline configuration |
| `conditions` | `list[Condition]` | Ablation conditions to evaluate |
| `results_dir` | `str \| Path` | Root directory for CSV output and HPO caches |
| `hpo_trials` | `int` | Number of Optuna trials for backbone HPO |
| `cv_params` | `ExperimentConfig \| None` | Cross-validation split settings. Defaults to single train/val/test |

Condition#

Each Condition names one ablation arm and specifies field overrides applied on top of the backbone config.

Condition(
    name="latent_gate_2g",
    overrides={
        "group_scheme": "2group",
        "gate_position": "latent",
        "fusion_type": "gated",
    },
    hpo_variant=None,   # shares backbone HPO cache with other 2group conditions
)

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Unique condition identifier. Appears in result DataFrames and MLflow run names |
| `overrides` | `dict[str, Any]` | Config fields to override. Applied after backbone HPO params are loaded |
| `hpo_variant` | `str \| None` | When set, selects a separate HPO cache key. Use when an override changes architecture in a way that invalidates backbone params (e.g. a different `group_scheme`) |

When to set hpo_variant

Two conditions that differ only in a regularisation weight can share backbone HPO results. Two conditions that differ in group_scheme or gate_position produce different network shapes and should use separate HPO caches. Set hpo_variant to a short descriptive string (e.g. "3group") to create an isolated cache for that arm.
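The effect of hpo_variant on the cache can be sketched as follows. The layout is taken from the run_backbone_hpo entry in the API reference (`<hpo_cache_dir>/<dataset_key>/<model_name>_best_params.json`, with the variant appended to the model name). The helper name `hpo_cache_path` is hypothetical and not part of twiga.

```python
from pathlib import Path


def hpo_cache_path(
    hpo_cache_dir: Path, dataset_key: str, model_name: str, hpo_variant: str = ""
) -> Path:
    """Hypothetical helper: build the cache file path for one backbone variant."""
    # A non-empty variant is appended to the model name, so each variant
    # reads and writes its own file.
    stem = f"{model_name}_{hpo_variant}" if hpo_variant else model_name
    return hpo_cache_dir / dataset_key / f"{stem}_best_params.json"


root = Path("results/ganf_ablation/backbone_hpo")
shared = hpo_cache_path(root, "ETTh1", "mlpgaf")
isolated = hpo_cache_path(root, "ETTh1", "mlpgaf", hpo_variant="3group")
print(shared.name)    # mlpgaf_best_params.json
print(isolated.name)  # mlpgaf_3group_best_params.json
```

Conditions that leave hpo_variant unset resolve to the same file and therefore share one set of tuned params.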

Running an Experiment#

From Python#

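from twiga.experiment import ExperimentEngine

engine = ExperimentEngine(spec)   # the spec is bound at construction
summary = engine.run(base_cfg)    # base_cfg: the root PipelineConfig for the run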

run() executes both phases in order and writes results under spec.results_dir.

From the command line#

ExperimentEngine.cli_main provides a minimal CLI entry point. Experiment scripts expose it directly:

python experiment/ganf_ablation.py --dataset ETTh1 ETTm1 --num-trials 50

The script calls:

if __name__ == "__main__":
    ExperimentEngine(spec).cli_main(base_cfg=PipelineConfig(...))

Config Precedence#

When ExperimentEngine builds the model config for a condition it merges four layers in order (lowest to highest priority):

1. Config class defaults
2. ExperimentSpec-level overrides (applied to every condition)
3. Backbone HPO best parameters loaded from cache
4. Condition.overrides

This means a condition override always wins over HPO-tuned params, which in turn always win over spec-level settings.
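The layering amounts to successive dictionary updates on top of the class defaults. The sketch below is a minimal illustration under that assumption; `ToyConfig` and its fields are hypothetical, and the real engine may merge differently in detail.

```python
from dataclasses import dataclass, fields, replace


@dataclass
class ToyConfig:
    """Hypothetical model config; the field names are illustrative only."""
    hidden_size: int = 32
    dropout: float = 0.0
    use_gate: bool = True
    use_revin: bool = True


def build_config(spec_overrides: dict, hpo_params: dict, condition_overrides: dict) -> ToyConfig:
    merged = {}
    # Layers are applied from lowest to highest priority; later ones overwrite.
    for layer in (spec_overrides, hpo_params, condition_overrides):
        merged.update(layer)
    unknown = set(merged) - {f.name for f in fields(ToyConfig)}
    if unknown:
        raise KeyError(f"unknown config fields: {sorted(unknown)}")
    # Class defaults fill every field that no layer touched.
    return replace(ToyConfig(), **merged)


cfg = build_config(
    spec_overrides={"use_revin": False, "dropout": 0.2},
    hpo_params={"hidden_size": 128, "dropout": 0.1, "use_gate": True},
    condition_overrides={"use_gate": False},
)
print(cfg)
# ToyConfig(hidden_size=128, dropout=0.1, use_gate=False, use_revin=False)
```

Here `dropout` ends at the HPO value (0.1) because HPO params outrank the spec-level 0.2, while `use_gate` ends at the condition's value because condition overrides outrank everything.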

HPO Cache#

Backbone HPO results are cached as JSON files at <hpo_cache_dir>/<dataset_key>/<model_name>_best_params.json, where hpo_cache_dir is typically <experiment_root>/backbone_hpo. A non-empty hpo_variant is appended to the model name (e.g. mlpgaf_3group_best_params.json), so that arm gets its own cache file. Cached params are shared across runs and never recomputed; delete the file to force a fresh search.

Load cached params directly with:

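from pathlib import Path

from twiga.experiment import load_backbone_params

params = load_backbone_params(
    dataset_key="ETTh1",
    # typically <experiment_root>/backbone_hpo
    hpo_cache_dir=Path("results/ganf_ablation/backbone_hpo"),
    model_name_str="mlpgaf",
    controlled_fields=frozenset(),  # keys to strip from the loaded params
)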

MLflow Tracking#

Set MLFLOW_TRACKING_URI before running to enable automatic experiment tracking:

export MLFLOW_TRACKING_URI=http://localhost:5000
python experiment/ganf_ablation.py

The run hierarchy mirrors the experiment structure:

Experiment: <spec.name>
  Parent run: <dataset>
    Child run: hpo/<dataset>/<model>      run_backbone_hpo()
    Child run: <condition>/<dataset>/fold_N

All condition overrides, fold metrics, and HPO parameters are logged as MLflow params and metrics automatically. No additional instrumentation is needed in the experiment script.
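To make the naming scheme concrete, the sketch below lists the run names one dataset would produce. It only builds strings and does not call MLflow; the helper `run_names` is hypothetical, and numbering folds from 0 is an assumption.

```python
def run_names(experiment: str, dataset: str, model: str, conditions: list, n_folds: int) -> list:
    """Hypothetical helper: list run names for one dataset, indented by nesting depth."""
    lines = [
        f"Experiment: {experiment}",
        f"  Parent run: {dataset}",
        f"    Child run: hpo/{dataset}/{model}",
    ]
    for condition in conditions:
        for fold in range(n_folds):  # assumption: folds are numbered from 0
            lines.append(f"    Child run: {condition}/{dataset}/fold_{fold}")
    return lines


names = run_names("ganf_ablation", "ETTh1", "mlpgaf", ["baseline", "no_gate"], n_folds=2)
print("\n".join(names))
```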

detect_tracking_uri is a safe helper that returns the configured URI or None if MLflow is absent or unconfigured:

from twiga.experiment import detect_tracking_uri

uri = detect_tracking_uri()  # None if not configured
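A helper with this behaviour could look roughly like the sketch below. It is an illustration, not twiga's implementation: the function name `detect_uri_sketch` is hypothetical, and checking MLFLOW_TRACKING_URI before TWIGA_MLFLOW_TRACKING_URI is an assumed order.

```python
import importlib.util
import os


def detect_uri_sketch(env=None):
    """Illustrative stand-in for detect_tracking_uri; not twiga's code."""
    env = os.environ if env is None else env
    if importlib.util.find_spec("mlflow") is None:
        return None  # MLflow is not installed
    # Assumption: the standard MLflow variable is checked before the twiga one.
    for key in ("MLFLOW_TRACKING_URI", "TWIGA_MLFLOW_TRACKING_URI"):
        uri = env.get(key, "").strip()
        if uri:
            return uri
    return None  # installed but unconfigured


print(detect_uri_sketch(env={}))  # None
```

Returning None in both failure cases lets callers treat "absent" and "unconfigured" the same way and simply skip tracking.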

Aggregating Results#

aggregate computes mean and standard deviation per (group, condition, metric_type) across folds and writes two CSV files: a full long-form table and a summary table.

from twiga.experiment import aggregate
from pathlib import Path

summary = aggregate(
    combined=full_df,
    prefix="ganf_ablation",
    root=Path("results/ganf_ablation"),
    suffix="_val",
    reference_conditions={"ETTh1": "baseline", "ETTm1": "baseline"},
)

ExperimentEngine.run calls aggregate automatically at the end of Phase 2. Call it manually only when post-processing results outside an engine run.
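The core of the aggregation is a group-by over folds. The pure-Python sketch below illustrates the idea for a single metric; it omits metric_type and CSV writing, uses the sample standard deviation, and the `delta_<metric>` key name is hypothetical rather than the column name twiga writes.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise(rows: list, metric: str, reference_conditions: dict) -> dict:
    """Toy aggregation: mean/std per (group, condition) plus delta vs the reference."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["group"], row["condition"])].append(row[metric])

    summary = {}
    for key, values in buckets.items():
        summary[key] = {
            metric: mean(values),
            f"{metric}_std": stdev(values) if len(values) > 1 else 0.0,
            "n_runs": len(values),
        }
    # Second pass: difference between each condition and its group's reference.
    for (group, _condition), stats in summary.items():
        ref = reference_conditions.get(group)
        if ref is not None and (group, ref) in summary:
            stats[f"delta_{metric}"] = stats[metric] - summary[(group, ref)][metric]
    return summary


rows = [
    {"group": "gating", "condition": "baseline", "fold": 0, "mae": 0.50},
    {"group": "gating", "condition": "baseline", "fold": 1, "mae": 0.54},
    {"group": "gating", "condition": "no_gate", "fold": 0, "mae": 0.60},
    {"group": "gating", "condition": "no_gate", "fold": 1, "mae": 0.64},
]
summary = summarise(rows, "mae", {"gating": "baseline"})
for key, stats in summary.items():
    print(key, stats)
```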

Visualising Results#

Use twiga.core.plot.evaluation to plot aggregated results. See Visualization for the full plotting API.

from pathlib import Path
from twiga.core.plot import load_ablation, plot_summary_heatmap, save_ablation_plots

result = load_ablation(
    root=Path("results/ganf_ablation"),
    prefix="ganf_ablation",
    suffix="_val",
)

p = plot_summary_heatmap(result.summary, metric="mae", reference="baseline")
p

save_ablation_plots(result, output_dir=Path("figures/ganf_ablation"), reference="baseline")

API Reference#

class twiga.experiment.ExperimentEngine(spec)#

Bases: object

Runs an ExperimentSpec end to end.

Usage:

engine = ExperimentEngine(SPEC)
engine.cli_main(base_cfg=PipelineConfig(...))

Or programmatically:

summary = engine.run(base_cfg, groups=["gating"], dataset_keys=["MLVS-PT"])

cli_main(base_cfg, argv=None)#

Parse CLI args then call run().

Recognised flags: --group, --dataset, --skip-hpo, --tracking-uri, --epochs, --num-trials, --folds.

Parameters:

base_cfg (PipelineConfig) – Root pipeline config (output_dir, epochs, etc.).
argv (list[str] | None) – Argument list; defaults to sys.argv[1:].

Return type:

None
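For orientation, a parser accepting the recognised flags might be declared as in the sketch below. Only the flag names come from this page; the argument types, defaults, and the choice to let --group and --dataset take several values are assumptions, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; types and multiplicities are assumptions."""
    parser = argparse.ArgumentParser(description="Run an ablation experiment")
    parser.add_argument("--group", nargs="*", default=None, help="condition groups to run")
    parser.add_argument("--dataset", nargs="*", default=None, help="dataset keys to run")
    parser.add_argument("--skip-hpo", action="store_true", help="reuse saved backbone params")
    parser.add_argument("--tracking-uri", default=None, help="MLflow tracking URI")
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--num-trials", type=int, default=None)
    parser.add_argument("--folds", type=int, default=None)
    return parser


args = build_parser().parse_args(["--dataset", "MLVS-PT", "--skip-hpo", "--num-trials", "50"])
print(args.dataset, args.skip_hpo, args.num_trials)  # ['MLVS-PT'] True 50
```

Passing an explicit list to `parse_args` mirrors the `argv` parameter, which makes the entry point easy to call from tests.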

run(base_cfg, groups=None, dataset_keys=None, skip_hpo=False, tracking_uri=None)#

Run all conditions × datasets and return a cross-condition summary.

Parameters:

base_cfg (PipelineConfig) – Root PipelineConfig. Dataset-specific keys are applied on top via dataclasses.replace.
groups (list[str] | None) – Condition groups to run. None runs all groups.
dataset_keys (list[str] | None) – Dataset keys from spec.datasets. None runs all datasets.
skip_hpo (bool) – Skip Phase 1 backbone HPO (reuse saved params).
tracking_uri (str | None) – MLflow tracking URI. Falls back to the MLFLOW_TRACKING_URI / TWIGA_MLFLOW_TRACKING_URI env vars. Pass None to disable tracking entirely.

Return type:

DataFrame

Returns:

Summary DataFrame with mean ± std per condition.

class twiga.experiment.ExperimentSpec(name, output_prefix, condition_cls, backbone_cls, conditions, datasets, controlled_fields=<factory>, fixed_overrides=<factory>, cv_train_size=12, cv_test_size=4, cv_val_size=2, cv_calib_size=0, cv_stride=1, cv_folds=10, hemisphere='NH', reference_conditions=<factory>, plot_figures=True, save_condition_plots=True, sample_plot_steps=336)#

Bases: object

Full declaration of a twiga ablation / benchmark experiment.

Pass an instance to ExperimentEngine to run the experiment.

Variables:

name – Human-readable experiment title (used in logs and plot titles).
output_prefix – Prefix for all CSV output files (e.g. "mlgaf_ablation" → mlgaf_ablation_summary.csv).
condition_cls – Model config class instantiated per condition (e.g. MLPGAFConfig).
backbone_cls – Model config class used for Phase 1 backbone HPO. Must be specified explicitly — typically the plain backbone without a probabilistic head (e.g. MLPGAMConfig for a CRC experiment).
conditions – List of Condition objects defining the experimental grid.
datasets – Registry of datasets. Keys are short names used with --dataset; values are dicts of PipelineConfig field overrides (dataset_name, train_start, window_stride, …).
hemisphere – Meteorological hemisphere used when annotating fold seasons in the summary. "NH" (default) uses Northern-Hemisphere conventions (Dec–Feb = Winter). Use "SH" for Southern Hemisphere sites where seasons are reversed.
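The hemisphere switch can be illustrated with a month-to-season lookup. This is a sketch of the meteorological convention described above, not twiga's code; the names `season_for` and `NH_SEASON_BY_MONTH` and the label "Autumn" are assumptions.

```python
NH_SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}
OPPOSITE = {"Winter": "Summer", "Summer": "Winter", "Spring": "Autumn", "Autumn": "Spring"}


def season_for(month: int, hemisphere: str = "NH") -> str:
    """Return the meteorological season for a calendar month (1-12)."""
    if hemisphere not in ("NH", "SH"):
        raise ValueError(f"hemisphere must be 'NH' or 'SH', got {hemisphere!r}")
    label = NH_SEASON_BY_MONTH[month]
    # Southern Hemisphere seasons are the mirror image of the northern ones.
    return label if hemisphere == "NH" else OPPOSITE[label]


print(season_for(1))        # Winter
print(season_for(1, "SH"))  # Summer
```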

CV protocol (all fields default to the standard 10-fold expanding window):

cv_train_size – Initial training window in split_freq units.
cv_test_size – Test window per fold in split_freq units.
cv_val_size – Validation window carved from the training tail.
cv_calib_size – Calibration window for conformal experiments (0 = disabled).
cv_stride – Advance between folds.
cv_folds – Maximum number of folds.

Output:

fixed_overrides – Applied to every model config before backbone params (e.g. {"use_revin": False, "value_embed_type": "ConvEmb"}).
controlled_fields – Stripped from backbone HPO params so ablation overrides always win.
reference_conditions – Maps group name → reference condition name for Δ-vs-reference columns in the summary.
plot_figures – Whether to call save_ablation_plots after the run.
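The CV fields describe an expanding-window split, which the sketch below illustrates with integer period indices. It is one plausible reading of those fields, not the engine's splitter: the function name `expanding_folds` is hypothetical, intervals are half-open `(start, end)`, and the calibration window is left out.

```python
def expanding_folds(
    n_periods: int,
    train_size: int = 12,
    test_size: int = 4,
    val_size: int = 2,
    stride: int = 1,
    max_folds: int = 10,
) -> list:
    """Sketch of an expanding-window split over period indices 0..n_periods-1."""
    folds = []
    for k in range(max_folds):
        train_end = train_size + k * stride  # training window grows each fold
        test_end = train_end + test_size
        if test_end > n_periods:
            break  # not enough data left for a full test window
        folds.append({
            "fit": (0, train_end - val_size),           # periods used for fitting
            "val": (train_end - val_size, train_end),   # carved from the training tail
            "test": (train_end, test_end),
        })
    return folds


folds = expanding_folds(n_periods=24)
print(len(folds))  # 9
print(folds[0])    # {'fit': (0, 10), 'val': (10, 12), 'test': (12, 16)}
```

With 24 periods of data the defaults yield 9 folds rather than 10, which is why cv_folds is documented as a maximum.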

backbone_cls: type#

condition_cls: type#

conditions: list[Condition]#

controlled_fields: frozenset#

cv_calib_size: int = 0#

cv_folds: int = 10#

cv_stride: int = 1#

cv_test_size: int = 4#

cv_train_size: int = 12#

cv_val_size: int = 2#

datasets: dict[str, dict]#

fixed_overrides: dict#

hemisphere: Literal['NH', 'SH'] = 'NH'#

name: str#

output_prefix: str#

plot_figures: bool = True#

reference_conditions: dict[str, str]#

sample_plot_steps: int = 336#

save_condition_plots: bool = True#

class twiga.experiment.Condition(name, group, description='', overrides=<factory>, model_cls=None, hpo_variant='', metric_types=<factory>, conformal_config=None, stage1_epochs_frac=None, calib_source='train_tail')#

Bases: object

One experimental condition — what varies between backtesting runs.

Variables:

name – Short identifier used in filenames and summaries.
group – Experiment group this condition belongs to (e.g. "gating").
description – Human-readable note, shown in logs.
overrides – Key–value pairs applied to the model config after backbone HPO params and fixed overrides. These always win.
model_cls – Override the spec’s condition_cls for this condition. Use for multi-model experiments (e.g. MLPF vs MLPGAM vs MLPGAF).
metric_types – Which evaluation methods to call per fold. Each entry maps to one call: "point" → evaluate_point_forecast; "interval" → evaluate_interval_forecast; "quantile" → evaluate_quantile_forecast. Defaults to ["point"].
conformal_config – When set the forecaster is given these conformal params and calib_size from the spec is used for calibration within each backtesting fold.

calib_source: str = 'train_tail'#

conformal_config: ConformalConfig | None = None#

description: str = ''#

group: str#

hpo_variant: str = ''#

metric_types: list[str]#

model_cls: type | None = None#

name: str#

overrides: dict#

stage1_epochs_frac: float | None = None#

twiga.experiment.run_backbone_hpo(backbone_cls, cfg, data, target_series, calendar_variables, exogenous_features, lags, latitude, longitude, dataset_key, hpo_cache_dir, hpo_variant='')#

Run Optuna HPO for backbone_cls on a fixed 14-month / 2-month split.

Saves best params to <hpo_cache_dir>/<dataset_key>/<model_name>_best_params.json and returns the param dict. The file is shared across runs — params are never recomputed unless the file is deleted.

Parameters:

backbone_cls (type) – Model config class with a from_data_config factory.
cfg (PipelineConfig) – Pipeline config providing train_start, epochs, num_trials.
data (DataFrame) – Full dataset DataFrame (must have a timestamp column).
target_series (str) – Target variable name.
calendar_variables (list) – Calendar feature names.
exogenous_features (list) – Exogenous feature names.
lags (list) – Lag indices.
latitude (float) – Site latitude (used by some feature builders).
longitude (float) – Site longitude.
dataset_key (str) – Short dataset identifier used in the cache path.
hpo_cache_dir (Path) – Root directory for cached HPO params.
hpo_variant (str) – Optional suffix appended to the model name when computing the cache key (e.g. "3group" → mlpgaf_3group_best_params.json). Allows a single config class to have per-variant HPO files.

Return type:

dict

Returns:

Best hyperparameter dict (same format as load_backbone_params()).

twiga.experiment.load_backbone_params(dataset_key, hpo_cache_dir, model_name_str, controlled_fields, fallback_paths=None)#

Load saved backbone HPO params, strip controlled fields and model prefix.

Searches fallback_paths first (in order), then the canonical engine path. Returns an empty dict — with a warning — when no file is found.

Parameters:

dataset_key (str) – Short dataset identifier (e.g. "MLVS-PT").
hpo_cache_dir (Path) – Root directory for cached HPO params (typically <experiment_root>/backbone_hpo).
model_name_str (str) – Model name string (e.g. "mlpgaf").
controlled_fields (frozenset) – Keys to strip from the loaded params so ablation condition overrides always take precedence.
fallback_paths (list[Path] | None) – Additional JSON files to try before the canonical path.

Return type:

dict

Returns:

Dict of hyperparameter names → values, ready to setattr onto a model config object.
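The documented lookup order and clean-up steps can be sketched as below. This is an illustrative re-implementation of the described behaviour, not twiga's code: the name `load_params_sketch` is hypothetical, and the assumption that saved keys carry a `<model>_` prefix is a guess at what "model prefix" means.

```python
import json
import tempfile
import warnings
from pathlib import Path


def load_params_sketch(dataset_key, hpo_cache_dir, model_name_str, controlled_fields,
                       fallback_paths=None) -> dict:
    """Illustrative version of the documented behaviour; not twiga's code."""
    canonical = Path(hpo_cache_dir) / dataset_key / f"{model_name_str}_best_params.json"
    for path in [*(fallback_paths or []), canonical]:  # fallbacks are tried first
        if Path(path).is_file():
            raw = json.loads(Path(path).read_text())
            break
    else:
        warnings.warn(f"no backbone params found for {model_name_str}/{dataset_key}")
        return {}
    prefix = f"{model_name_str}_"  # assumption: keys may carry a "<model>_" prefix
    cleaned = {key.removeprefix(prefix): value for key, value in raw.items()}
    # Controlled fields are dropped so condition overrides always decide them.
    return {key: value for key, value in cleaned.items() if key not in controlled_fields}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "ETTh1" / "mlpgaf_best_params.json"
    cache_file.parent.mkdir(parents=True)
    cache_file.write_text(json.dumps({"mlpgaf_hidden_size": 128, "mlpgaf_use_gate": True}))
    params = load_params_sketch("ETTh1", tmp, "mlpgaf", frozenset({"use_gate"}))
    missing = load_params_sketch("ETTm1", tmp, "mlpgaf", frozenset())

print(params)   # {'hidden_size': 128}
print(missing)  # {}
```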

twiga.experiment.aggregate(combined, prefix, root, suffix, reference_conditions)#

Compute mean ± std per (group, condition, metric_type) and save CSVs.

Parameters:

combined (DataFrame) – Long-form DataFrame with one row per fold/horizon, tagged with group, condition, dataset, and optionally metric_type columns.
prefix (str) – Filename prefix for output CSVs.
root (Path) – Directory to write <prefix>_full<suffix>.csv and <prefix>_summary<suffix>.csv.
suffix (str) – Optional tag appended to filenames (e.g. "_val").
reference_conditions (dict[str, str]) – Maps group → reference condition name for Δ-vs-reference columns.

Return type:

DataFrame

Returns:

Summary DataFrame with MultiIndex (group, condition[, metric_type]) and one column per metric, plus _std variants, n_runs, and Δ columns. Empty DataFrame if no recognised metric columns are present.