Experiment Engine#
Source Files
- twiga/experiment/engine.py - ExperimentEngine orchestration
- twiga/experiment/spec.py - ExperimentSpec dataclass
- twiga/experiment/condition.py - Condition dataclass
- twiga/experiment/hpo.py - run_backbone_hpo, load_backbone_params
- twiga/experiment/analysis.py - aggregate
- twiga/experiment/tracking.py - MLflow bridge
ExperimentEngine is a two-phase orchestrator for structured rolling-origin ablation experiments. It handles backbone HPO, ablation condition evaluation across multiple datasets and CV folds, result aggregation, and MLflow tracking in a single reusable entry point. Experiment scripts define a declarative ExperimentSpec and call ExperimentEngine.run(spec).
Architecture#
graph TD
A[ExperimentSpec] --> B[ExperimentEngine.run]
B --> C{Phase 1: Backbone HPO}
C --> D[run_backbone_hpo per dataset]
D --> E[Optuna study - SQLite]
E --> F[save best_params.npy]
B --> G{Phase 2: Ablation}
G --> H[For each condition]
H --> I[For each dataset]
I --> J[For each CV fold]
J --> K[load_backbone_params]
K --> L[Apply condition overrides]
L --> M[fit + evaluate]
M --> N[log to MLflow]
G --> O[aggregate per group]
O --> P[CSV full + summary]
Phase 1 — Backbone HPO searches the backbone model’s hyperparameter space using Optuna and saves the best parameters to disk. This phase runs once per dataset and is skipped automatically on subsequent runs when the cached result exists.
Phase 2 — Ablation evaluates each condition in ExperimentSpec.conditions across all datasets and CV folds using the backbone parameters from Phase 1 as the starting point. Each condition can selectively override any config field. Results are aggregated across folds and saved as CSV files.
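The control flow of the two phases can be sketched as follows. This is a minimal illustration rather than the engine's implementation: `run_two_phase`, `tune_backbone`, `evaluate_fold`, and the in-memory `cache` dict are hypothetical stand-ins for the Optuna search, the fit/evaluate step, and the on-disk HPO cache.

```python
from typing import Callable


def run_two_phase(
    datasets: list[str],
    conditions: dict[str, dict],
    n_folds: int,
    tune_backbone: Callable[[str], dict],
    evaluate_fold: Callable[[dict, str, int], float],
    cache: dict[str, dict],
) -> list[dict]:
    """Hypothetical sketch of the Phase 1 / Phase 2 control flow."""
    # Phase 1: tune once per dataset, skipping datasets that are already cached.
    for dataset in datasets:
        if dataset not in cache:
            cache[dataset] = tune_backbone(dataset)

    # Phase 2: condition -> dataset -> fold, starting from the cached params.
    rows = []
    for cond_name, overrides in conditions.items():
        for dataset in datasets:
            for fold in range(n_folds):
                params = {**cache[dataset], **overrides}  # condition overrides win
                score = evaluate_fold(params, dataset, fold)
                rows.append({"condition": cond_name, "dataset": dataset,
                             "fold": fold, "score": score})
    return rows


# Toy stand-ins so the sketch runs end to end.
calls = []

def fake_tune(dataset: str) -> dict:
    calls.append(dataset)
    return {"hidden": 64, "use_gate": True}

def fake_eval(params: dict, dataset: str, fold: int) -> float:
    return 1.0 if params["use_gate"] else 2.0

cache = {"ETTh1": {"hidden": 32, "use_gate": True}}  # pretend ETTh1 is already tuned
rows = run_two_phase(
    ["ETTh1", "ETTm1"],
    {"baseline": {}, "no_gate": {"use_gate": False}},
    n_folds=2,
    tune_backbone=fake_tune,
    evaluate_fold=fake_eval,
    cache=cache,
)
```

Only `ETTm1` is tuned in this toy run because `ETTh1` is already present in the cache, which mirrors the skip-on-rerun behaviour described for Phase 1.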
Defining an Experiment#
ExperimentSpec#
ExperimentSpec declares the full experiment: which backbone to tune, which datasets to run on, and which conditions to compare.
from twiga.core.config import DataPipelineConfig
from twiga.experiment import ExperimentSpec, Condition
spec = ExperimentSpec(
    name="ganf_ablation",
    backbone_cls=MLPGAFConfig,
    datasets={"ETTh1": (train_df, val_df, test_df), "ETTm1": (...)},
    pipeline_cfg=DataPipelineConfig(...),
    conditions=[
        Condition(name="baseline", overrides={}),
        Condition(name="no_gate", overrides={"use_gate": False}),
        Condition(name="3group", overrides={"group_scheme": "3group"}),
        Condition(name="input_gate", overrides={"gate_position": "input"}),
    ],
    results_dir="results/ganf_ablation",
    hpo_trials=30,
)
| Field | Type | Description |
|---|---|---|
| name | | Experiment name. Used as the MLflow experiment name and file prefix |
| backbone_cls | type | Config class for the backbone model (MLPGAFConfig in the example above) |
| datasets | | Maps dataset name to its data (a (train_df, val_df, test_df) tuple in the example above) |
| pipeline_cfg | | Shared data pipeline configuration |
| conditions | | Ablation conditions to evaluate |
| results_dir | | Root directory for CSV output and HPO caches |
| hpo_trials | | Number of Optuna trials for backbone HPO |
| | | Cross-validation split settings. Defaults to single train/val/test |
Condition#
Each Condition names one ablation arm and specifies field overrides applied on top of the backbone config.
Condition(
    name="latent_gate_2g",
    overrides={
        "group_scheme": "2group",
        "gate_position": "latent",
        "fusion_type": "gated",
    },
    hpo_variant=None,  # shares backbone HPO cache with other 2group conditions
)
| Field | Type | Description |
|---|---|---|
| name | | Unique condition identifier. Appears in result DataFrames and MLflow run names |
| overrides | | Config fields to override. Applied after backbone HPO params are loaded |
| hpo_variant | | When set, selects a separate HPO cache key. Use when an override changes architecture in a way that invalidates backbone params (e.g. a different group_scheme or gate_position) |
When to set hpo_variant
Two conditions that differ only in a regularisation weight can share backbone HPO results. Two conditions that differ in group_scheme or gate_position produce different network shapes and should use separate HPO caches. Set hpo_variant to a short descriptive string (e.g. "3group") to create an isolated cache for that arm.
Running an Experiment#
From Python#
from twiga.experiment import ExperimentEngine
engine = ExperimentEngine()
engine.run(spec)
run() executes both phases in order and writes results under spec.results_dir.
From the command line#
ExperimentEngine.cli_main provides a minimal CLI entry point. Experiment scripts expose it directly:
python experiment/ganf_ablation.py --datasets ETTh1 ETTm1 --hpo-trials 50
The script calls:
if __name__ == "__main__":
    ExperimentEngine.cli_main(spec)
Config Precedence#
When ExperimentEngine builds the model config for a condition it merges four layers in order (lowest to highest priority):
1. Config class defaults
2. ExperimentSpec-level overrides (applied to every condition)
3. Backbone HPO best parameters loaded from cache
4. Condition.overrides
This means a condition override always wins over HPO-tuned params, which in turn always win over spec-level settings.
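The layering can be illustrated with a small sketch. `ToyConfig` and `build_config` are hypothetical names, and the field values are made up; the only point is that each later layer overwrites the earlier ones, in the order listed above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical model config; field names are illustrative only."""
    hidden_size: int = 64
    dropout: float = 0.1
    use_gate: bool = True


def build_config(spec_overrides: dict, hpo_params: dict,
                 condition_overrides: dict) -> ToyConfig:
    cfg = ToyConfig()  # layer 1: config class defaults
    # Layers 2-4 are applied in order, so a later layer replaces earlier values.
    for layer in (spec_overrides, hpo_params, condition_overrides):
        cfg = replace(cfg, **layer)
    return cfg


cfg = build_config(
    spec_overrides={"dropout": 0.0, "hidden_size": 32},
    hpo_params={"hidden_size": 128, "use_gate": True},
    condition_overrides={"use_gate": False},
)
# hidden_size comes from HPO (128 beats the spec-level 32),
# dropout from the spec level (0.0 beats the default 0.1),
# use_gate from the condition (False beats the HPO value True).
```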
HPO Cache#
Backbone HPO results are cached as .npy files under results_dir. The cache key is {backbone_cls.name}_{dataset}_{hpo_variant or "default"}. Re-running a script with --hpo-trials greater than the previous run will resume the existing Optuna study and add more trials; it will not restart from scratch.
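As a small illustration of the key format quoted above, the helper below builds the same string. `hpo_cache_key` is a hypothetical function written for this example and is not part of twiga; it assumes the backbone name is already available as a plain string.

```python
from typing import Optional


def hpo_cache_key(backbone_name: str, dataset: str,
                  hpo_variant: Optional[str] = None) -> str:
    """Hypothetical helper mirroring {backbone}_{dataset}_{hpo_variant or "default"}."""
    # None and the empty string both fall back to the shared "default" cache.
    return f"{backbone_name}_{dataset}_{hpo_variant or 'default'}"


shared = hpo_cache_key("mlpgaf", "ETTh1")
isolated = hpo_cache_key("mlpgaf", "ETTh1", hpo_variant="3group")
```

Conditions that leave hpo_variant unset resolve to the same key and therefore reuse one set of backbone params, while a condition with hpo_variant="3group" gets its own entry.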
Load cached params directly with:
from twiga.experiment import load_backbone_params
params = load_backbone_params(
    results_dir="results/ganf_ablation",
    backbone_name="mlpgaf",
    dataset="ETTh1",
    hpo_variant=None,
)
MLflow Tracking#
Set MLFLOW_TRACKING_URI before running to enable automatic experiment tracking:
export MLFLOW_TRACKING_URI=http://localhost:5000
python experiment/ganf_ablation.py
The run hierarchy mirrors the experiment structure:
Experiment: <spec.name>
  Parent run: <dataset>
    Child run: hpo/<dataset>/<model>  (run_backbone_hpo())
    Child run: <condition>/<dataset>/fold_N
All condition overrides, fold metrics, and HPO parameters are logged as MLflow params and metrics automatically. No additional instrumentation is needed in the experiment script.
detect_tracking_uri is a safe helper that returns the configured URI or None if MLflow is absent or unconfigured:
from twiga.experiment import detect_tracking_uri
uri = detect_tracking_uri() # None if not configured
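A helper with this behaviour could look roughly like the sketch below. It is not twiga's implementation: `sketch_detect_tracking_uri` is a hypothetical name, and both the set of environment variables checked and their order are assumptions made for the example.

```python
import importlib.util
import os
from typing import Mapping, Optional

# Assumed lookup order; the real helper may check different variables.
ENV_VARS = ("MLFLOW_TRACKING_URI", "TWIGA_MLFLOW_TRACKING_URI")


def sketch_detect_tracking_uri(
    env: Mapping[str, str] = os.environ,
    mlflow_available: Optional[bool] = None,
) -> Optional[str]:
    """Return a configured tracking URI, or None if MLflow is absent or unconfigured."""
    if mlflow_available is None:
        # Check for the package without importing it.
        mlflow_available = importlib.util.find_spec("mlflow") is not None
    if not mlflow_available:
        return None
    for name in ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


configured = sketch_detect_tracking_uri(
    {"MLFLOW_TRACKING_URI": "http://localhost:5000"}, mlflow_available=True
)
unconfigured = sketch_detect_tracking_uri({}, mlflow_available=True)
```

Returning None instead of raising lets calling code treat tracking as optional and skip logging when nothing is configured.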
Aggregating Results#
aggregate computes mean and standard deviation per (group, condition, metric_type) across folds and writes two CSV files: a full long-form table and a summary table.
from twiga.experiment import aggregate
from pathlib import Path
summary = aggregate(
    combined=full_df,
    prefix="ganf_ablation",
    root=Path("results/ganf_ablation"),
    suffix="_val",
    reference_conditions={"ETTh1": "baseline", "ETTm1": "baseline"},
)
ExperimentEngine.run calls aggregate automatically at the end of Phase 2. Call it manually only when post-processing results outside an engine run.
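To make the aggregation concrete, here is a dependency-free sketch of the same idea on toy data. It is not the library's code: `summarise` and `add_deltas` are hypothetical helpers, the `mae_delta` column name is invented, and the use of the sample standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise(rows: list[dict], metric: str = "mae") -> dict:
    """Mean, std, and run count per (group, condition, metric_type)."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["group"], row["condition"], row.get("metric_type", "point"))
        groups[key].append(row[metric])
    return {
        key: {
            metric: mean(values),
            # Sample standard deviation; assumed, not confirmed by the source.
            f"{metric}_std": stdev(values) if len(values) > 1 else 0.0,
            "n_runs": len(values),
        }
        for key, values in groups.items()
    }


def add_deltas(summary: dict, reference_conditions: dict, metric: str = "mae") -> dict:
    """Add a difference-to-reference entry for each group that has a reference."""
    for (group, _condition, mtype), stats in summary.items():
        ref_stats = summary.get((group, reference_conditions.get(group), mtype))
        if ref_stats is not None:
            stats[f"{metric}_delta"] = stats[metric] - ref_stats[metric]
    return summary


rows = [
    {"group": "gating", "condition": "baseline", "fold": 0, "mae": 1.0},
    {"group": "gating", "condition": "baseline", "fold": 1, "mae": 3.0},
    {"group": "gating", "condition": "no_gate", "fold": 0, "mae": 2.0},
    {"group": "gating", "condition": "no_gate", "fold": 1, "mae": 4.0},
]
summary = add_deltas(summarise(rows), {"gating": "baseline"})
```

In this toy example the no_gate arm has a mean MAE of 3.0 against 2.0 for the baseline, so its delta is 1.0, and the reference condition's own delta is 0.0.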
Visualising Results#
Use twiga.core.plot.evaluation to plot aggregated results. See Visualization for the full plotting API.
from pathlib import Path
from twiga.core.plot import load_ablation, plot_summary_heatmap, save_ablation_plots
result = load_ablation(
    root=Path("results/ganf_ablation"),
    prefix="ganf_ablation",
    suffix="_val",
)
p = plot_summary_heatmap(result.summary, metric="mae", reference="baseline")
p
save_ablation_plots(result, output_dir=Path("figures/ganf_ablation"), reference="baseline")
API Reference#
- class twiga.experiment.ExperimentEngine(spec)#
Bases: object
Runs an ExperimentSpec end to end.
Usage:
    engine = ExperimentEngine(SPEC)
    engine.cli_main(base_cfg=PipelineConfig(...))
Or programmatically:
    summary = engine.run(base_cfg, groups=["gating"], dataset_keys=["MLVS-PT"])
- cli_main(base_cfg, argv=None)#
Parse CLI args then call run().
Recognised flags: --group, --dataset, --skip-hpo, --tracking-uri, --epochs, --num-trials, --folds.
- run(base_cfg, groups=None, dataset_keys=None, skip_hpo=False, tracking_uri=None)#
Run all conditions × datasets and return a cross-condition summary.
- Parameters:
  - base_cfg (PipelineConfig) – Root PipelineConfig. Dataset-specific keys are applied on top via dataclasses.replace.
  - groups (list[str] | None) – Condition groups to run. None runs all groups.
  - dataset_keys (list[str] | None) – Dataset keys from spec.datasets. None runs all datasets.
  - skip_hpo (bool) – Skip Phase 1 backbone HPO (reuse saved params).
  - tracking_uri (str | None) – MLflow tracking URI. Falls back to the MLFLOW_TRACKING_URI / TWIGA_MLFLOW_TRACKING_URI env vars. Pass None to disable tracking entirely.
- Return type: DataFrame
- Returns: Summary DataFrame with mean ± std per condition.
- class twiga.experiment.ExperimentSpec(name, output_prefix, condition_cls, backbone_cls, conditions, datasets, controlled_fields=<factory>, fixed_overrides=<factory>, cv_train_size=12, cv_test_size=4, cv_val_size=2, cv_calib_size=0, cv_stride=1, cv_folds=10, hemisphere='NH', reference_conditions=<factory>, plot_figures=True, save_condition_plots=True, sample_plot_steps=336)#
Bases: object
Full declaration of a twiga ablation / benchmark experiment.
Pass an instance to ExperimentEngine to run the experiment.
- Variables:
  - name – Human-readable experiment title (used in logs and plot titles).
  - output_prefix – Prefix for all CSV output files (e.g. "mlgaf_ablation" → mlgaf_ablation_summary.csv).
  - condition_cls – Model config class instantiated per condition (e.g. MLPGAFConfig).
  - backbone_cls – Model config class used for Phase 1 backbone HPO. Must be specified explicitly; typically the plain backbone without a probabilistic head (e.g. MLPGAMConfig for a CRC experiment).
  - conditions – List of Condition objects defining the experimental grid.
  - datasets – Registry of datasets. Keys are short names used with --dataset; values are dicts of PipelineConfig field overrides (dataset_name, train_start, window_stride, …).
  - hemisphere – Meteorological hemisphere used when annotating fold seasons in the summary. "NH" (default) uses Northern-Hemisphere conventions (Dec–Feb = Winter). Use "SH" for Southern Hemisphere sites where seasons are reversed.
- CV protocol (all fields default to the standard 10-fold expanding window):
  - cv_train_size: Initial training window in split_freq units.
  - cv_test_size: Test window per fold in split_freq units.
  - cv_val_size: Validation window carved from the training tail.
  - cv_calib_size: Calibration window for conformal experiments (0 = disabled).
  - cv_stride: Advance between folds.
  - cv_folds: Maximum number of folds.
- Output:
  - fixed_overrides: Applied to every model config before backbone params (e.g. {"use_revin": False, "value_embed_type": "ConvEmb"}).
  - controlled_fields: Stripped from backbone HPO params so ablation overrides always win.
  - reference_conditions: Maps group name → reference condition name for Δ-vs-reference columns in the summary.
  - plot_figures: Whether to call save_ablation_plots after the run.
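One way to read the CV protocol fields is sketched below, using abstract integer time units in place of split_freq units. This is an interpretation rather than the engine's splitter: `expanding_folds` and `Fold` are hypothetical names, and the exact boundary conventions (training always starts at unit 0, validation taken from the last units of the training window, no calibration window) are assumptions.

```python
from typing import NamedTuple


class Fold(NamedTuple):
    train: range
    val: range
    test: range


def expanding_folds(n_units: int, train_size: int = 12, test_size: int = 4,
                    val_size: int = 2, stride: int = 1,
                    max_folds: int = 10) -> list[Fold]:
    """Hypothetical expanding-window splitter over integer time units."""
    folds = []
    for k in range(max_folds):
        train_end = train_size + k * stride  # training window grows each fold
        test_end = train_end + test_size
        if test_end > n_units:               # stop when the data runs out
            break
        folds.append(Fold(
            train=range(0, train_end - val_size),
            val=range(train_end - val_size, train_end),  # carved from the training tail
            test=range(train_end, test_end),
        ))
    return folds


folds = expanding_folds(n_units=20)
```

With 20 units of data and the default sizes, only five folds fit, which is why cv_folds is described as a maximum rather than an exact count.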
- class twiga.experiment.Condition(name, group, description='', overrides=<factory>, model_cls=None, hpo_variant='', metric_types=<factory>, conformal_config=None, stage1_epochs_frac=None, calib_source='train_tail')#
Bases: object
One experimental condition: what varies between backtesting runs.
- Variables:
  - name – Short identifier used in filenames and summaries.
  - group – Experiment group this condition belongs to (e.g. "gating").
  - description – Human-readable note, shown in logs.
  - overrides – Key–value pairs applied to the model config after backbone HPO params and fixed overrides. These always win.
  - model_cls – Override the spec’s condition_cls for this condition. Use for multi-model experiments (e.g. MLPF vs MLPGAM vs MLPGAF).
  - metric_types – Which evaluation methods to call per fold. Each entry maps to one call: "point" → evaluate_point_forecast; "interval" → evaluate_interval_forecast; "quantile" → evaluate_quantile_forecast. Defaults to ["point"].
  - conformal_config – When set, the forecaster is given these conformal params and calib_size from the spec is used for calibration within each backtesting fold.
- conformal_config: ConformalConfig | None = None#
- twiga.experiment.run_backbone_hpo(backbone_cls, cfg, data, target_series, calendar_variables, exogenous_features, lags, latitude, longitude, dataset_key, hpo_cache_dir, hpo_variant='')#
Run Optuna HPO for backbone_cls on a fixed 14-month / 2-month split.
Saves best params to <hpo_cache_dir>/<dataset_key>/<model_name>_best_params.json and returns the param dict. The file is shared across runs; params are never recomputed unless the file is deleted.
- Parameters:
  - backbone_cls (type) – Model config class with a from_data_config factory.
  - cfg (PipelineConfig) – Pipeline config providing train_start, epochs, num_trials.
  - data (DataFrame) – Full dataset DataFrame (must have a timestamp column).
  - target_series (str) – Target variable name.
  - calendar_variables (list) – Calendar feature names.
  - exogenous_features (list) – Exogenous feature names.
  - lags (list) – Lag indices.
  - latitude (float) – Site latitude (used by some feature builders).
  - longitude (float) – Site longitude.
  - dataset_key (str) – Short dataset identifier used in the cache path.
  - hpo_cache_dir (Path) – Root directory for cached HPO params.
  - hpo_variant (str) – Optional suffix appended to the model name when computing the cache key (e.g. "3group" → mlpgaf_3group_best_params.json). Allows a single config class to have per-variant HPO files.
- Return type: dict
- Returns: Best hyperparameter dict (same format as load_backbone_params()).
- twiga.experiment.load_backbone_params(dataset_key, hpo_cache_dir, model_name_str, controlled_fields, fallback_paths=None)#
Load saved backbone HPO params, strip controlled fields and model prefix.
Searches fallback_paths first (in order), then the canonical engine path. Returns an empty dict — with a warning — when no file is found.
- Parameters:
  - dataset_key (str) – Short dataset identifier (e.g. "MLVS-PT").
  - hpo_cache_dir (Path) – Root directory for cached HPO params (typically <experiment_root>/backbone_hpo).
  - model_name_str (str) – Model name string (e.g. "mlpgaf").
  - controlled_fields (frozenset) – Keys to strip from the loaded params so ablation condition overrides always take precedence.
  - fallback_paths (list[Path] | None) – Additional JSON files to try before the canonical path.
- Return type: dict
- Returns: Dict of hyperparameter names → values, ready to setattr onto a model config object.
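The lookup and stripping behaviour described for load_backbone_params can be sketched as follows. This is an illustration under assumptions, not the library's code: `sketch_load_params` is a hypothetical name, the canonical path follows the pattern given for run_backbone_hpo, and the "model prefix" is assumed to be the model name followed by an underscore.

```python
import json
import tempfile
import warnings
from pathlib import Path


def sketch_load_params(dataset_key, hpo_cache_dir, model_name_str,
                       controlled_fields, fallback_paths=None):
    """Hypothetical re-creation of the lookup order and key stripping."""
    canonical = Path(hpo_cache_dir) / dataset_key / f"{model_name_str}_best_params.json"
    # Fallback files are tried first, in order, then the canonical path.
    for path in [*(fallback_paths or []), canonical]:
        if Path(path).is_file():
            raw = json.loads(Path(path).read_text())
            break
    else:
        warnings.warn(f"No backbone params found for {dataset_key}/{model_name_str}")
        return {}

    prefix = f"{model_name_str}_"  # assumed prefix format
    params = {}
    for key, value in raw.items():
        name = key[len(prefix):] if key.startswith(prefix) else key
        if name not in controlled_fields:  # ablation-controlled keys are dropped
            params[name] = value
    return params


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    (cache_dir / "ETTh1").mkdir()
    (cache_dir / "ETTh1" / "mlpgaf_best_params.json").write_text(
        json.dumps({"mlpgaf_hidden_size": 128, "mlpgaf_use_gate": True, "dropout": 0.1})
    )
    loaded = sketch_load_params("ETTh1", cache_dir, "mlpgaf", frozenset({"use_gate"}))
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        missing = sketch_load_params("ETTm1", cache_dir, "mlpgaf", frozenset())
```

Stripping use_gate here means the tuned value can never shadow a condition that sets use_gate explicitly, which is the purpose stated for controlled_fields.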
- twiga.experiment.aggregate(combined, prefix, root, suffix, reference_conditions)#
Compute mean ± std per (group, condition, metric_type) and save CSVs.
- Parameters:
  - combined (DataFrame) – Long-form DataFrame with one row per fold/horizon, tagged with group, condition, dataset, and optionally metric_type columns.
  - prefix (str) – Filename prefix for output CSVs.
  - root (Path) – Directory to write <prefix>_full<suffix>.csv and <prefix>_summary<suffix>.csv.
  - suffix (str) – Optional tag appended to filenames (e.g. "_val").
  - reference_conditions (dict[str, str]) – Maps group → reference condition name for Δ-vs-reference columns.
- Return type: DataFrame
- Returns: Summary DataFrame with MultiIndex (group, condition[, metric_type]) and one column per metric, plus _std variants, n_runs, and Δ columns. Empty DataFrame if no recognised metric columns are present.