API Reference

Complete reference for all public classes, functions, and exceptions exported by twiga.

Note

Symbols marked stable follow semantic versioning. Symbols marked experimental may change in minor versions.


Entry Point

class twiga.forecaster.core.TwigaForecaster(data_params, model_params, cv_params=None, conformal_params=None, training_params=None)

Bases: BaseForecaster

Machine Learning Forecaster for time series predictions.

This forecaster initializes a data pipeline and dynamically loads machine learning models based on provided configurations. The configurations can be specified as Pydantic models or dictionaries. Once the models are loaded, they can be trained, evaluated, and backtested.

Example

>>> from twiga.core.config import BaseModelConfig, DataPipelineConfig, ExperimentConfig
>>> from twiga.forecaster.core import TwigaForecaster
>>> data_params = DataPipelineConfig(date_column="date", ...)
>>> model_config = BaseModelConfig(name="linear", ...)
>>> cv_params = ExperimentConfig(...)
>>> forecaster = TwigaForecaster(data_params, model_config, cv_params)
>>> forecaster.fit(train_df)
>>> predictions, metrics = forecaster.evaluate_point_forecast(test_df)
__init__(data_params, model_params, cv_params=None, conformal_params=None, training_params=None)

Initialize TwigaForecaster.

Parameters:
  • data_params (DataPipelineConfig) – Configuration for the data pipeline.

  • model_params (BaseModelConfig | list[BaseModelConfig] | dict | list[dict]) – Configuration for the model(s). Can be a single Pydantic config, a dictionary, or a list of either. Neural network configs with unset dims (num_target_feature, forecast_horizon, lookback_window_size equal to 0) are auto-populated from data_params. Base arch configs with a distribution field set are automatically resolved to the corresponding probabilistic variant (e.g. MLPFConfig(distribution='normal') becomes an MLPFNormalConfig).

  • cv_params (ExperimentConfig | None) – Cross-validation and experiment configuration. Defaults to ExperimentConfig() when not provided.

  • conformal_params (ConformalConfig | None) – Optional conformal prediction configuration.

  • training_params (NeuralTrainingConfig | None) – Training infrastructure overrides applied to all NN model configs (e.g. NeuralTrainingConfig(early_stop_patience=None, max_epochs=50)). Only non-None fields are applied.

calibrate(calibrate_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None)

Calibrate conformal prediction models using calibration data.

Parameters:
  • calibrate_df (DataFrame | None) – Calibration dataset. If None, uses the stored training data.

  • covariate_df (DataFrame | None) – Optional covariate dataset.

  • ensemble_strategy (str | None) – Strategy for combining model predictions.

  • ensemble_weights (dict[str, float] | None) – Weights for weighted ensemble strategy.

Raises:

ValueError – If conformal_params is not set.

Return type:

None

explain(X, model_idx=0, n_background=100)

Compute SHAP feature attributions for a fitted ML model.

Builds a ShapExplainer for the model at position model_idx in self.models, runs SHAP over X, and returns a ShapResult with values reshaped to (B, L, F) - one attribution per sample, per lookback step, per feature.

Only ML models (domain="ml") are supported. Neural-network models require gradient-based attribution and are not currently handled.

Parameters:
  • X (ndarray) – Feature array of shape (B, L, F) as produced by the data pipeline (e.g. from DataPipeline.transform()).

  • model_idx (int) – Index into self.models of the model to explain. Defaults to 0 (the first / only model).

  • n_background (int) – Number of background samples for LinearExplainer and KernelExplainer. Ignored for tree models.

Return type:

ShapResult

Returns:

ShapResult with –

  • values - SHAP array (B, L, F)

  • feature_names - original F feature names

  • timestep_labels - L lookback labels ('t-L+1' through 't0')

  • expected_value - SHAP base value (mean prediction)

Raises:

Example

>>> result = forecaster.explain(X_test)
>>> result.plot_importance(top_n=20)
>>> importance = result.mean_importance()
classmethod quick(target, period, horizon, model='catboost', distribution=None, lookback=None, calendar=None, scaler='standard', seed=42)

Minimal factory for getting started quickly.

Builds DataPipelineConfig and a model config from a handful of plain-Python arguments, then returns a ready-to-use TwigaForecaster.

Parameters:
  • target (str | list[str]) – Target column name(s) to forecast.

  • period (str) – Sampling frequency (pandas offset alias, e.g. "1h", "30min").

  • horizon (int) – Number of future steps to predict.

  • model (str) – Model name registered in the model registry (e.g. "catboost", "lightgbm", "mlpf"). Defaults to "catboost".

  • distribution (str | None) – Probabilistic distribution variant for base NN architectures (e.g. "normal", "laplace"). Pass this together with a base arch name like "mlpf" or "nhits" to select the matching probabilistic variant automatically. Defaults to None (point forecast).

  • lookback (int | None) – Lookback window size. Defaults to max(2 * horizon, 24).

  • calendar (list[str] | None) – Calendar feature names (e.g. ["hour", "day_of_week"]). Defaults to None.

  • scaler (str) – Target scaler identifier (e.g. "standard", "minmax"). Defaults to "standard".

  • seed (int) – Global random seed. Defaults to 42.

Return type:

TwigaForecaster

Returns:

Configured TwigaForecaster ready for fit().

Example:

forecaster = TwigaForecaster.quick(
    target="load_kw",
    period="1h",
    horizon=24,
    model="lightgbm",
    calendar=["hour", "day_of_week"],
)
forecaster.fit(train_df)
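
The documented lookback default, max(2 * horizon, 24), can be checked with a small helper. This is only an illustrative sketch: default_lookback is a hypothetical name that is not part of twiga, and the guard against non-positive horizons is an assumption.

```python
def default_lookback(horizon: int) -> int:
    """Mirror the documented default: max(2 * horizon, 24)."""
    # Assumption: a non-positive horizon is rejected; the reference does not say.
    if horizon <= 0:
        raise ValueError("horizon must be a positive integer")
    return max(2 * horizon, 24)


# A 24-step horizon gets a 48-step lookback; short horizons fall back to 24.
day_ahead = default_lookback(24)
six_steps = default_lookback(6)
```

With hourly data, this means horizons of 12 steps or fewer all share the same 24-step lookback unless lookback is passed explicitly.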

Configuration

class twiga.core.config.DataPipelineConfig(**data)

Bases: BaseModel

Configuration for a time-series data pipeline.

Captures everything the pipeline needs to know about the raw dataset: which column to forecast, which features are available, how long the lookback and forecast windows are, what scalers to apply, and which lag/rolling-window features to engineer.

Feature category guide — classify each feature by when its values are available:

Parameters:
  • target_feature (list[str] | str) – Target variable name(s) to forecast.

  • period (str) – Sampling frequency using pandas offset aliases (e.g. "1H", "30min").

  • lookback_window_size (int | str) – Number of past timesteps fed to the model as input, or a pandas-compatible duration string that is converted to timesteps using period (e.g. "7D" with period="1H" gives 168 steps).

  • forecast_horizon (int | str) – Number of future timesteps to predict, or a duration string converted in the same way (e.g. "1D" with period="30min" gives 48 steps).

  • latitude (float | None, optional) – Latitude for day/night feature calculation. Defaults to None.

  • longitude (float | None, optional) – Longitude for day/night feature calculation. Defaults to None.

  • past_features (list[str] | None, optional) – Features available only in the lookback window (unknown in the forecast horizon). Defaults to None.

  • calendar_features (list[CalendarFeature] | None, optional) – Temporal features derived from the timestamp column. Accepts raw component names (e.g. "hour", "wday") or Fourier-encoded column names auto-selected by the dataset loader (e.g. "hour_cosin", "yweek_cos"). See CalendarFeature for all valid values. Defaults to None.

  • known_future_features (list[str] | None, optional) – Features known over the full lookback + forecast horizon (e.g. weather forecast that was also recorded historically). Defaults to None.

  • forecast_period_features (list[str] | None, optional) – Features known only during the forecast horizon (e.g. scheduled load). Defaults to None.

  • input_scaler (ScalerType, optional) – Scaler applied to input features. Defaults to "passthrough".

  • target_scaler (ScalerType, optional) – Scaler applied to the target variable. Defaults to "standard".

  • lags (list[int] | None, optional) – Lag intervals in periods for feature engineering. Defaults to None.

  • windows (list[int] | int | None, optional) – Window sizes for rolling statistics. Defaults to None.

  • window_funcs (list[str] | str | None, optional) – Aggregation functions applied to rolling windows (e.g. "mean", "std"). Defaults to None.

  • date_column (str, optional) – Name of the datetime column. Defaults to "timestamp".

  • window_stride (int, optional) – Step between consecutive sliding windows. 1 = fully overlapping (maximum data augmentation). Set to forecast_horizon for non-overlapping windows — recommended for baseline evaluation. Defaults to 1.
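
The conversion of duration strings into timesteps described for lookback_window_size and forecast_horizon can be sketched with the standard library. The parser below is a deliberately small stand-in that understands only the three units used in the examples above; parse_duration and to_timesteps are hypothetical names, and twiga's own conversion may accept many more pandas aliases.

```python
import re
from datetime import timedelta
from typing import Union

# Hypothetical unit table covering only the aliases used in the examples above.
_UNITS = {
    "min": timedelta(minutes=1),
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "7D", "1H", or "30min" into a timedelta."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    count, unit = int(match.group(1)), match.group(2).lower()
    if unit not in _UNITS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return count * _UNITS[unit]


def to_timesteps(window: Union[int, str], period: str) -> int:
    """Return window as a number of timesteps of length period."""
    if isinstance(window, int):
        return window  # already expressed in timesteps
    duration, step = parse_duration(window), parse_duration(period)
    # Assumption: a window that is not a whole number of periods is an error.
    if duration % step:
        raise ValueError(f"{window!r} is not a whole multiple of {period!r}")
    return duration // step


week_of_hours = to_timesteps("7D", "1H")
day_of_half_hours = to_timesteps("1D", "30min")
```

The two results reproduce the figures quoted in the parameter descriptions: 168 steps and 48 steps.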

calendar_features: list[Literal['index_num', 'year', 'year_iso', 'yearstart', 'yearend', 'leapyear', 'half', 'quarter', 'quarterstart', 'quarterend', 'month', 'monthstart', 'monthend', 'yweek', 'mweek', 'wday', 'mday', 'qday', 'yday', 'weekend', 'hour', 'minute', 'second', 'msecond', 'nsecond', 'day_night', 'half_sin', 'half_cos', 'half_cosin', 'quarter_sin', 'quarter_cos', 'quarter_cosin', 'month_sin', 'month_cos', 'month_cosin', 'yweek_sin', 'yweek_cos', 'yweek_cosin', 'mweek_sin', 'mweek_cos', 'mweek_cosin', 'wday_sin', 'wday_cos', 'wday_cosin', 'mday_sin', 'mday_cos', 'mday_cosin', 'qday_sin', 'qday_cos', 'qday_cosin', 'yday_sin', 'yday_cos', 'yday_cosin', 'hour_sin', 'hour_cos', 'hour_cosin', 'minute_sin', 'minute_cos', 'minute_cosin', 'second_sin', 'second_cos', 'second_cosin']] | None
date_column: str
forecast_horizon: int
forecast_period_features: list[str] | None
input_scaler: Literal['standard', 'minmax', 'robust', 'maxabs', 'normalizer', 'quantile_uniform', 'quantile_normal', 'power_yeo_johnson', 'power_box_cox', 'passthrough']
known_future_features: list[str] | None
lags: list[int] | None
latitude: float | None
longitude: float | None
lookback_window_size: int
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

n_jobs: int
past_features: list[str] | None
period: str
recommended_lookback_search_space(max_multiplier=7, *, scalers=True)

Return a search space for lookback window size relative to this config’s forecast horizon.

The range is [forecast_horizon, max_multiplier * forecast_horizon], which guarantees the lookback is always at least as long as the forecast horizon. Both forecast_horizon and period are read from the config instance, so string horizons (e.g. "1D") are resolved correctly before the range is computed.

Parameters:
  • max_multiplier (int) – Upper bound expressed as a multiple of forecast_horizon. Defaults to 7, giving a search range of 1x to 7x the horizon.

  • scalers (bool) – When True, include categorical choices for input_scaler and target_scaler alongside the lookback range. Defaults to True.

Return type:

BaseSearchSpace

Returns:

A BaseSearchSpace ready to assign to search_space.

Example:

cfg = DataPipelineConfig(
    target_feature="load",
    period="30min",
    forecast_horizon="1D",  # resolved to 48 steps
    lookback_window_size=48,  # initial value, overridden by HPO
)
cfg = cfg.model_copy(update={"search_space": cfg.recommended_lookback_search_space()})
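
The bounds of the lookback range follow directly from the description: [forecast_horizon, max_multiplier * forecast_horizon]. A minimal sketch of that arithmetic, using a hypothetical helper name and assuming the horizon has already been resolved to an integer number of steps:

```python
def lookback_bounds(forecast_horizon: int, max_multiplier: int = 7) -> tuple[int, int]:
    """Return the (low, high) lookback range described above."""
    # Assumption: both arguments must be at least 1.
    if forecast_horizon < 1 or max_multiplier < 1:
        raise ValueError("forecast_horizon and max_multiplier must be >= 1")
    return (forecast_horizon, max_multiplier * forecast_horizon)


# "1D" at a 30-minute period resolves to 48 steps, so the default range is 48..336.
bounds = lookback_bounds(48)
```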
search_space: BaseSearchSpace | None
target_feature: list[str] | str
target_scaler: Literal['standard', 'minmax', 'robust', 'maxabs', 'normalizer', 'quantile_uniform', 'quantile_normal', 'power_yeo_johnson', 'power_box_cox', 'passthrough']
window_funcs: list[str] | str | None
window_stride: int
windows: list[int] | int | None
class twiga.core.config.ExperimentConfig(**data)

Bases: BaseModel

Configuration for the forecaster cross-validation runner.

Controls how the time-series is split for evaluation (split frequency, window type, train/test sizes), and holds project-level metadata such as the project name and output file name.

The date_column is intentionally absent here — it is always read from DataPipelineConfig, which is the single source of truth for dataset structure.

Parameters:
  • domain (Literal["ml"], optional) – Modelling domain identifier. Fixed to "ml"; excluded from parameter tuning. Defaults to "ml".

  • split_freq (str, optional) – Unit for train_size, test_size, and gap. One of "days", "minutes", "hours", "weeks", "months", "years". Defaults to "months".

  • test_size (int, optional) – Number of split_freq units in each test fold. Defaults to 1.

  • train_size (int, optional) – Number of split_freq units in each training fold (rolling window only). Defaults to 1.

  • gap (int, optional) – Number of split_freq units between the end of the training fold and the start of the test fold. Defaults to 0.

  • stride (int | None, optional) – Step size between consecutive splits in split_freq units. None uses test_size as the stride. Defaults to None.

  • window (Literal["expanding", "rolling"], optional) – Cross-validation window strategy. Defaults to "expanding".

  • num_splits (int | None, optional) – Maximum number of CV splits. None uses all available splits. Defaults to None.

  • project_name (str, optional) – Experiment / project name used for logging and output paths. Defaults to "experiment".

  • file_name (str | None, optional) – Output file name. None auto-generates from the project name. Defaults to None.

  • seed (int, optional) – Random seed for reproducibility. Defaults to 42.

  • root_dir (str, optional) – Root directory for output artefacts. Defaults to "../".

  • metrics (tuple[str, ...] | list[str] | None, optional) – Evaluation metrics to compute and log. None uses the runner’s defaults. Defaults to None.
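
How train_size, test_size, gap, stride, window, and num_splits interact can be illustrated on an abstract timeline of n_units split_freq units. The generator below is a sketch of the semantics described above, not twiga's splitter; iter_splits is a hypothetical name, and it assumes the first expanding fold trains on train_size units.

```python
from typing import Iterator, Optional, Tuple


def iter_splits(
    n_units: int,
    train_size: int = 1,
    test_size: int = 1,
    gap: int = 0,
    stride: Optional[int] = None,
    window: str = "expanding",
    num_splits: Optional[int] = None,
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (train_start, train_end, test_start, test_end) half-open ranges."""
    if window not in ("expanding", "rolling"):
        raise ValueError(f"unknown window strategy: {window!r}")
    step = test_size if stride is None else stride  # None falls back to test_size
    if step < 1:
        raise ValueError("stride must be >= 1")
    train_end = train_size  # assumption: the first fold trains on train_size units
    produced = 0
    while train_end + gap + test_size <= n_units:
        if num_splits is not None and produced >= num_splits:
            break
        # Expanding windows always start at 0; rolling windows keep a fixed length.
        train_start = 0 if window == "expanding" else train_end - train_size
        test_start = train_end + gap
        yield (train_start, train_end, test_start, test_start + test_size)
        train_end += step
        produced += 1


expanding = list(iter_splits(5))
rolling = list(iter_splits(5, train_size=2, window="rolling"))
```

On five units with the defaults, the expanding strategy yields four folds whose training range grows from one unit to four, while the rolling strategy with train_size=2 yields three folds of constant training length.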

calib_size: int
calib_source: Literal['train_tail', 'gap', 'test_prefix', 'season_matched']
checkpoints_path: str | None
domain: Literal['ml']
file_name: str | None
gap: int
metrics: tuple[str, ...] | list[str] | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

num_splits: int | None
project_name: str
root_dir: str
seed: int
split_freq: Literal['days', 'minutes', 'hours', 'weeks', 'months', 'years']
stride: int | None
test_size: int
train_size: int
val_size: int
window: Literal['expanding', 'rolling']
class twiga.core.config.BaseModelConfig(**data)

Bases: BaseModel

Shared base configuration for all forecasting models.

Provides the name, domain, and search_space fields that every concrete config is expected to expose, along with a uniform get_optuna_params() that merges fixed config values with any search-space suggestions.

Subclass this to define model-specific configurations:

class MyModelConfig(BaseModelConfig):
    name: Literal["my_model"] = Field(default="my_model", exclude=True)
    hidden_size: int = 128
    dropout: float = 0.3
    search_space: BaseSearchSpace = BaseSearchSpace(
        hidden_size=[64, 128, 256],
        dropout=(0.0, 0.5),
    )
Parameters:
  • name (Literal["base_model"], optional) – Model type identifier. Excluded from parameter tuning. Defaults to "base_model".

  • domain (Literal["nn"], optional) – Modelling domain identifier. Excluded from parameter tuning. Defaults to "nn".

  • search_space (BaseSearchSpace | None, optional) – Hyperparameter search space. When set, its fields are merged into the output of get_optuna_params() for HPO. Defaults to None.

domain: Literal['nn']
get_optuna_params(trial)

Return fixed config values merged with Optuna search-space suggestions.

Fixed parameters come from pydantic.BaseModel.model_dump() (with name and search_space excluded). If a search_space is set, its fields are sampled for trial and override any overlapping fixed values, allowing a single config object to serve both fixed and tuned usage patterns.

Parameters:

trial (Trial) – Active Optuna trial.

Return type:

dict[str, Any]

Returns:

dict[str, Any] – Combined parameter dict ready to pass to the model constructor.
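
The merge rule described for get_optuna_params() (fixed values first, search-space samples override overlapping keys) can be sketched with plain dictionaries. merge_params is a hypothetical helper, and the sampled dict stands in for values that an Optuna trial would suggest.

```python
from typing import Any


def merge_params(
    fixed: dict[str, Any],
    sampled: dict[str, Any],
    excluded: tuple[str, ...] = ("name", "search_space"),
) -> dict[str, Any]:
    """Drop excluded keys from fixed, then let sampled values override."""
    merged = {key: value for key, value in fixed.items() if key not in excluded}
    merged.update(sampled)  # sampled values win on overlapping keys
    return merged


fixed = {"name": "my_model", "hidden_size": 128, "dropout": 0.3, "search_space": None}
sampled = {"hidden_size": 256}  # stands in for values suggested by a trial
params = merge_params(fixed, sampled)
```

Here hidden_size takes the sampled value while dropout keeps its fixed value, which is what lets one config object serve both fixed and tuned runs.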

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: Literal['base_model']
search_space: BaseSearchSpace | None
to_estimator_params()

Return a parameter dict safe to pass directly to the underlying estimator.

Excludes Twiga-internal fields (name, domain, search_space) and maps the unified seed field to the library-specific keyword defined by _LIBRARY_SEED_KEY (e.g. "random_state" for sklearn estimators).

Return type:

dict[str, Any]
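
The filtering and seed renaming described for to_estimator_params() can be sketched on a plain dictionary. The function below is illustrative only: seed_key stands in for the value of _LIBRARY_SEED_KEY, and the config dict is a made-up example.

```python
from typing import Any

# Twiga-internal fields named in the description above.
INTERNAL_FIELDS = ("name", "domain", "search_space")


def estimator_params(config: dict[str, Any], seed_key: str = "random_state") -> dict[str, Any]:
    """Strip internal fields and rename the unified seed to seed_key."""
    params = {key: value for key, value in config.items() if key not in INTERNAL_FIELDS}
    if "seed" in params:
        params[seed_key] = params.pop("seed")
    return params


# Hypothetical config contents for a linear model.
config = {"name": "linear", "domain": "ml", "search_space": None, "seed": 42, "alpha": 1.0}
clean = estimator_params(config)
```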

class twiga.core.config.ConformalConfig(**data)

Bases: BaseModel

Configuration for conformal prediction methods.

Supports three conformal predictors - residual-based, quantile-based, and residual-fitting - each with compatible nonconformity score types.

Parameters:
  • method (Literal["residual", "quantile", "residual-fitting"], optional) –

    Conformal prediction method:

    • "residual" - nonconformity scores based on absolute residuals |y - ŷ|.

    • "quantile" - quantile regression for prediction intervals.

    • "residual-fitting" - fits a secondary model to predict residuals for adaptive interval widths.

    Defaults to "residual".

  • score_type (str, optional) – Nonconformity score type. "scaled" / "unscaled" for quantile method; "res" / "sign-res" for residual-based methods. Defaults to "res".

  • alpha (float, optional) – Significance level controlling the confidence level (1 - alpha) of the prediction intervals. Must be in (0, 1). For example alpha=0.1 → 90 % coverage. Defaults to 0.1.

Raises:

ValueError – If method="quantile" is combined with a residual score type, or if a residual method is combined with a quantile score type.
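
For the "residual" method, the link between alpha and interval width can be illustrated with the standard split-conformal recipe: take absolute residuals on calibration data and use their finite-sample (1 - alpha) quantile as a symmetric radius. This is a sketch of the general technique, not necessarily twiga's exact implementation; conformal_radius and the calibration values are made up for illustration.

```python
import math


def conformal_radius(residual_scores: list[float], alpha: float) -> float:
    """Finite-sample conformal quantile of absolute residual scores."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    n = len(residual_scores)
    if n == 0:
        raise ValueError("at least one calibration score is required")
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residual_scores)[rank - 1]


# Made-up calibration data: ground truth and point forecasts.
calibration_truth = [10.0, 12.0, 9.0, 11.0, 13.0, 10.5, 9.5, 12.5, 11.5]
calibration_pred = [10.5, 11.0, 9.5, 11.0, 12.0, 10.0, 10.0, 12.0, 13.0]
scores = [abs(y - y_hat) for y, y_hat in zip(calibration_truth, calibration_pred)]

radius = conformal_radius(scores, alpha=0.1)
new_forecast = 11.0
lower, upper = new_forecast - radius, new_forecast + radius
```

With nine calibration points and alpha=0.1 the rank is 9, so the radius is the largest residual. Fewer calibration points would leave the interval unbounded, which is one reason very small alpha values need large calibration sets.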

Examples

>>> ConformalConfig(method="residual", score_type="res", alpha=0.1)
ConformalConfig(method='residual', score_type='res', alpha=0.1)
>>> ConformalConfig(method="quantile", score_type="scaled", alpha=0.05)
ConformalConfig(method='quantile', score_type='scaled', alpha=0.05)
alpha: Annotated[float, FieldInfo(annotation=NoneType, required=False, default=0.1, description='Significance level for prediction intervals. Controls coverage as (1 - alpha). Example: alpha=0.1 → 90% prediction intervals.', metadata=[Gt(gt=0.0), Lt(lt=1.0)])]
calib_method: Annotated[Literal['uniform', 'temporal'], FieldInfo(annotation=NoneType, required=False, default='uniform', description="Quantile estimation method for calibration scores. 'uniform': standard empirical quantile (equal weight per sample). 'temporal': exponentially weighted quantile (Tibshirani et al. 2019); recent calibration samples receive higher weight, reducing the influence of seasonally misaligned older samples. Only used for 'residual-fitting'.")]
classmethod from_coverage(coverage, **kwargs)

Construct from a coverage level rather than a significance level.

Parameters:
  • coverage (float) – Desired coverage probability, e.g. 0.9 for 90 % intervals. Must be in (0, 1). Converted to alpha = 1 - coverage.

  • **kwargs – Any additional ConformalConfig fields (e.g. method, score_type).

Return type:

ConformalConfig

Returns:

ConformalConfig with alpha = 1 - coverage.

Example

>>> cfg = ConformalConfig.from_coverage(0.9, method="residual")
>>> cfg.alpha
0.1
lambda_: Annotated[float, FieldInfo(annotation=NoneType, required=False, default=1.0, description="Exponential decay rate for temporal calibration weighting. Ignored when calib_method='uniform'. lambda_=0 recovers uniform weights; larger values concentrate weight on the most recent calibration samples.", metadata=[Ge(ge=0.0)])]
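
The effect of lambda_ on temporal calibration can be illustrated with an exponentially weighted quantile. The sketch below assumes weights of the form exp(-lambda_ * age) with age normalised to [0, 1]; that normalisation, and the helper names, are assumptions made for illustration, so twiga's actual weighting may differ.

```python
import math


def temporal_weights(n: int, lambda_: float) -> list[float]:
    """Normalised weights; index n - 1 is the most recent calibration sample."""
    # Assumption: age runs from 1 (oldest) down to 0 (newest).
    ages = [(n - 1 - i) / (n - 1) if n > 1 else 0.0 for i in range(n)]
    raw = [math.exp(-lambda_ * age) for age in ages]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_quantile(scores: list[float], weights: list[float], level: float) -> float:
    """Smallest score whose cumulative normalised weight reaches level."""
    pairs = sorted(zip(scores, weights))
    total = sum(weights)
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight / total
        if cumulative >= level - 1e-12:  # tolerance for float accumulation
            return score
    return pairs[-1][0]


scores = [1.0, 2.0, 3.0, 4.0, 5.0]  # oldest to newest; recent errors are larger
uniform_q = weighted_quantile(scores, temporal_weights(5, 0.0), level=0.8)
temporal_q = weighted_quantile(scores, temporal_weights(5, 5.0), level=0.8)
```

With lambda_=0 every sample has weight 0.2 and the result matches the plain empirical quantile. A larger lambda_ shifts weight to the newest scores, so the quantile tracks the recent, larger errors.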
method: Annotated[Literal['residual', 'quantile', 'residual-fitting'], FieldInfo(annotation=NoneType, required=False, default='residual', description="Conformal prediction method. 'residual': absolute residual scores. 'quantile': quantile regression intervals. 'residual-fitting': secondary model predicts residuals for adaptive widths.")]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

score_type: Annotated[Literal['scaled', 'unscaled', 'res', 'sign-res'], FieldInfo(annotation=NoneType, required=False, default='res', description="Nonconformity score type. 'scaled'/'unscaled': for quantile method. 'res'/'sign-res': for residual-based methods.")]
validate_method_score_compatibility()

Validate that method and score_type are compatible.

Return type:

ConformalConfig

classmethod warn_extreme_alpha(v)

Warn if alpha is likely to produce degenerate intervals.

Return type:

float

class twiga.core.config.NeuralModelConfig(**data)

Bases: BaseModelConfig

Configuration for neural network-based forecasting models.

Extends BaseModelConfig with training infrastructure fields and a shared three-dict HPO system for optimizer, scheduler, and batch-size search. See the module docstring for a full explanation of the search space design.

The optimizer and scheduler are selected via optimizer_type and lr_scheduler_type. Both are captured by save_hyperparameters() in BaseNeuralModel at training time, so they must be declared as fields here.

Optional fine-grained overrides can be supplied via optimizer_params and scheduler_params. When provided they are merged into the corresponding entry of BaseNeuralModel.OPTIMIZERS / BaseNeuralModel.SCHEDULERS, allowing partial overrides (e.g. only lr) without replacing the full dict.

Parameters:
  • name (Literal["neural_model"], optional) – Model type identifier. Defaults to "neural_model".

  • domain (Literal["nn"], optional) – Modelling domain identifier. Defaults to "nn".

  • rich_progress_bar (bool, optional) – Enable rich progress bars. Defaults to True.

  • drop_last (bool, optional) – Drop the last incomplete batch. Defaults to True.

  • num_workers (int, optional) – DataLoader worker count. Defaults to 8.

  • batch_size (int, optional) – Training batch size. Defaults to 64.

  • pin_memory (bool, optional) – Pin memory for faster GPU transfer. Defaults to True.

  • max_epochs (int, optional) – Maximum training epochs. Defaults to 10.

  • early_stop_patience (int | None, optional) – Early-stopping patience in epochs. None disables early stopping. Defaults to 10.

  • resume_training (bool, optional) – Resume from last checkpoint. Defaults to True.

  • seed (int, optional) – Positive integer random seed. Defaults to 42.

  • metric (Literal["mae", "mse", "smape"], optional) – Validation metric. Defaults to "mae".

  • optimizer_type (Literal[...], optional) – Native torch.optim optimizer. Defaults to "adamw".

  • lr_scheduler_type (Literal[...], optional) – Native torch.optim.lr_scheduler class. Defaults to "multi_step".

  • optimizer_params (dict | None, optional) – Partial override for the selected optimizer’s default params. Defaults to None.

  • scheduler_params (dict | None, optional) – Partial override for the selected scheduler’s default params. Defaults to None.

BASE_TRAINING_SEARCH_SPACE: ClassVar[BaseSearchSpace] = BaseSearchSpace(optimizer_type=['adam', 'adamw'], lr_scheduler_type=['warmup_cosine', 'multi_step', 'reduce_on_plateau'], batch_size=[8, 16, 32, 64])
OPTIMIZER_PARAM_SEARCH: ClassVar[dict[str, BaseSearchSpace]] = {'adam': BaseSearchSpace(lr=(0.0001, 0.01), weight_decay=(1e-07, 0.0001)), 'adamw': BaseSearchSpace(lr=(0.0001, 0.01), weight_decay=(1e-06, 0.001)), 'muon': BaseSearchSpace(lr=(0.001, 0.1), momentum=(0.9, 0.99), ns_steps=[4, 6, 8])}
SCHEDULER_PARAM_SEARCH: ClassVar[dict[str, BaseSearchSpace]] = {'multi_step': BaseSearchSpace(prob_decay_1=(0.3, 0.6), prob_decay_2=(0.7, 0.95), gamma=[0.1, 0.2, 0.5]), 'reduce_on_plateau': BaseSearchSpace(factor=[0.1, 0.2, 0.5], prob_patience=(0.05, 0.2)), 'warmup_cosine': BaseSearchSpace(warmup_epochs=[3, 5, 10], eta_min=(1e-07, 1e-05))}
batch_size: int
domain: Literal['nn']
drop_last: bool
early_stop_patience: int | None
classmethod from_data_config(data_config, **kwargs)

Create a config instance with dimensions derived from a DataPipelineConfig.

Parameters:
  • data_config (DataPipelineConfig) – Pipeline config providing feature counts and sequence dimensions.

  • **kwargs – Additional fields forwarded to the constructor, allowing any field to be overridden at instantiation time.

Returns:

NeuralModelConfig – Populated config instance.

Raises:
  • TypeError – If data_config.target_feature is not str or list[str].

  • AttributeError – If data_config is missing forecast_horizon.

get_optuna_params(trial)

Standard HPO sampling for all neural models.

Combines child-specific architecture parameters with the standardized conditional optimizer and scheduler search space.

Return type:

dict

gradient_clip_val: float | None
lr_scheduler_type: Literal['step', 'multi_step', 'multiplicative', 'exponential', 'constant', 'linear_decay', 'polynomial', 'cosine_annealing', 'cosine_annealing_lr', 'cyclic', 'reduce_on_plateau', 'one_cycle', 'warmup_multi_step', 'warmup_cosine']
max_epochs: int
metric: Literal['mae', 'mse', 'smape']
model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

monitor: Literal['loss', 'mae', 'mse', 'smape', 'sigma_loss'] | None
name: Literal['neural_model']
num_workers: int
optimizer_params: dict | None
optimizer_type: Literal['adam', 'adamw', 'nadam', 'radam', 'adamax', 'adafactor', 'adagrad', 'adadelta', 'rmsprop', 'rprop', 'asgd', 'sgd', 'muon']
pin_memory: bool
resume_training: bool
rich_progress_bar: bool
classmethod sample_training_params(trial)

Sample optimizer, scheduler, and batch-size using BaseSearchSpace logic.

Return type:

dict

scheduler_params: dict | None
seed: int
class twiga.core.config.BaseSearchSpace(**data)

Bases: BaseModel

Pydantic model for validating hyperparameter optimisation search spaces.

Each field must be either:

  • A tuple[float, float] or tuple[int, int] representing a continuous range (low, high). Float ranges spanning at least one order of magnitude (high / low >= 10) are sampled on a log scale automatically.

  • A list of at least one categorical value.
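
The two field shapes and the log-scale rule (high / low >= 10) can be sketched as a small classifier. describe_field is a hypothetical helper; how ranges with low <= 0 are handled is not stated in this reference, so the sketch assumes they stay on a linear scale.

```python
from typing import Any


def describe_field(value: Any) -> dict[str, Any]:
    """Classify a search-space entry as categorical, int range, or float range."""
    if isinstance(value, list):
        if not value:
            raise ValueError("categorical lists need at least one value")
        return {"kind": "categorical", "choices": value}
    if isinstance(value, tuple) and len(value) == 2:
        low, high = value
        if low >= high:  # assumption: ranges must be strictly increasing
            raise ValueError("range must satisfy low < high")
        if isinstance(low, int) and isinstance(high, int):
            return {"kind": "int", "low": low, "high": high}
        # Assumption: the ratio test is undefined for low <= 0, so stay linear.
        use_log = low > 0 and high / low >= 10
        return {"kind": "float", "low": low, "high": high, "log": use_log}
    raise ValueError(f"unsupported search-space value: {value!r}")


learning_rate = describe_field((1e-4, 1e-2))  # two orders of magnitude
dropout = describe_field((0.0, 0.5))          # starts at zero
latent = describe_field([64, 128, 256])       # categorical choices
```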

The class uses extra="allow" so that concrete search spaces can be defined inline without subclassing:

space = BaseSearchSpace(
    latent_size=[64, 128, 256],
    dropout=(0.0, 0.5),
)
Parameters:

**kwargs – Any keyword argument whose value is a valid range tuple or categorical list.

Examples

>>> space = BaseSearchSpace(lr=(1e-4, 1e-2), activation=["relu", "tanh"])
>>> params = space.get_optuna_params(trial, prefix="mlp")
get_optuna_params(trial, prefix='')

Generate Optuna parameter suggestions for all fields.

Parameters:
  • trial (Trial) – Active Optuna trial.

  • prefix (str) – Prefix prepended to each parameter name in the trial (e.g. the model name) to avoid collisions when multiple search spaces are sampled in the same trial. Defaults to "".

Return type:

dict[str, Any]

Returns:

dict[str, Any] – Mapping of field names (without prefix) to their sampled values.
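
The role of prefix can be shown with a stand-in trial object that records the names it is asked for. RecordingTrial only borrows the method name suggest_categorical from Optuna's Trial; the underscore separator between prefix and field name is an assumption.

```python
from typing import Any


class RecordingTrial:
    """Stand-in for an Optuna trial that records every requested name."""

    def __init__(self) -> None:
        self.seen: list[str] = []

    def suggest_categorical(self, name: str, choices: list[Any]) -> Any:
        self.seen.append(name)
        return choices[0]  # deterministic pick keeps the example reproducible


def sample_with_prefix(space: dict[str, list[Any]], trial: RecordingTrial, prefix: str = "") -> dict[str, Any]:
    """Sample each field under a prefixed trial name; return unprefixed keys."""
    sampled = {}
    for field, choices in space.items():
        # Assumption: prefix and field name are joined with an underscore.
        trial_name = f"{prefix}_{field}" if prefix else field
        sampled[field] = trial.suggest_categorical(trial_name, choices)
    return sampled


trial = RecordingTrial()
mlp = sample_with_prefix({"activation": ["relu", "tanh"]}, trial, prefix="mlp")
rnn = sample_with_prefix({"activation": ["tanh", "relu"]}, trial, prefix="rnn")
```

Both search spaces define an activation field, yet the trial sees two distinct names, while each returned dict still uses the plain field name expected by the model config.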

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

validate_against(config)

Raise ValueError if any search space field name is not present on config.

Catches typos in search space definitions early - before an Optuna trial is run - so that mis-spelled field names produce a clear error instead of silently sampling a parameter that never gets applied.

Parameters:

config (BaseModel) – The model config instance (or class) whose fields define the valid parameter names.

Raises:

ValueError – If one or more field names in this search space do not exist on config.

Return type:

None

Examples

>>> space = BaseSearchSpace(hiddn_dim=[64, 128])  # typo!
>>> space.validate_against(my_model_config)
Traceback (most recent call last):
    ...
ValueError: Search space contains unknown fields: {'hiddn_dim'}. ...
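
The check itself amounts to a set difference between the search-space keys and the config's field names. A minimal sketch, using a standard-library dataclass as a stand-in for a Pydantic config (ToyConfig and check_fields are hypothetical):

```python
from dataclasses import dataclass, fields


@dataclass
class ToyConfig:
    """Stand-in for a model config; not a twiga class."""

    hidden_dim: int = 128
    dropout: float = 0.3


def check_fields(space: dict, config) -> None:
    """Raise ValueError if space names a field that config does not define."""
    known = {f.name for f in fields(config)}  # works for instances and classes
    unknown = set(space) - known
    if unknown:
        raise ValueError(f"Search space contains unknown fields: {sorted(unknown)}")


check_fields({"hidden_dim": [64, 128]}, ToyConfig())  # passes silently

try:
    check_fields({"hiddn_dim": [64, 128]}, ToyConfig)  # typo is caught early
except ValueError as exc:
    message = str(exc)
```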
validate_search_space()

Validate all fields have valid types and structure.

Return type:

BaseSearchSpace


Registry

twiga.forecaster.registry.get_model(name, domain=None)

Lazily load the model and config classes from models/ml/ or models/nn/.

Parameters:
  • name (str) – The name of the model (e.g., "linear", "lstm").

  • domain (str | None) – The specific domain to look in ("ml" or "nn"). If None, searches both.

Return type:

tuple[type, type]

Returns:

tuple[Type, Type] – A tuple of (model_class, config_class).

Raises:

ValueError – If the model is not found in the specified or default domains.
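
The lazy-loading pattern behind a registry lookup of this kind can be sketched with importlib. Everything in the table below is hypothetical: standard-library classes stand in for model and config classes so that the example runs without twiga installed, and the real registry layout may differ.

```python
import importlib
from typing import Optional

# Hypothetical registry: domain -> name -> (module, model attr, config attr).
_REGISTRY = {
    "ml": {"linear": ("json", "JSONEncoder", "JSONDecoder")},
    "nn": {"lstm": ("collections", "OrderedDict", "Counter")},
}


def load_model(name: str, domain: Optional[str] = None) -> tuple[type, type]:
    """Return (model_class, config_class), importing the module on demand."""
    domains = [domain] if domain is not None else list(_REGISTRY)
    for candidate in domains:
        entry = _REGISTRY.get(candidate, {}).get(name)
        if entry is not None:
            module_name, model_attr, config_attr = entry
            module = importlib.import_module(module_name)  # deferred import
            return getattr(module, model_attr), getattr(module, config_attr)
    raise ValueError(f"Model {name!r} not found in domains {domains}")


model_cls, config_cls = load_model("linear")       # searches both domains
nn_model_cls, _ = load_model("lstm", domain="nn")  # restricted to one domain
```

Deferring the import until lookup means that optional dependencies of one model family are only needed when a model from that family is actually requested.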


Evaluation

twiga.core.metrics.point.evaluate_point_forecast(result, metric_names=None, axis=1)

Evaluate point forecasts by computing daily pointwise metrics.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth set, kind=ForecastKind.POINT.

  • metric_names (list[str] | None) – Metric names to compute. When None all supported point metrics are computed.

  • axis (int | None) – Axis along which to compute aggregate metrics. If None, metrics that require an axis will use their default behavior.

Return type:

DataFrame

Returns:

DataFrame of per-day, per-target metrics indexed by daily timestamp.

twiga.core.metrics.interval.evaluate_interval_forecast(result, alpha=0.01, true_nmpi=None, spread='std', nmpi_scale='range', axis=1, metric_names=None)

Evaluate interval forecasts by computing daily point and interval metrics.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth, lower, and upper set, kind=ForecastKind.INTERVAL.

  • alpha (float) – Significance level used for Winkler score and coverage computations. Must be in (0, 1). Defaults to 0.01.

  • true_nmpi (float | None) – Override for κ — absolute spread of the target used as the CWE reference numerator. When None, derived from spread.

  • spread (Literal['iqr', 'mad', 'std']) – Spread measure for the CWE reference κ. "iqr", "mad", or "std" (default). See get_interval_metrics().

  • nmpi_scale (Literal['range', 'max', 'mean', 'median']) – Denominator R for NMPI and κ/R. "range" (default), "max", "mean", or "median".

  • axis (int | None) – Axis along which to compute aggregate metrics.

  • metric_names (list[str] | None) – List of interval metric names to compute.

Return type:

DataFrame

Returns:

DataFrame of per-day, per-target point and interval metrics indexed by daily timestamp.
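
The role of alpha in the Winkler score and in coverage can be illustrated with the textbook definitions. The functions below are a sketch of those standard formulas for a single target, not twiga's implementation, and the function names are made up for illustration.

```python
def winkler_score(y: float, lower: float, upper: float, alpha: float) -> float:
    """Interval width plus a 2/alpha penalty for each unit of miss."""
    width = upper - lower
    if y < lower:
        return width + (2.0 / alpha) * (lower - y)
    if y > upper:
        return width + (2.0 / alpha) * (y - upper)
    return width


def empirical_coverage(truth: list[float], lowers: list[float], uppers: list[float]) -> float:
    """Fraction of observations that fall inside their interval."""
    hits = sum(1 for y, lo, hi in zip(truth, lowers, uppers) if lo <= y <= hi)
    return hits / len(truth)


inside = winkler_score(5.0, 4.0, 6.0, alpha=0.1)   # covered: score equals the width
outside = winkler_score(3.0, 4.0, 6.0, alpha=0.1)  # missed by 1: heavy penalty
coverage = empirical_coverage([5.0, 3.0], [4.0, 4.0], [6.0, 6.0])
```

A smaller alpha makes each miss more expensive, so the alpha passed here should match the nominal level of the intervals being scored.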

twiga.core.metrics.quantile.tail_pinball_score(true, quantile_preds, quantile_levels, tail_taus=(0.05, 0.1, 0.9, 0.95), quantile_axis=None, axis=None)

Mean pinball loss restricted to tail quantile levels only.

Standard pinball averages uniformly over all quantile levels, so interior levels (near the median) dominate the aggregate and mask tail skill differences. This function selects only the subset of levels closest to tail_taus and computes the pinball loss on those alone, giving a metric that is directly sensitive to the quality of tail-concentrated evaluation grids (e.g. the Kumaraswamy proposal) relative to uniform ones.

For N=9 quantiles the four default targets (0.05, 0.10, 0.90, 0.95) are approximated by the nearest available levels. If the available levels do not reach below 0.10 or above 0.90 the nearest boundary level is used, and a warning is logged.

Parameters:
  • true (ndarray) – Ground truth values, any shape.

  • quantile_preds (ndarray) – Predicted quantiles with one axis holding the quantile dimension.

  • quantile_levels (ndarray | list[float] | tuple[float, ...]) – 1-D array of quantile levels in (0, 1).

  • tail_taus (tuple[float, ...]) – Target tail levels. Each entry maps to the nearest available level in quantile_levels.

  • quantile_axis (int | None) – Axis of quantile_preds holding quantiles. Inferred if None.

  • axis (int | None) – Aggregation axis over sample dimensions. Returns a scalar when None.

Return type:

float | ndarray

Returns:

Scalar tail pinball loss if axis=None, otherwise an array.
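
The selection of tail levels and the restricted average can be sketched in plain Python for a single target. This is an illustration under stated assumptions: duplicate matches are collapsed so that each selected level counts once, no boundary warning is logged, and the helper names are hypothetical.

```python
def pinball_loss(y: float, q: float, tau: float) -> float:
    """Pinball (quantile) loss of prediction q at level tau."""
    diff = y - q
    return max(tau * diff, (tau - 1.0) * diff)


def nearest_level_indices(levels: list[float], tail_taus: tuple[float, ...]) -> list[int]:
    """Index of the closest available level per target, duplicates removed."""
    picked: list[int] = []
    for tau in tail_taus:
        idx = min(range(len(levels)), key=lambda i: abs(levels[i] - tau))
        if idx not in picked:  # assumption: each level is counted once
            picked.append(idx)
    return picked


def tail_pinball(truth, quantile_preds, levels, tail_taus=(0.05, 0.1, 0.9, 0.95)) -> float:
    """Mean pinball loss over the selected tail levels only.

    truth holds n values; quantile_preds holds n rows of len(levels) quantiles.
    """
    idxs = nearest_level_indices(levels, tail_taus)
    losses = [
        pinball_loss(y, row[i], levels[i])
        for y, row in zip(truth, quantile_preds)
        for i in idxs
    ]
    return sum(losses) / len(losses)


levels = [i / 10 for i in range(1, 10)]  # N=9 levels: 0.1, 0.2, ..., 0.9
selected = nearest_level_indices(levels, (0.05, 0.1, 0.9, 0.95))
score = tail_pinball([10.0], [[8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0]], levels)
```

For the nine-level grid, the four default targets collapse onto the two boundary levels 0.1 and 0.9, which matches the nearest-level behaviour described above for N=9.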


Forecast Results (experimental)

class twiga.forecaster.result.ForecastResult(timestamps, loc, targets, model_name, kind, ground_truth=None, scale=None, quantiles=None, quantile_levels=None, conf_level=None, samples=None, lower=None, upper=None, inference_time=0.0)

Bases: object

Container for one model’s forecast output.

Variables:
  • timestamps – shape (n_batch, n_horizon, n_targets)

  • loc – point predictions (mean/median), shape (n_batch, n_horizon, n_targets)

  • targets – ordered list of target variable names

  • model_name – human-readable model identifier

  • kind – determines which optional arrays are expected and how to convert

  • ground_truth – optional, same shape as loc

  • scale – parametric std-dev / scale, same shape as loc

  • quantiles – shape (n_batch, n_q, n_horizon, n_targets)

  • quantile_levels – corresponding probability levels (e.g. [0.1, 0.5, 0.9])

  • samples – shape (n_batch, n_samples, n_horizon, n_targets)

  • lower – lower bound, same shape as loc

  • upper – upper bound, same shape as loc

  • inference_time – inference duration in seconds

  • conf_level

  • metric_name

conf_level: list[float] | ndarray | None = None
evaluate(ground_truth=None, **kwargs)

Evaluate forecast against ground truth using kind-appropriate metrics.

Forwards to twiga.core.metrics.evaluate_forecast().

Parameters:
  • ground_truth (ndarray | None) – shape (n_batch, n_horizon, n_targets). When omitted the ground_truth stored on the result is used.

  • **kwargs – forwarded to the underlying evaluate function.

Return type:

DataFrame

Returns:

DataFrame of per-day, per-target metrics.

Raises:

ValueError – if no ground truth is available.

ground_truth: ndarray | None = None
inference_time: float = 0.0
kind: ForecastKind
loc: ndarray
lower: ndarray | None = None
model_name: str
quantile_levels: list[float] | ndarray | None = None
quantiles: ndarray | None = None
samples: ndarray | None = None
scale: ndarray | None = None
targets: list[str]
timestamps: ndarray
to_dataframe(fmt='long')

Convert forecast to tidy DataFrame.

Always includes: timestamp, target, model, forecast. Optional: actual (when ground_truth is present).

Additional columns depend on forecast kind:

  • POINT: no extra columns

  • PARAMETRIC: scale

  • INTERVAL: lower, upper

  • QUANTILE (fmt="wide"): q_0.10, q_0.50, …

  • QUANTILE (fmt="long"): q_level, quantile_forecast

  • SAMPLES: q_0.10, q_0.50, q_0.90 (empirical quantiles)

Parameters:

fmt (str) – "long" (default) or "wide"; only affects QUANTILE

Return type:

DataFrame

Returns:

pandas DataFrame in long or wide format

Raises:

ValueError – if fmt is invalid

upper: ndarray | None = None
class twiga.forecaster.result.ForecastCollection(results=<factory>)

Bases: object

Collection of ForecastResult objects from multiple models.

add(result)

Add or replace result using its model_name as key.

Return type:

None

evaluate(**kwargs)

Evaluate all models and return a combined metrics DataFrame.

Calls ForecastResult.evaluate() on each result and concatenates the output, adding a "Model" column derived from each result’s model_name. Ground truth must be attached to each result (i.e. forecast() must have been called with test data that contains the target column).

Parameters:

**kwargs – Forwarded to each ForecastResult.evaluate() call (e.g. metric_names, freq).

Return type:

DataFrame

Returns:

Combined metrics DataFrame with a "Model" column.

Raises:

ValueError – If the collection is empty or any result lacks ground truth.

property model_names: list[str]
results: dict[str, ForecastResult]
to_dataframe(fmt='long')

Concatenate all model forecasts into one DataFrame.

Parameters:

fmt (str) – passed to each ForecastResult.to_dataframe()

Return type:

DataFrame

Returns:

Combined long-format DataFrame

Raises:

ValueError – if collection is empty

class twiga.forecaster.result.ForecastKind(*values)

Bases: StrEnum

Supported forecast output types.

Values are strings and can be used directly as dict keys.

INTERVAL = 'interval'
PARAMETRIC = 'parametric'
POINT = 'point'
QUANTILE = 'quantile'
SAMPLES = 'samples'

Ensemble (experimental)#

twiga.forecaster.ensemble.compute_ensemble_predictions(predictions, model_names, ensemble_strategy, ensemble_weights=None)

Generate ensemble predictions by combining predictions from multiple models.

Parameters:
  • predictions (list[ndarray]) – List of model predictions, where each prediction is a 3D NumPy array with shape (num_samples, horizon, num_targets).

  • model_names (list[str]) – List of model names corresponding to the predictions.

  • ensemble_strategy (EnsembleStrategy) – Strategy for combining predictions, one of EnsembleStrategy.MEAN, EnsembleStrategy.MEDIAN, or EnsembleStrategy.WEIGHTED.

  • ensemble_weights (dict[str, float] | None) – Dictionary mapping model names to their weights for the weighted ensemble strategy. Required if ensemble_strategy is EnsembleStrategy.WEIGHTED. Defaults to None.

Return type:

ndarray

Returns:

A 3D NumPy array of ensemble predictions with shape (num_samples, horizon, num_targets).

Raises:

ValueError – If predictions is empty, prediction shapes are inconsistent, weights are required but not provided, the number of weights does not match the number of models, or the ensemble strategy is unknown.
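
A minimal sketch of the three combination strategies, written on nested Python lists rather than NumPy arrays. It is not the library's code: the strategy strings "mean", "median", and "weighted" stand in for the EnsembleStrategy members, and dividing by the sum of the weights is an assumption about how weights are normalised.

```python
from statistics import mean, median


def combine(values, strategy, weights=None):
    """Combine one value per model into a single ensemble value."""
    if strategy == "mean":
        return mean(values)
    if strategy == "median":
        return median(values)
    if strategy == "weighted":
        total = sum(weights)  # assumption: weights are normalised by their sum
        return sum(w * v for w, v in zip(weights, values)) / total
    raise ValueError(f"unknown ensemble strategy: {strategy!r}")


def ensemble_sketch(predictions, model_names, strategy, ensemble_weights=None):
    """Element-wise ensemble over nested lists shaped (samples, horizon, targets)."""
    if not predictions:
        raise ValueError("predictions is empty")
    weights = None
    if strategy == "weighted":
        if ensemble_weights is None:
            raise ValueError("weights are required for the weighted strategy")
        if len(ensemble_weights) != len(model_names):
            raise ValueError("number of weights does not match number of models")
        weights = [ensemble_weights[name] for name in model_names]

    def walk(nodes):
        # Recurse until the leaves (floats) are reached, then combine them.
        if isinstance(nodes[0], list):
            if len({len(n) for n in nodes}) != 1:
                raise ValueError("prediction shapes are inconsistent")
            return [walk(list(group)) for group in zip(*nodes)]
        return combine(nodes, strategy, weights)

    return walk(list(predictions))


a = [[[1.0, 2.0]]]  # shape (1, 1, 2)
b = [[[3.0, 6.0]]]
avg = ensemble_sketch([a, b], ["a", "b"], "mean")
wavg = ensemble_sketch([a, b], ["a", "b"], "weighted", {"a": 1.0, "b": 3.0})
```

The output keeps the input shape, matching the return description above.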


MLOps core (twiga.mlops)#

Streamlit-free building blocks for workspace orchestration, dataset transforms, storage, capture-data access, and the training model registry. The Streamlit dashboard built on top of these ships as a demo under examples/mlops/, not in the installed wheel.

twiga.mlops.workspace.create_workspace(*, name, raw_df, dataset_filename, data_setup, catalog, workspaces_dir=PosixPath('mlops_demo/workspaces'))

Create a new workspace and persist its dataset. Returns the slug.

Raises WorkspaceNameTakenError if name collides. Does not load the workspace — callers that want the splits hydrated should follow with load_workspace_data().

Order is: write dataset.parquet first, then catalog INSERT. A crash between the two leaves an orphan folder that the next attempt can safely overwrite — but never a catalog row without a dataset.

Return type:

str
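
The write-then-register ordering described above can be illustrated with the standard library alone. The table schema, the function name, and the placeholder payload are hypothetical; the point is only that the file is written before the catalog row, so a crash in between cannot leave a row without a dataset.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_workspace_sketch(conn, workspaces_dir, slug, name, payload):
    """Write the dataset file first, then insert the catalog row."""
    folder = Path(workspaces_dir) / slug
    folder.mkdir(parents=True, exist_ok=True)
    # Step 1: persist the dataset. A crash after this line leaves only an
    # orphan folder, which a retry can overwrite.
    (folder / "dataset.parquet").write_bytes(payload)
    # Step 2: register the workspace. The UNIQUE constraint rejects duplicates.
    with conn:
        conn.execute("INSERT INTO workspaces (slug, name) VALUES (?, ?)", (slug, name))
    return slug


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workspaces (slug TEXT PRIMARY KEY, name TEXT UNIQUE)")
root = tempfile.mkdtemp()
created = create_workspace_sketch(conn, root, "demo", "Demo", b"placeholder bytes")
```

In this sketch a second call with the same name fails with sqlite3.IntegrityError, which plays the role that WorkspaceNameTakenError plays in the real catalog.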

twiga.mlops.workspace.load_workspace_data(slug, catalog)

Load the workspace identified by slug: set the MLflow tracking URI, rebuild the train/test splits, and return them together with the stored configuration.

Raises WorkspaceNotFoundError if the slug is unknown, or WorkspaceArtifactMissingError if the dataset is gone. No session state is touched.

Return type:

LoadedWorkspace

class twiga.mlops.workspace.LoadedWorkspace(slug, train_df, test_df, data_config, train_config, setup, dataset_filename, pipeline_state)

Bases: object

Everything a caller needs to hydrate after loading a workspace.

data_config: DataPipelineConfig
dataset_filename: str
pipeline_state: dict
setup: dict
slug: str
test_df: DataFrame
train_config: ExperimentConfig
train_df: DataFrame
twiga.mlops.workspace.list_workspaces(catalog)
Return type:

list[WorkspaceSummary]

twiga.mlops.data.parse_dataset(file_or_path)

Read a parquet or CSV file and return the raw DataFrame.

Dispatches on the filename extension. No column filtering, no datetime coercion, no splitting — just bytes in, frame out.

Return type:

DataFrame

twiga.mlops.data.split_raw_frame(raw_df, *, timestamp_col, target_col, exog_cols, train_cutoff, test_start)

Filter to selected columns, normalise timestamps, split on cutoffs.

Returns (train_df, test_df). Both frames have a timestamp column (renamed from timestamp_col if needed), are TZ-naive, and are de-duplicated.

Return type:

tuple[DataFrame, DataFrame]

twiga.mlops.data.build_configs_from_setup(setup, *, checkpoints_path)

Build the DataPipelineConfig + ExperimentConfig pair from a setup dict.

checkpoints_path is workspace-scoped and supplied by the caller — this module never resolves filesystem paths on its own.

Return type:

tuple[DataPipelineConfig, ExperimentConfig]

class twiga.mlops.catalog.Catalog(db_path)

Bases: object

SQLite-backed workspace catalog.

close()
Return type:

None

delete(slug)
Return type:

None

existing_slugs()
Return type:

set[str]

get(slug)
Return type:

WorkspaceRow | None

insert(row)
Return type:

None

list_all()
Return type:

list[WorkspaceRow]

name_exists(name)
Return type:

bool

rename(slug, new_name)
Return type:

None

touch_last_opened(slug)
Return type:

None

update_data_setup(slug, setup, *, dataset_filename=None, dataset_hash=None)
Return type:

None

update_pipeline_state(slug, state)
Return type:

None

class twiga.mlops.catalog.WorkspaceRow(slug, name, storage_root, tracking_uri, dataset_filename, dataset_hash, data_setup, model_config, seasonal_config, pipeline_state, created_at, updated_at, last_opened_at)

Bases: object

created_at: datetime
data_setup: dict
dataset_filename: str | None
dataset_hash: str | None
last_opened_at: datetime
model_config: dict
name: str
pipeline_state: dict
seasonal_config: dict
slug: str
storage_root: str
tracking_uri: str
updated_at: datetime
exception twiga.mlops.catalog.WorkspaceNameTakenError

Bases: Exception

Raised when an INSERT collides with the unique name constraint.

exception twiga.mlops.catalog.WorkspaceNotFoundError

Bases: Exception

Raised when a slug lookup misses.

class twiga.mlops.storage.LocalFsStorage(root)

Bases: object

Filesystem-backed workspace storage.

Layout under root:

dataset.parquet
mlruns.db
monitoring_config.json
checkpoints/
reports/
capture/
    features/        # JSONL per day — feature rows seen at /predict
    predictions/     # JSONL per day — per-model predicted values
    actuals/         # JSONL per day — ground-truth submitted via /actuals
monitoring/
    reports/         # Evidently HTML/JSON per scheduled or manual run
    runs_index.jsonl # one row per run with paths + headline metrics
actuals_dir()
Return type:

Path

capture_dir()
Return type:

Path

checkpoints_dir()
Return type:

Path

dataset_path()
Return type:

Path

ensure_initialized()
Return type:

None

features_dir()
Return type:

Path

mlruns_db_path()
Return type:

Path

monitoring_config_path()
Return type:

Path

monitoring_dir()
Return type:

Path

monitoring_reports_dir()
Return type:

Path

monitoring_runs_index_path()
Return type:

Path

predictions_dir()
Return type:

Path

read_dataset()
Return type:

DataFrame

reports_dir()
Return type:

Path

property root: Path
write_dataset(raw_df)
Return type:

None
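
As a rough picture of the directory layout listed above, the following sketch creates the same sub-directories under a temporary root. Whether ensure_initialized() creates exactly these directories is an assumption; LAYOUT_DIRS and init_layout are illustrative names, not part of twiga.

```python
import tempfile
from pathlib import Path

# Sub-directories taken from the layout listing above; files such as
# dataset.parquet or mlruns.db are written later and are not created here.
LAYOUT_DIRS = (
    "checkpoints",
    "reports",
    "capture/features",
    "capture/predictions",
    "capture/actuals",
    "monitoring/reports",
)


def init_layout(root):
    """Create the directory skeleton under root and return the created paths."""
    root = Path(root)
    created = []
    for rel in LAYOUT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        created.append(path)
    return created


workspace_root = Path(tempfile.mkdtemp())
paths = init_layout(workspace_root)
```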

twiga.mlops.storage.local_tracking_uri_for(storage)

Return the SQLite tracking URI for a local workspace.

Return type:

str

twiga.mlops.mlflow_query.list_runs(experiment=None, max_results=200)

Return a tidy DataFrame of runs for display in the Experiments page.

Parameters:
  • experiment (str | None) – Limit to a single experiment name. If None, search across all.

  • max_results (int) – Cap on the number of runs returned.

Return type:

DataFrame

Returns:

A DataFrame with the columns in DISPLAY_COLUMNS. Missing columns are filled with empty strings or NaN so the table never crashes on a fresh tracking store.

twiga.mlops.mlflow_query.latest_run()

Return a dict summarising the most recent run, or None if no runs exist.

Return type:

dict | None

twiga.mlops.monitoring.read_predictions(storage, *, window_start, version_id=None)
Return type:

DataFrame

twiga.mlops.monitoring.read_actuals(storage, *, window_start)
Return type:

DataFrame

twiga.mlops.monitoring.predictions_vs_actuals(storage, *, start, end, version_id=None)

Long-form chart frame with one row per (timestamp, series, value).

series is either "actual" or a model name. Predictions are deduped on (timestamp, model, target) keeping the latest received_at; actuals are deduped on timestamp. The result is sorted by timestamp ascending, ready to feed into a lets_plot geom_line with color=series.

Return type:

DataFrame
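
The deduplication rule described above (keep the latest received_at per key, then sort by timestamp) might look like this on plain dictionaries. The row fields beyond timestamp, model, target, and received_at are made up for the example, and the real function works on DataFrames rather than lists.

```python
def dedupe_latest(rows, key_fields, order_field="received_at"):
    """Keep one row per key, preferring the largest order_field value."""
    latest = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in latest or row[order_field] >= latest[key][order_field]:
            latest[key] = row
    # Sorted ascending by timestamp, ready for a line chart.
    return sorted(latest.values(), key=lambda r: r["timestamp"])


predictions = [
    {"timestamp": "2024-01-01T01:00", "model": "mlp", "target": "load", "value": 10.0, "received_at": 1},
    {"timestamp": "2024-01-01T00:00", "model": "mlp", "target": "load", "value": 9.0, "received_at": 1},
    {"timestamp": "2024-01-01T01:00", "model": "mlp", "target": "load", "value": 11.0, "received_at": 2},
]
deduped = dedupe_latest(predictions, ("timestamp", "model", "target"))
```

Here the 01:00 prediction written at received_at=2 replaces the earlier one for the same (timestamp, model, target) key.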

twiga.mlops.monitoring.daily_capture_counts(storage)

One row per (date, kind, version_id) with the number of captured records.

Features and predictions carry their producing version, so the table on the Monitor page can show how each deployed version’s stream evolves day by day. Actuals are version-agnostic and reported with version_id="—".

Return type:

DataFrame

twiga.mlops.monitoring.build_retraining_frame(storage, *, target_col, date_col='timestamp', version_id=None)

Assemble a training-ready frame from captured features + predictions + actuals.

Schema (one row per unique feature timestamp):

  • date_col — timestamp.

  • Every non-target feature column the caller sent on /predict.

  • predicted_<target_col> — mean of per-model predictions for that timestamp (latest write wins on duplicates).

  • actual_<target_col> — the ground-truth value. Initialised from whatever the features row carried (if anything), then overridden by the actuals log when an upload exists for that timestamp.

Both target columns can be NaN independently:

  • No actuals uploaded yet, but the model forecasted this timestamp: predicted_<target> is filled, actual_<target> is NaN.

  • Caller sent target in the lookback row but the model never forecasted it (e.g. this timestamp was always in lookback, never in any horizon): actual_<target> is filled, predicted_<target> is NaN.

Rows are emitted per feature timestamp because retraining needs inputs — predictions or actuals at timestamps with no features behind them don’t contribute and are not included.

Return type:

DataFrame
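
To make the schema above concrete, here is a small sketch of the join on plain dictionaries. It assumes predictions have already been deduplicated and that each prediction or actual carries a "value" field; both are simplifications for illustration, not the library's data model.

```python
import math
from statistics import mean


def retraining_rows(features, predictions, actuals, target_col, date_col="timestamp"):
    """Join captured features with mean predictions and uploaded actuals."""
    preds_by_ts = {}
    for p in predictions:
        preds_by_ts.setdefault(p[date_col], []).append(p["value"])
    actual_by_ts = {a[date_col]: a["value"] for a in actuals}

    rows = []
    for feat in features:  # one output row per feature timestamp
        ts = feat[date_col]
        row = {k: v for k, v in feat.items() if k != target_col}
        values = preds_by_ts.get(ts)
        row[f"predicted_{target_col}"] = mean(values) if values else math.nan
        # Start from the value carried by the feature row, then let an
        # uploaded actual override it.
        row[f"actual_{target_col}"] = actual_by_ts.get(ts, feat.get(target_col, math.nan))
        rows.append(row)
    return rows


features = [{"timestamp": 1, "temp": 20.0, "load": 5.0}, {"timestamp": 2, "temp": 21.0}]
predictions = [
    {"timestamp": 2, "value": 6.0},
    {"timestamp": 2, "value": 8.0},
    {"timestamp": 3, "value": 9.0},  # no features behind it, so it is dropped
]
rows = retraining_rows(features, predictions, [], "load")
```

The first row has an actual but no prediction, the second a prediction but no actual, which are the two independent-NaN cases listed above.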

twiga.mlops.monitoring.run_batch(*, storage, reference_df, target_col, cadence, feature_cols=None, drift_threshold=0.5, now=None, version_id=None)

Run the full four-report batch for one trailing window. Persist the result.

Side effects: writes four Evidently HTML+JSON pairs into monitoring_reports_dir / <run_id>/ and appends a row to runs_index.jsonl summarising the run.

When version_id is supplied, the capture window is restricted to rows tagged with that version — this is what keeps prediction drift and performance metrics anchored to the model that actually produced them rather than mixing predictions across deployed champions. Actuals stay version-agnostic; the version filter on the prediction side carries over into the join.

Returns the run summary (suitable for surfacing in the UI).

Return type:

dict[str, Any]

twiga.mlops.training.get_registry()

Return the model registry, building it on first use.

Lazy so that importing this module (or twiga.mlops) does not eagerly scan every model config class for callers that never train.

Return type:

dict[str, ModelEntry]

twiga.mlops.training.build_model_config(entry, overrides)

Instantiate a model config, merging registry defaults with user overrides.

Return type:

BaseModelConfig


Exceptions#

exception twiga.core.exceptions.TwigaError

Bases: Exception

Base class for all twiga library exceptions.

exception twiga.core.exceptions.ConfigurationError

Bases: TwigaError, ValueError

Raised when a configuration is invalid or incompatible.

exception twiga.core.exceptions.MissingExtraError

Bases: TwigaError, ImportError

Raised when an optional dependency is not installed.

exception twiga.core.exceptions.NotFittedError

Bases: TwigaError, RuntimeError

Raised when a model or pipeline is used before fitting.

exception twiga.core.exceptions.PipelineError

Bases: TwigaError, RuntimeError

Raised for errors in the data pipeline.

twiga.core.exceptions.require_extra(package, extra)

Raise a helpful ImportError if an optional dependency is missing.

Parameters:
  • package (str) – The Python package to check (e.g. "shap").

  • extra (str) – The twiga extras group that provides it (e.g. "explain").

Raises:

MissingExtraError – If package cannot be imported.

Return type:

None

Example

>>> require_extra("shap", "explain")
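
A helper of this kind can be sketched with importlib. The class and function below are local stand-ins rather than twiga's own code, and the install hint assumes the distribution is published under the name twiga with the given extras group.

```python
import importlib.util


class MissingExtraSketch(ImportError):
    """Local stand-in for an error that is also an ImportError."""


def require_extra_sketch(package, extra):
    """Raise if package is not importable, naming the extras group to install."""
    if importlib.util.find_spec(package) is None:
        raise MissingExtraSketch(
            f"'{package}' is required for this feature; "
            f"install it with: pip install 'twiga[{extra}]'"  # assumed package name
        )


require_extra_sketch("json", "explain")  # stdlib module, so nothing is raised
```

Because the stand-in subclasses ImportError, existing `except ImportError` handlers still catch it, which mirrors the dual-base design listed for the exceptions above.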

Experiment Engine#

Run structured ablation experiments across multiple datasets, conditions, and CV folds. All MLflow tracking is automatic when MLFLOW_TRACKING_URI is set.

class twiga.experiment.ExperimentEngine(spec)

Bases: object

Runs an ExperimentSpec end to end.

Usage:

engine = ExperimentEngine(SPEC)
engine.cli_main(base_cfg=PipelineConfig(...))

Or programmatically:

summary = engine.run(base_cfg, groups=["gating"], dataset_keys=["MLVS-PT"])
cli_main(base_cfg, argv=None)

Parse CLI args then call run().

Recognised flags: --group, --dataset, --skip-hpo, --tracking-uri, --epochs, --num-trials, --folds.

Parameters:
  • base_cfg (PipelineConfig) – Root pipeline config (output_dir, epochs, etc.).

  • argv (list[str] | None) – Argument list; defaults to sys.argv[1:].

Return type:

None

run(base_cfg, groups=None, dataset_keys=None, skip_hpo=False, tracking_uri=None)

Run all conditions × datasets and return a cross-condition summary.

Parameters:
  • base_cfg (PipelineConfig) – Root PipelineConfig. Dataset-specific keys are applied on top via dataclasses.replace.

  • groups (list[str] | None) – Condition groups to run. None runs all groups.

  • dataset_keys (list[str] | None) – Dataset keys from spec.datasets. None runs all datasets.

  • skip_hpo (bool) – Skip Phase 1 backbone HPO (reuse saved params).

  • tracking_uri (str | None) – MLflow tracking URI. Falls back to the MLFLOW_TRACKING_URI / TWIGA_MLFLOW_TRACKING_URI env vars. Pass None to disable tracking entirely.

Return type:

DataFrame

Returns:

Summary DataFrame with mean ± std per condition.

class twiga.experiment.ExperimentSpec(name, output_prefix, condition_cls, backbone_cls, conditions, datasets, controlled_fields=<factory>, fixed_overrides=<factory>, cv_train_size=12, cv_test_size=4, cv_val_size=2, cv_calib_size=0, cv_stride=1, cv_folds=10, hemisphere='NH', reference_conditions=<factory>, plot_figures=True, save_condition_plots=True, sample_plot_steps=336)

Bases: object

Full declaration of a twiga ablation / benchmark experiment.

Pass an instance to ExperimentEngine to run the experiment.

Variables:
  • name – Human-readable experiment title (used in logs and plot titles).

  • output_prefix – Prefix for all CSV output files (e.g. "mlgaf_ablation" → mlgaf_ablation_summary.csv).

  • condition_cls – Model config class instantiated per condition (e.g. MLPGAFConfig).

  • backbone_cls – Model config class used for Phase 1 backbone HPO. Must be specified explicitly — typically the plain backbone without a probabilistic head (e.g. MLPGAMConfig for a CRC experiment).

  • conditions – List of Condition objects defining the experimental grid.

  • datasets – Registry of datasets. Keys are short names used with --dataset; values are dicts of PipelineConfig field overrides (dataset_name, train_start, window_stride, …).

  • hemisphere – Meteorological hemisphere used when annotating fold seasons in the summary. "NH" (default) uses Northern-Hemisphere conventions (Dec–Feb = Winter). Use "SH" for Southern Hemisphere sites where seasons are reversed.

CV protocol (all fields default to the standard 10-fold expanding window):

  • cv_train_size – Initial training window in split_freq units.

  • cv_test_size – Test window per fold in split_freq units.

  • cv_val_size – Validation window carved from the training tail.

  • cv_calib_size – Calibration window for conformal experiments (0 = disabled).

  • cv_stride – Advance between folds.

  • cv_folds – Maximum number of folds.

Output:

  • fixed_overrides – Applied to every model config before backbone params (e.g. {"use_revin": False, "value_embed_type": "ConvEmb"}).

  • controlled_fields – Stripped from backbone HPO params so ablation overrides always win.

  • reference_conditions – Maps group name → reference condition name for Δ-vs-reference columns in the summary.

  • plot_figures – Whether to call save_ablation_plots after the run.

backbone_cls: type
condition_cls: type
conditions: list[Condition]
controlled_fields: frozenset
cv_calib_size: int = 0
cv_folds: int = 10
cv_stride: int = 1
cv_test_size: int = 4
cv_train_size: int = 12
cv_val_size: int = 2
datasets: dict[str, dict]
fixed_overrides: dict
hemisphere: Literal['NH', 'SH'] = 'NH'
name: str
output_prefix: str
plot_figures: bool = True
reference_conditions: dict[str, str]
sample_plot_steps: int = 336
save_condition_plots: bool = True
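
One possible reading of the expanding-window CV protocol, using the default sizes above. The exact fold layout used by the engine is not documented here, so treat the index arithmetic as an assumption: each fold grows the training window by cv_stride, the validation window is the tail of the training window, and the test window follows immediately.

```python
def expanding_window_folds(n_units, train_size=12, test_size=4, val_size=2, stride=1, max_folds=10):
    """Return (start, end) index ranges per fold, measured in split_freq units."""
    folds = []
    for k in range(max_folds):
        train_end = train_size + k * stride  # training window grows each fold
        test_end = train_end + test_size
        if test_end > n_units:
            break  # not enough data for another fold
        val_start = train_end - val_size  # validation taken from the training tail
        folds.append({
            "train": (0, val_start),
            "val": (val_start, train_end),
            "test": (train_end, test_end),
        })
    return folds


# With 24 units of data only 9 of the 10 allowed folds fit.
folds = expanding_window_folds(n_units=24)
```

This also shows why cv_folds is described as a maximum: shorter datasets yield fewer folds.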
class twiga.experiment.Condition(name, group, description='', overrides=<factory>, model_cls=None, hpo_variant='', metric_types=<factory>, conformal_config=None, stage1_epochs_frac=None, calib_source='train_tail')

Bases: object

One experimental condition — what varies between backtesting runs.

Variables:
  • name – Short identifier used in filenames and summaries.

  • group – Experiment group this condition belongs to (e.g. "gating").

  • description – Human-readable note, shown in logs.

  • overrides – Key–value pairs applied to the model config after backbone HPO params and fixed overrides. These always win.

  • model_cls – Override the spec’s condition_cls for this condition. Use for multi-model experiments (e.g. MLPF vs MLPGAM vs MLPGAF).

  • metric_types – Which evaluation methods to call per fold. Each entry maps to one call: "point" → evaluate_point_forecast; "interval" → evaluate_interval_forecast; "quantile" → evaluate_quantile_forecast. Defaults to ["point"].

  • conformal_config – When set the forecaster is given these conformal params and calib_size from the spec is used for calibration within each backtesting fold.

calib_source: str = 'train_tail'
conformal_config: ConformalConfig | None = None
description: str = ''
group: str
hpo_variant: str = ''
metric_types: list[str]
model_cls: type | None = None
name: str
overrides: dict
stage1_epochs_frac: float | None = None
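
The layering of fixed overrides, backbone HPO params, controlled fields, and condition overrides can be pictured as successive dictionary updates. The order below is one reading of the descriptions above, and the function and field names are hypothetical.

```python
def effective_config(defaults, fixed_overrides, backbone_params, controlled_fields, condition_overrides):
    """Merge config layers; later layers overwrite earlier ones."""
    # Drop controlled fields so tuned values cannot shadow the ablation settings.
    tuned = {k: v for k, v in backbone_params.items() if k not in controlled_fields}
    merged = dict(defaults)
    merged.update(fixed_overrides)       # applied to every condition
    merged.update(tuned)                 # backbone HPO results
    merged.update(condition_overrides)   # condition overrides always win
    return merged


defaults = {"hidden": 64, "use_gate": False, "use_revin": True}
cfg = effective_config(
    defaults,
    fixed_overrides={"use_revin": False},
    backbone_params={"hidden": 256, "use_gate": True},
    controlled_fields=frozenset({"use_gate"}),
    condition_overrides={"hidden": 128},
)
```

The tuned use_gate value is discarded because the field is controlled, and the condition's hidden value replaces the tuned one.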
twiga.experiment.run_backbone_hpo(backbone_cls, cfg, data, target_series, calendar_variables, exogenous_features, lags, latitude, longitude, dataset_key, hpo_cache_dir, hpo_variant='')

Run Optuna HPO for backbone_cls on a fixed 14-month / 2-month split.

Saves best params to <hpo_cache_dir>/<dataset_key>/<model_name>_best_params.json and returns the param dict. The file is shared across runs — params are never recomputed unless the file is deleted.

Parameters:
  • backbone_cls (type) – Model config class with a from_data_config factory.

  • cfg (PipelineConfig) – Pipeline config providing train_start, epochs, num_trials.

  • data (DataFrame) – Full dataset DataFrame (must have a timestamp column).

  • target_series (str) – Target variable name.

  • calendar_variables (list) – Calendar feature names.

  • exogenous_features (list) – Exogenous feature names.

  • lags (list) – Lag indices.

  • latitude (float) – Site latitude (used by some feature builders).

  • longitude (float) – Site longitude.

  • dataset_key (str) – Short dataset identifier used in the cache path.

  • hpo_cache_dir (Path) – Root directory for cached HPO params.

  • hpo_variant (str) – Optional suffix appended to the model name when computing the cache key (e.g. "3group" → mlpgaf_3group_best_params.json). Allows a single config class to have per-variant HPO files.

Return type:

dict

Returns:

Best hyperparameter dict (same format as load_backbone_params()).

twiga.experiment.load_backbone_params(dataset_key, hpo_cache_dir, model_name_str, controlled_fields, fallback_paths=None)

Load saved backbone HPO params, strip controlled fields and model prefix.

Searches fallback_paths first (in order), then the canonical engine path. Returns an empty dict — with a warning — when no file is found.

Parameters:
  • dataset_key (str) – Short dataset identifier (e.g. "MLVS-PT").

  • hpo_cache_dir (Path) – Root directory for cached HPO params (typically <experiment_root>/backbone_hpo).

  • model_name_str (str) – Model name string (e.g. "mlpgaf").

  • controlled_fields (frozenset) – Keys to strip from the loaded params so ablation condition overrides always take precedence.

  • fallback_paths (list[Path] | None) – Additional JSON files to try before the canonical path.

Return type:

dict

Returns:

Dict of hyperparameter names → values, ready to setattr onto a model config object.
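
The lookup order and stripping steps can be sketched as follows. The canonical path pattern is taken from the run_backbone_hpo description; the "<model_name>_" key prefix is an assumption about how saved keys are named, and the warning emitted when no file is found is omitted for brevity.

```python
import json
import tempfile
from pathlib import Path


def load_params_sketch(dataset_key, hpo_cache_dir, model_name, controlled_fields, fallback_paths=None):
    """Return tuned params from the first existing file, or {} if none exists."""
    canonical = Path(hpo_cache_dir) / dataset_key / f"{model_name}_best_params.json"
    for path in [*(fallback_paths or []), canonical]:
        path = Path(path)
        if not path.exists():
            continue
        raw = json.loads(path.read_text())
        prefix = f"{model_name}_"  # assumed key prefix, e.g. "mlpgaf_hidden_size"
        params = {
            (k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in raw.items()
        }
        # Controlled fields are removed so condition overrides take precedence.
        return {k: v for k, v in params.items() if k not in controlled_fields}
    return {}


cache = Path(tempfile.mkdtemp())
(cache / "MLVS-PT").mkdir()
(cache / "MLVS-PT" / "mlpgaf_best_params.json").write_text(
    json.dumps({"mlpgaf_hidden_size": 128, "mlpgaf_use_gate": True})
)
params = load_params_sketch("MLVS-PT", cache, "mlpgaf", frozenset({"use_gate"}))
```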

twiga.experiment.aggregate(combined, prefix, root, suffix, reference_conditions)

Compute mean ± std per (group, condition, metric_type) and save CSVs.

Parameters:
  • combined (DataFrame) – Long-form DataFrame with one row per fold/horizon, tagged with group, condition, dataset, and optionally metric_type columns.

  • prefix (str) – Filename prefix for output CSVs.

  • root (Path) – Directory to write <prefix>_full<suffix>.csv and <prefix>_summary<suffix>.csv.

  • suffix (str) – Optional tag appended to filenames (e.g. "_val").

  • reference_conditions (dict[str, str]) – Maps group → reference condition name for Δ-vs-reference columns.

Return type:

DataFrame

Returns:

Summary DataFrame with MultiIndex (group, condition[, metric_type]) and one column per metric, plus _std variants, n_runs, and Δ columns. Empty DataFrame if no recognised metric columns are present.


Experiment Tracking#

MLflow bridge used by ExperimentEngine. All helpers are safe no-ops when MLflow is absent or no tracking URI is configured.

twiga.experiment.detect_tracking_uri(explicit=None)

Return a tracking URI if MLflow tracking is configured, else None.

Priority:

  1. explicit argument (caller-supplied).

  2. MLFLOW_TRACKING_URI env var (standard MLflow convention).

  3. TWIGA_MLFLOW_TRACKING_URI env var (Twiga-specific).

Returns None — never a default localhost — so callers know tracking is genuinely absent rather than pointed at an unreachable server.

Return type:

str | None
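
The priority order above amounts to returning the first configured candidate. In this sketch the environ parameter is a hypothetical addition that makes the function easy to exercise without touching the real environment, and treating empty strings as unset is an assumption.

```python
import os


def detect_tracking_uri_sketch(explicit=None, environ=None):
    """Return the first configured tracking URI, or None if nothing is set."""
    env = os.environ if environ is None else environ
    candidates = (
        explicit,
        env.get("MLFLOW_TRACKING_URI"),
        env.get("TWIGA_MLFLOW_TRACKING_URI"),
    )
    for uri in candidates:
        if uri:  # empty strings are treated as unset in this sketch
            return uri
    return None  # never a default localhost URI


fake_env = {"MLFLOW_TRACKING_URI": "sqlite:///a.db", "TWIGA_MLFLOW_TRACKING_URI": "sqlite:///b.db"}
chosen = detect_tracking_uri_sketch(environ=fake_env)
```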

twiga.experiment.tracking.parent_run_context(tracking_uri, spec, run_id, dataset_keys, groups)

Open an MLflow parent run for the whole engine.run() call.

Yields the active run object, or None when MLflow is absent / unconfigured.

Return type:

Generator[Any, None, None]

twiga.experiment.tracking.hpo_run_context(dataset_key, model_name, n_trials)

Open an MLflow HPO child run under the active parent.

Yields the active run object, or None when no parent run is active.

Return type:

Generator[Any, None, None]

twiga.experiment.tracking.condition_run_context(dataset_key, group, condition_name, metric_types, model_type=None)

Open an MLflow condition child run under the active parent.

NN fold-grandchild runs are opened automatically inside BaseNeuralForecast._configure_logger() whenever this run is active.

Yields the active run object, or None when no parent run is active.

Return type:

Generator[Any, None, None]

twiga.experiment.tracking.log_hpo_result(best_params_path)

Log HPO best-params artifact to the active MLflow run.

Return type:

None

twiga.experiment.tracking.log_model_config_params(model_config)

Log effective model config (post-merge) as MLflow params on the active run.

Return type:

None

twiga.experiment.tracking.log_condition_results(metrics_df, metrics_csv_path=None)

Log aggregated fold metrics (mean ± std) and optional CSV artifact.

Called after forecaster.backtesting() returns, inside the condition child run context.

Return type:

None

twiga.experiment.tracking.log_experiment_summary(summary_path)

Log the cross-condition summary CSV as an artifact on the active parent run.

Return type:

None


Logging#

twiga.core.utils.configure(level='INFO', *, colour=True, log_file=None, file_level='DEBUG', capture_warnings=True)

Activate Twiga logging. Call once from user code or experiment scripts.

Sets up a console handler (optionally colour-coded) and an optional file handler. Safe to call multiple times - existing handlers are cleared before new ones are attached.

Parameters:
  • level (str | int) – Console log level. Accepts level names ("DEBUG", "INFO", …) or integer constants (logging.DEBUG, …). Defaults to "INFO".

  • colour (bool) – Enable ANSI colour in console output. Automatically disabled when stdout is not a TTY (e.g. CI or redirected output). Defaults to True.

  • log_file (str | Path | None) – Optional path for a plain-text log file. Parent directory is created automatically if it does not exist. Defaults to None.

  • file_level (str | int) – Log level for the file handler. Defaults to "DEBUG" so full detail is always captured on disk even when the console shows only "INFO".

  • capture_warnings (bool) – Route warnings.warn() calls through the logging system. Defaults to True.

Return type:

Logger

Returns:

The configured root Twiga logging.Logger.

Raises:

ValueError – If level or file_level is not a recognised log-level string.

Example:

configure(level="DEBUG", log_file="results/run.log")
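
The "safe to call multiple times" behaviour can be illustrated with the standard logging module. This is a simplified sketch, not the library's implementation: it leaves out colour handling and warning capture, and the logger_name parameter is hypothetical.

```python
import logging
import sys


def configure_sketch(level="INFO", log_file=None, file_level="DEBUG", logger_name="twiga_sketch"):
    """Attach fresh console (and optional file) handlers to a named logger."""
    logger = logging.getLogger(logger_name)
    # Remove handlers left by an earlier call so records are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering

    console = logging.StreamHandler(sys.stderr)
    console.setLevel(level)  # raises ValueError for an unknown level name
    logger.addHandler(console)

    if log_file is not None:
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(file_level)
        logger.addHandler(file_handler)
    return logger


first = configure_sketch(level="DEBUG")
second = configure_sketch(level="INFO")  # replaces the handler from the first call
```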
twiga.core.utils.get_logger(name)

Return a named child of the Twiga root logger.

Call once at module level in every Twiga submodule:

log = get_logger(__name__)
Parameters:

name (str) – Dotted module name, typically __name__. Automatically prefixed with "twiga." if not already present.

Return type:

Logger

Returns:

A logging.Logger that inherits handlers from the Twiga root logger.
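
The automatic prefixing described above relies on the dotted-name hierarchy of the standard logging module. A minimal sketch, assuming the root logger is named "twiga" (the function name and root_name parameter are illustrative):

```python
import logging


def get_logger_sketch(name, root_name="twiga"):
    """Return a logger whose name sits under the root_name hierarchy."""
    if name != root_name and not name.startswith(root_name + "."):
        name = f"{root_name}.{name}"
    return logging.getLogger(name)


external = get_logger_sketch("my_pkg.mod")
internal = get_logger_sketch("twiga.core.metrics")
```

Because child loggers propagate records to their ancestors, handlers attached once to the root logger serve every module-level logger created this way.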