TwigaForecaster & Forecaster Architecture#
Source Files
twiga/forecaster/core.py - TwigaForecaster (user-facing entry point)
twiga/forecaster/abstract.py - AbstractForecaster (fit / predict / evaluate / tune orchestration)
twiga/forecaster/base.py - BaseForecaster (checkpointing, feature preparation, Optuna integration)
twiga/forecaster/registry.py - Dynamic model loading
twiga/forecaster/ensemble.py - Ensemble prediction strategies
twiga/forecaster/utils.py - Shape validation, DataFrame construction helpers
twiga/core/config/base.py - DataPipelineConfig, ForecasterConfig, ConformalConfig, BaseModelConfig
Overview#
TwigaForecaster is the primary interface for building, training, and evaluating time series forecasting models in Twiga. It accepts configuration objects, dynamically loads model implementations from a registry, and exposes a unified API that covers:
Multi-model training - fit one or many models (ML and/or neural network) in a single call.
Point and interval predictions - generate raw forecasts or conformal prediction intervals.
Evaluation - compute metrics against held-out data for both point and interval forecasts.
Backtesting - walk-forward cross-validation over expanding or rolling windows.
Hyperparameter tuning - Optuna-powered search with Hyperband pruning and TPE sampling.
Ensemble strategies - combine predictions from multiple models via mean, median, or weighted aggregation.
For a hands-on introduction see the Quick Start Guide.
Class Hierarchy#
classDiagram
direction TB
class TimeBasedCV {
+split_freq: str
+train_size: int
+test_size: int
+gap: int
+stride: int
+window: str
+split(data, start_dt, end_dt)
}
class AbstractForecaster {
<<abstract>>
+get_model_from_registry(model_params)
+fit(train_df, val_df, train_ratio)
+predict(test_df, ...) dict, dict
+predict_interval(test_df, ...) dict, dict
+evaluate(test_df, ...) DataFrame, DataFrame
+evaluate_point_forecast(test_df, ...)
+evaluate_interval_forecast(test_df, ...)
+backtesting(data, ...)
+tune(train_df, val_df, ...)
#_fit()*
#_predict()*
#_evaluate()*
#_tune()*
#_backtester()*
#_create_folder()*
}
class BaseForecaster {
+models: list
+data_pipeline: DataPipeline
+conformals: dict
+conformal_params
+checkpoints_path: Path
+logs_path: Path
+results_path: Path
+figures_path: Path
+_create_folder()
+on_save_checkpoint()
+on_load_checkpoint()
+_fit(train_df, val_df, train_ratio, trial)
+_predict(test_df, covariate_df)
+_evaluate(test_df, covariate_df)
+_tune(train_df, val_df, ...)
+_backtester(data, ...)
+create_optuna_study(...)
+prepare_test_data(test_df)
+get_ground_truth(test_df)
}
class TwigaForecaster {
+data_pipeline: DataPipeline
+models: list
+conformal: dict
+domain: str
+__init__(data_params, model_params, train_params, conformal_params)
+calibrate(calibrate_df, ...)
}
TimeBasedCV <|-- BaseForecaster
AbstractForecaster <|-- BaseForecaster
BaseForecaster <|-- TwigaForecaster
AbstractForecaster defines the orchestration logic (loops over models, builds DataFrames), while BaseForecaster supplies the concrete single-model implementations (_fit, _predict, _evaluate, _tune, _backtester) together with checkpointing, feature preparation, and Optuna study creation. TwigaForecaster wires everything together through configuration objects.
Constructing a Forecaster#
TwigaForecaster.__init__#
from twiga.forecaster.core import TwigaForecaster
forecaster = TwigaForecaster(
data_params=data_config, # DataPipelineConfig
model_params=[xgb_config], # BaseModelConfig | list | dict | list[dict]
train_params=train_config, # ForecasterConfig
conformal_params=conf_config, # ConformalConfig | None
)
| Parameter | Type | Description |
|---|---|---|
| data_params | DataPipelineConfig | Defines target features, feature engineering, scaling, and temporal settings for the DataPipeline. |
| model_params | BaseModelConfig, list, dict, or list[dict] | One or more model configurations. Each configuration is validated against the model's registered config class. Accepts Pydantic models or plain dictionaries. |
| train_params | ForecasterConfig | Cross-validation split parameters (split_freq, train_size, test_size, gap, stride, window) and general training settings such as project_name, seed, and metrics. |
| conformal_params | ConformalConfig or None | Optional. Enables conformal prediction intervals when provided. Requires a subsequent calibrate() call before predict_interval can be used. |
During construction the forecaster:
Copies the date_column from data_params into train_params.
Initialises the DataPipeline from data_params.
Iterates over model_params, calling get_model_from_registry to dynamically load each model class and instantiate it.
Mixing ML and NN models
You can pass both ML configs (e.g. XGBOOSTConfig) and neural network configs (e.g. NHITSConfig) in a single model_params list. The registry resolves each model independently from twiga.models.ml.* or twiga.models.nn.*. See Models for the full catalogue.
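A minimal sketch of such a mixed setup, using the dictionary form resolved by the registry (only the name/domain keys are shown here; any model-specific options come from each model's registered config class):
forecaster = TwigaForecaster(
    data_params=data_config,                    # DataPipelineConfig
    model_params=[
        {"name": "xgboost", "domain": "ml"},    # resolved from twiga.models.ml.*
        {"name": "nhits", "domain": "nn"},      # resolved from twiga.models.nn.*
    ],
    train_params=train_config,                  # ForecasterConfig
)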
Core Workflow#
The typical lifecycle is configure -> fit -> predict / evaluate -> (optionally) backtest or tune.
sequenceDiagram
participant User
participant TF as TwigaForecaster
participant DP as DataPipeline
participant Reg as ModelRegistry
participant Model as Model(s)
participant Conf as Conformal
User->>TF: __init__(data_params, model_params, train_params, conformal_params)
TF->>DP: DataPipeline(**data_params)
TF->>Reg: get_model_from_registry(model_params)
Reg-->>TF: [model_1, model_2, ...]
User->>TF: fit(train_df, val_df)
TF->>DP: fit(train_df)
loop each model
TF->>TF: _create_folder()
TF->>DP: transform(train_split)
TF->>Model: model.fit(features, targets)
TF->>TF: on_save_checkpoint()
end
User->>TF: calibrate(calibrate_df)
TF->>TF: predict(calibrate_df)
loop each model
TF->>Conf: Conformal.calibrate(forecast, ground_truth)
end
User->>TF: evaluate_point_forecast(test_df)
TF->>TF: prepare_test_data(test_df)
TF->>TF: get_ground_truth(test_df)
loop each model
TF->>DP: transform_features(test_df)
TF->>Model: model.forecast(features)
TF->>TF: _rescale_predictions(forecast)
end
TF-->>User: (results_df, metrics_df)
Method Reference#
Training#
| Method | Signature | Description |
|---|---|---|
| fit | fit(train_df, val_df=None, train_ratio=1.0) | Fits the data pipeline (if not already fitted) and trains every registered model. For each model it creates the artifact directory, prepares features via the data pipeline, calls the model's fit, and saves a checkpoint. |
The fit method applies temporal filtering internally: it keeps only the last lookback_window_size + max_data_drop rows of training data to limit memory use.
Note
The data pipeline is fitted once on the first call. Subsequent calls to fit (e.g. during backtesting) skip the pipeline fit if it has already been initialised.
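A minimal training call might look like the sketch below (the 80/20 chronological split and the layout of df are illustrative assumptions):
split_point = int(len(df) * 0.8)
train_df, val_df = df.iloc[:split_point], df.iloc[split_point:]

forecaster.fit(train_df=train_df, val_df=val_df)
# checkpoints are written under {root_dir}/checkpoints/{project_name}/{model_type}/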
Prediction#
| Method | Signature | Returns |
|---|---|---|
| predict | predict(test_df, covariate_df=None, ensemble_strategy=None, ensemble_weights=None, prepare_test_data=True) | (predictions dict, inference-time dict) |
| predict_interval | predict_interval(test_df, covariate_df=None, ensemble_strategy=None, ensemble_weights=None, prepare_test_data=True) | ((lower, point, upper) dict, inference-time dict) |
| forecast | forecast(test_df, covariate_df=None, ensemble_strategy=None, ensemble_weights=None) | ForecastCollection |
predict and predict_interval return a two-element tuple:
Prediction dictionary - keys are model names (and optionally "Ensemble"), values are 3-D NumPy arrays of shape (num_samples, horizon, num_targets) for point forecasts, or (lower, point, upper) tuples for intervals.
Inference time dictionary - keys are model names, values are wall-clock seconds.
forecast is the higher-level alternative: it wraps each model’s output in a typed ForecastResult and returns a ForecastCollection. Ground truth is automatically extracted from test_df and attached to every ForecastResult. Use forecast when you want structured downstream access (.to_dataframe(), .evaluate(), etc.).
collection = forecaster.forecast(test_df)
result = collection["xgboost"] # ForecastResult
df = collection.to_dataframe() # tidy long-format DataFrame
metrics = collection.evaluate() # metrics DataFrame across all models
predict_interval requires that conformal_params was supplied at construction and that calibrate() has been called.
Warning
Calling predict_interval before calibrate() raises a ValueError. Always calibrate on a held-out calibration set that was not used during training.
Calibration#
| Method | Signature | Description |
|---|---|---|
| calibrate | calibrate(calibrate_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None) | Generates point predictions on the calibration set and fits a Conformal calibrator per model from the resulting forecasts and ground truth. |
See Conformal Prediction for the available method and score_type options in ConformalConfig.
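A minimal calibration flow, assuming a held-out calib_df that was not used during training (see the warning above):
forecaster.calibrate(calibrate_df=calib_df)

intervals, times = forecaster.predict_interval(test_df=test_df)
lower, point, upper = intervals["xgboost"]   # "xgboost" key is illustrative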
Evaluation#
| Method | Signature | Returns |
|---|---|---|
| evaluate | evaluate(test_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None, prediction_fn=None, evaluation_fn=None, is_interval=False) | (results_df, metrics_df) |
| evaluate_point_forecast | evaluate_point_forecast(test_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None) | (results_df, metrics_df) |
| evaluate_interval_forecast | evaluate_interval_forecast(test_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None) | (results_df, metrics_df) |
evaluate_point_forecast and evaluate_interval_forecast are convenience wrappers around evaluate that pre-bind the appropriate prediction and evaluation functions.
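For example (column names follow the Results/Metrics DataFrame structure described below; the "mae" column is an illustrative metric name):
results_df, metrics_df = forecaster.evaluate_point_forecast(test_df=test_df)

print(metrics_df)                              # one row per (Model, target) with metric columns
best = metrics_df.sort_values("mae").head(1)   # rank models by a chosen metric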
When using forecast(), evaluation is also available directly on the returned ForecastCollection:
| Method | Signature | Returns |
|---|---|---|
| ForecastCollection.evaluate | evaluate(**kwargs) | Combined metrics DataFrame with a "Model" column |
ForecastCollection.evaluate() calls ForecastResult.evaluate() on each result in the collection and concatenates the metrics into a single DataFrame. This is the preferred path when you have already called forecast(), since ground truth is attached automatically.
The returned DataFrames have the following structure:
Results DataFrame (results_df):
| Column | Description |
|---|---|
| index (timestamp) | Date column set as the index |
| Model | Model name (uppercased) |
| target | Target variable name |
| forecast | Point forecast value |
| Actual | Ground truth value |
| lower | Lower bound (interval evaluation only) |
| upper | Upper bound (interval evaluation only) |
Metrics DataFrame (metrics_df):
| Column | Description |
|---|---|
| Model | Model name (uppercased) |
| target | Target variable name |
| metric columns | One column per metric (e.g. mae, rmse) |
| inference-time | Wall-clock prediction time in seconds |
See Metrics for the full list of supported evaluation metrics.
Backtesting#
| Method | Signature | Returns |
|---|---|---|
| backtesting | backtesting(data, train_ratio=1.0, start_dt=None, end_dt=None, verbose=True, trial=None, ensemble_strategy=None, ensemble_weights=None) | (results_df, metrics_df) concatenated across folds |
Backtesting performs walk-forward cross-validation by iterating over temporal splits produced by the inherited TimeBasedCV.split() method. For each fold it:
Calls fit(train_df) to retrain on the expanding (or rolling) training window.
Calls evaluate_point_forecast(test_df) on the held-out test window.
Tags results with a Folds column.
The split behaviour is controlled by ForecasterConfig parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_freq | str | "months" | Time unit for the splits, e.g. "months" |
| train_size | int | 1 | Length of the training window in split_freq units |
| test_size | int | 1 | Length of the test window in split_freq units |
| gap | int | 0 | Gap between training end and test start |
| stride | int or None | None | Step size between folds; when unset, a default is derived from the other split settings |
| window | str | "expanding" | Window type: "expanding" or "rolling" |
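Putting it together, a walk-forward backtest over a full dataset might look like this sketch (date boundaries are illustrative):
results_df, metrics_df = forecaster.backtesting(
    data=df,                       # full dataset spanning all folds
    start_dt="2020-01-01",
    end_dt="2023-12-31",
)
print(results_df["Folds"].unique())   # each walk-forward window is tagged with its fold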
See Backtesting for a detailed explanation of the cross-validation scheme.
Hyperparameter Tuning#
| Method | Returns |
|---|---|
| tune | Models are updated in place with the best parameters found |
forecaster.tune(
train_df, val_df,
num_trials=10, reduction_factor=3, patience=10,
load_if_exists=True, initial_params=None,
direction="minimize", sampler=None, base_pruner=None,
objective_metric=None, calib_df=None, conformal_params=None,
)
tune performs Optuna-based hyperparameter optimisation for every registered model. For each model it:
Creates an Optuna study (or loads an existing one from a JournalFileBackend log).
Configures a HyperbandPruner wrapped in a PatientPruner and a TPESampler (seeded with self.seed).
Enqueues initial_params if provided.
Runs num_trials trials. Each trial calls model.update(trial) to sample hyperparameters, then _fit and _evaluate to compute a cost.
Replaces each model instance with a new one instantiated from the best parameters.
Interval-metric tuning (Strategy 1): Pass calib_df alongside conformal_params to extend each trial with a calibrate → evaluate-interval step. The trial score then reflects an interval quality metric (e.g. PICP, PINAW) instead of a point-forecast error. Use objective_metric to select which column of the metrics DataFrame to optimise; defaults to the first entry in self.metrics or "mae".
conformal_params overrides self.conformal_params for the duration of the tuning loop only — the original value is restored afterward.
After tuning, call fit again to train with the optimised configuration.
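A sketch of the interval-metric variant described above (the "winkler" objective and the trial count are illustrative; any column of the interval metrics DataFrame can be targeted):
forecaster.tune(
    train_df, val_df,
    num_trials=50,
    calib_df=calib_df,                 # each trial fits, calibrates, then evaluates intervals
    conformal_params=conf_config,      # overrides self.conformal_params for tuning only
    objective_metric="winkler",        # interval score column; lower is better
)
forecaster.fit(train_df, val_df)       # retrain with the optimised configuration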
Resuming studies
When load_if_exists=True (the default) the study is persisted to {logs_path}/{project_name}_{model_type}.log. Re-running tune continues from where the previous run left off, which is useful for incremental search.
See Hyperparameter Tuning for advanced recipes.
Model Registry#
Models are loaded dynamically at construction time through the registry in twiga/forecaster/registry.py. The registry uses the name field from each config to resolve a module at twiga.models.{domain}.{name}_model, then retrieves {NAME}Model and {NAME}Config from that module.
# Internally called during TwigaForecaster.__init__
model_cls, config_cls = get_model("xgboost", domain="ml")
# Imports twiga.models.ml.xgboost_model -> (XGBOOSTModel, XGBOOSTConfig)
Results are cached after the first load. If domain is not specified, the registry searches both ml and nn directories.
When configs are supplied as dictionaries, get_model_from_dict extracts the "name" key and follows the same resolution path:
forecaster = TwigaForecaster(
data_params=data_config,
model_params={"name": "lightgbm", "domain": "ml"},
train_params=train_config,
)
See Models for all available model implementations.
Ensemble Strategies#
When more than one model is registered and an ensemble_strategy is passed to predict, predict_interval, or the evaluation methods, predictions are combined into an additional "Ensemble" entry.
| Strategy | ensemble_strategy value | Formula |
|---|---|---|
| Mean | "mean" | Element-wise mean across models |
| Median | "median" | Element-wise median across models |
| Weighted | "weighted" | Element-wise weighted combination using the supplied ensemble_weights |
For the weighted strategy, pass a dict[str, float] mapping model names to weights:
predictions, times = forecaster.predict(
test_df=test_df,
ensemble_strategy="weighted",
ensemble_weights={"xgboost": 0.6, "lightgbm": 0.4},
)
For interval predictions, the ensemble is computed independently over the lower, point, and upper arrays.
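For instance, an ensembled interval forecast can be requested the same way as the weighted point forecast above; the combined result appears under the "Ensemble" key:
intervals, _ = forecaster.predict_interval(
    test_df=test_df,
    ensemble_strategy="weighted",
    ensemble_weights={"xgboost": 0.6, "lightgbm": 0.4},
)
lower, point, upper = intervals["Ensemble"]   # each bound combined independently across models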
Checkpointing & Directory Structure#
BaseForecaster._create_folder() creates a standardised directory layout under root_dir:
{root_dir}/
results/{project_name}/{model_type}/
logs/{project_name}/{model_type}/
figures/{project_name}/{model_type}/
checkpoints/{project_name}/{model_type}/[file_name]/
Checkpoint persistence differs by domain:
| Domain | Save | Load |
|---|---|---|
| ml | | |
| nn | Delegated to the model's own save logic | |
Inverse Scaling#
BaseForecaster._rescale_predictions automatically reverses the target scaling applied by the data pipeline. It supports three input formats:
| Format | Handling |
|---|---|
| NumPy array | Direct inverse transform via the pipeline's target scaler |
| tuple / list of arrays | Each element is inverse-transformed individually; 4-D arrays (e.g. quantile or sample dimensions) are reshaped appropriately |
| | |
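The general reshape pattern behind this kind of inverse scaling is sketched below with a scikit-learn scaler; this illustrates the technique only and is not Twiga's internal implementation:
import numpy as np
from sklearn.preprocessing import StandardScaler

def inverse_scale_3d(preds: np.ndarray, scaler: StandardScaler) -> np.ndarray:
    """Undo target scaling for predictions shaped (num_samples, horizon, num_targets)."""
    b, h, t = preds.shape
    flat = preds.reshape(-1, t)                 # scalers expect 2-D input: (rows, num_targets)
    return scaler.inverse_transform(flat).reshape(b, h, t)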
API Reference#
- class twiga.forecaster.core.TwigaForecaster(data_params, model_params, train_params, conformal_params=None)#
Bases: BaseForecaster
Machine Learning Forecaster for time series predictions.
This forecaster initializes a data pipeline and dynamically loads machine learning models based on provided configurations. The configurations can be specified as Pydantic models or dictionaries. Once the models are loaded, they can be trained, evaluated, and backtested.
Example
>>> from twiga.core.config import BaseModelConfig, DataPipelineConfig, ForecasterConfig
>>> data_params = DataPipelineConfig(date_column="date", ...)
>>> model_config = BaseModelConfig(name="linear", ...)
>>> train_params = ForecasterConfig(...)
>>> forecaster = TwigaForecaster(data_params, model_config, train_params)
>>> forecaster.fit(train_df)
>>> predictions, metrics = forecaster.evaluate_point_forecast(test_df)
- __init__(data_params, model_params, train_params, conformal_params=None)#
Initialize TwigaForecaster.
- Parameters:
data_params (DataPipelineConfig) – Configuration for the data pipeline.
model_params (BaseModelConfig | list[BaseModelConfig] | dict | list[dict]) – Configuration for the model(s). Can be a single Pydantic config, a dictionary, or a list of either. Neural network configs with unset dims (num_target_feature, forecast_horizon, lookback_window_size equal to 0) are auto-populated from data_params. Base arch configs with a distribution field set are automatically resolved to the corresponding probabilistic variant (e.g. MLPFConfig(distribution='normal') becomes an MLPFNormalConfig).
train_params (ForecasterConfig) – Training configuration parameters.
conformal_params (ConformalConfig | None) – Optional conformal prediction configuration.
- calibrate(calibrate_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None)#
Calibrate conformal prediction models using calibration data.
- Parameters:
calibrate_df (DataFrame | None) – Calibration dataset. If None, uses the stored training data.
covariate_df (DataFrame | None) – Optional covariate dataset.
ensemble_strategy (str | None) – Strategy for combining model predictions.
ensemble_weights (dict[str, float] | None) – Weights for weighted ensemble strategy.
- Raises:
ValueError – If conformal_params is not set.
- Return type:
- explain(X, model_idx=0, n_background=100)#
Compute SHAP feature attributions for a fitted ML model.
Builds a
ShapExplainerfor the model at position model_idx inself.models, runs SHAP over X, and returns aShapResultwith values reshaped to(B, L, F)- one attribution per sample, per lookback step, per feature.Only ML models (
domain="ml") are supported. Neural-network models require gradient-based attribution and are not currently handled.- Parameters:
X (
ndarray) – Feature array of shape(B, L, F)as produced by the data pipeline (e.g. fromDataPipeline.transform()).model_idx (
int) – Index intoself.modelsof the model to explain. Defaults to0(the first / only model).n_background (
int) – Number of background samples forLinearExplainerandKernelExplainer. Ignored for tree models.
- Return type:
ShapResult- Returns:
ShapResultwith –values- SHAP array(B, L, F)feature_names- original F feature namestimestep_labels- L lookback labels ('t-L+1'…'t0')expected_value- SHAP base value (mean prediction)
- Raises:
IndexError – If model_idx is out of range.
RuntimeError – If no models are fitted or the domain is not
"ml".ImportError – If
shapis not installed.
Example
>>> result = forecaster.explain(X_test)
>>> result.plot_importance(top_n=20)
>>> importance = result.mean_importance()
Base Classes#
- class twiga.forecaster.abstract.AbstractForecaster#
Bases: ABC
Abstract base class for time series forecasters with default implementations.
Provides implementations for fitting, evaluating, and tuning models. Subclasses must implement model-specific methods.
- backtesting(data, train_ratio=1.0, start_dt=None, end_dt=None, verbose=True, trial=None, ensemble_strategy=None, ensemble_weights=None)#
Perform backtesting on the forecaster models.
- Parameters:
data (DataFrame) – Complete dataset for backtesting.
train_ratio (float) – Ratio of data to use for training in backtesting.
start_dt (object | None) – Start date for backtesting, if any.
verbose (bool) – Whether to display detailed logs during backtesting.
trial (object | None) – Trial identifier or configuration, if any.
ensemble_strategy (str | None) – Strategy for combining model predictions, if any.
ensemble_weights (dict[str, float] | None) – Weights for weighted ensemble strategy, if any.
- Return type:
- Returns:
Tuple of concatenated predictions and backtesting metrics.
- Raises:
ValueError – If no models are available or no metrics are returned.
- evaluate(test_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None, prediction_fn=None, evaluation_fn=None, is_interval=False)#
Evaluate forecaster models on test data, supporting point or interval forecasts.
- Parameters:
test_df (DataFrame | None) – Test dataset, if any. Uses default test data if None.
covariate_df (DataFrame | None) – Dataset with additional covariates, if any.
ensemble_strategy (str | None) – Strategy for combining model predictions, if any.
ensemble_weights (dict[str, float] | None) – Weights for weighted ensemble strategy, if any.
prediction_fn (Callable | None) – Function to generate predictions (e.g., predict or predict_interval).
evaluation_fn (Callable | None) – Function to evaluate predictions (e.g., evaluate_point_forecast or evaluate_interval_forecast).
is_interval (bool) – Whether the evaluation is for interval forecasts (adds lower/upper bounds).
- Return type:
- Returns:
Tuple of –
DataFrame with columns: timestamp, Model, target, forecast, Actual, [lower, upper] (if is_interval=True).
DataFrame with columns: Model, target, MetricName, Value, inference-time.
- Raises:
ValueError – If ground truth, timestamp, or prediction shapes do not match.
- evaluate_interval_forecast(test_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None)#
Evaluate interval forecasts (wrapper for evaluate).
- Parameters:
test_df (DataFrame | None) – Test dataset, if any. Uses default test data if None.
covariate_df (DataFrame | None) – Dataset with additional covariates, if any.
ensemble_strategy (str | None) – Strategy for combining model predictions, if any.
ensemble_weights (dict[str, float] | None) – Weights for weighted ensemble strategy, if any.
- Return type:
- Returns:
Tuple of –
DataFrame with columns: timestamp, Model, target, forecast, Actual, lower, upper.
DataFrame with columns: Model, target, MetricName, Value, inference-time.
- evaluate_parametric_forecast(test_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None)#
Evaluate parametric probabilistic forecasts, computing NLL and point metrics.
Works with Gaussian ML models (
GAUSSCATBOOSTConfig) and neural parametric heads (MLPFConfig(distribution="normal"),"laplace","gamma", etc.).NLL is computed under a Normal distribution assumption unless the model supplies a
"log_likelihood"key (neural parametric heads do this automatically viaForecastResult).
- evaluate_point_forecast(test_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None)#
Evaluate point forecasts (wrapper for evaluate).
- Parameters:
test_df (DataFrame | None) – Test dataset, if any. Uses default test data if None.
covariate_df (DataFrame | None) – Dataset with additional covariates, if any.
ensemble_strategy (str | None) – Strategy for combining model predictions, if any.
ensemble_weights (dict[str, float] | None) – Weights for weighted ensemble strategy, if any.
- Return type:
- Returns:
Tuple of –
DataFrame with columns: timestamp, Model, target, forecast, Actual.
DataFrame with columns: Model, target, MetricName, Value, inference-time.
- evaluate_quantile_forecast(test_df=None, covariate_df=None, ensemble_strategy=None, ensemble_weights=None)#
Evaluate quantile forecasts, computing pinball loss, calibration error, and sharpness.
Requires QR models (e.g. QRXGBOOSTConfig, MLPGAMConfig(distribution="qr")).
- fit(train_df, val_df=None, train_ratio=1.0)#
Fit the forecaster models using training and optional validation data.
- forecast(test_df, covariate_df=None, ensemble_strategy=None, ensemble_weights=None)#
Generate predictions and return them as a typed ForecastCollection.
Unlike predict(), this method wraps each model's output in a ForecastResult with timestamps and target names populated, enabling structured downstream access (to_dataframe(), evaluate(), etc.).
The test data must contain the target column(s) so that timestamps can be aligned with the pipeline's sequence layout. For forward-looking prediction where actuals are unavailable, use predict() directly.
- Parameters:
- Return type:
- Returns:
ForecastCollection containing one ForecastResult per model (plus an optional "Ensemble" entry when ensemble_strategy is set).
- abstractmethod get_ground_truth(test_df, **kwargs)#
Retrieve ground truth data.
- get_model_from_registry(model_params)#
Load and instantiate models based on the provided configurations.
- Parameters:
model_params (
list[BaseModelConfig] |list[dict]) – List of model configurations as Pydantic models or dictionaries.- Raises:
TypeError – If model_params contains invalid types or mismatched configurations.
ValueError – If a dictionary configuration lacks a model name.
- Return type:
- predict(test_df, covariate_df=None, ensemble_strategy=None, ensemble_weights=None, prepare_test_data=True)#
Generate point predictions using the ensemble of forecasting models.
- Parameters:
test_df (DataFrame) – Test dataset.
covariate_df (DataFrame | None) – Dataset with additional covariates, if any.
ensemble_strategy (str | None) – Strategy for combining model predictions, if any.
ensemble_weights (dict[str, float] | None) – Weights for weighted ensemble strategy, if any.
prepare_test_data (bool) – Whether to preprocess the test data.
- Return type:
- Returns:
Tuple of –
Dictionary mapping model names to 3D NumPy array predictions.
Dictionary mapping model names to inference times.
- Raises:
ValueError – If predictions are not 3D NumPy arrays.
- predict_interval(test_df, covariate_df=None, ensemble_strategy=None, ensemble_weights=None, prepare_test_data=True)#
Generate conformal interval predictions for the given test data.
- Parameters:
test_df (DataFrame) – Test dataset.
covariate_df (DataFrame | None) – Dataset with additional covariates, if any.
ensemble_strategy (str | None) – Strategy for combining model predictions, if any.
ensemble_weights (dict[str, float] | None) – Weights for weighted ensemble strategy, if any.
prepare_test_data (bool) – Whether to preprocess the test data.
- Return type:
tuple[dict[str,tuple[ndarray,ndarray,ndarray]],dict[str,float]]- Returns:
Tuple of –
Dictionary mapping model names to tuples of (lower, forecast, upper) arrays.
Dictionary mapping model names to inference times.
- Raises:
ValueError – If conformal parameters or models are not set, or predictions are not 3D NumPy arrays.
- abstractmethod prepare_test_data(test_df)#
Prepare test data for evaluation.
- tune(train_df, val_df, num_trials=10, reduction_factor=3, patience=10, load_if_exists=True, initial_params=None, direction='minimize', sampler=None, base_pruner=None, objective_metric=None, calib_df=None, conformal_params=None)#
Perform hyperparameter tuning and update models with optimal parameters.
- Parameters:
train_df (DataFrame) – DataFrame containing training data.
val_df (DataFrame) – DataFrame containing validation data.
num_trials (int) – Number of trials for hyperparameter tuning.
reduction_factor (int) – Reduction factor for the tuning pruner.
patience (int) – Patience for the tuning process.
load_if_exists (bool) – If True, load an existing tuning study if available.
initial_params (dict | None) – Initial parameters for tuning.
direction (str) – Direction of optimization, either 'minimize' or 'maximize'.
sampler (object | None) – Sampler object for hyperparameter sampling.
base_pruner (object | None) – Pruner object for early stopping.
objective_metric (str | None) – Column name in the evaluation metrics DataFrame to optimise. None defaults to the first metric in self.metrics (or 'mae' as a fallback). Example: 'rmse', 'smape'.
calib_df (DataFrame | None) – Hold-out calibration dataset. When provided alongside conformal parameters, each trial fits → calibrates → evaluates interval metrics so that objective_metric can target an interval score (e.g. 'winkler', 'picp').
conformal_params (ConformalConfig | None) – Conformal prediction configuration to use during tuning. If None, uses the forecaster's existing conformal_params attribute. Ignored when calib_df is None.
- Return type:
- class twiga.forecaster.base.BaseForecaster(split_freq='months', test_size=1, train_size=1, gap=0, domain='ml', stride=None, window='expanding', date_column='timestamp', num_splits=None, project_name='experiment', file_name=None, seed=42, root_dir='../', metrics=None, checkpoints_path=None)#
Bases: TimeBasedCV, AbstractForecaster, ABC
Base forecaster class that provides model training, checkpointing, prediction, and evaluation capabilities.
Example
>>> class MyForecaster(BaseForecaster):
...     def predict(self, test_df: pd.DataFrame, covariate_df: pd.DataFrame | None) -> dict:
...         # Implement prediction logic here
...         return {"loc": np.zeros((10, 1))}
>>> forecaster = MyForecaster(split_freq="months", test_size=1, train_size=1, project_name="my-project")
>>> forecaster.fit(train_df)
>>> results_df, metrics_df = forecaster.evaluate(test_df)
- __init__(split_freq='months', test_size=1, train_size=1, gap=0, domain='ml', stride=None, window='expanding', date_column='timestamp', num_splits=None, project_name='experiment', file_name=None, seed=42, root_dir='../', metrics=None, checkpoints_path=None)#
Initialize the BaseForecaster.
- Parameters:
split_freq (str) – Frequency of the data, e.g., "months".
test_size (int) – Number of periods to forecast.
train_size (int) – Training window size.
gap (int) – Gap between training and forecast periods.
window (str) – Window type (e.g., "expanding").
date_column (str) – Name of the timestamp column.
domain (str) – Domain of the data (e.g., "ml").
num_splits (int | None) – Number of splits for cross-validation.
project_name (str) – Experiment name.
file_name (str | None) – Optional file name for checkpoints.
seed (int) – Random seed.
root_dir (str) – Root directory for results.
trial (Any | None) – Optional trial object (for hyperparameter optimization).
metrics (tuple[str] | list[str] | None) – List of metrics to evaluate.
checkpoints_path (str | None) – Explicit checkpoint directory. When set, overrides the path derived from root_dir/project_name/model_type and is available immediately - before fit() is called.
- create_optuna_study(num_trials=10, reduction_factor=3, patience=2, load_if_exists=True, base_pruner=None, sampler=None, direction='minimize')#
Create or load an Optuna study for hyperparameter optimization.
This method configures an Optuna study with a Hyperband pruner wrapped by a PatientPruner (if no custom pruner is provided) and a TPESampler with the instance’s seed (if no custom sampler is provided). The study is stored in a SQLite database within the results directory.
- Parameters:
num_trials (int) – Number of trials for the study (unused in study setup).
reduction_factor (int) – Reduction factor for HyperbandPruner.
patience (int) – Patience for PatientPruner.
load_if_exists (bool) – Whether to load an existing study if it exists.
base_pruner (Any | None) – Custom pruner to use instead of the default.
sampler (Any | None) – Custom sampler to use instead of the default.
direction (str) – Direction of optimization, either "minimize" or "maximize".
- Returns:
optuna.Study – Configured Optuna study.
- create_results_df(time_stamp, ground_truth, predictions, target_feature, date_column)#
Create a results DataFrame with the timestamp index, ground truth, and forecasted values.
- Parameters:
- Return type:
- Returns:
pd.DataFrame – DataFrame with ground truth and forecasted values indexed by timestamp.
Example
>>> results_df = forecaster.create_results_df(ts, gt, pred, ["temp"], "Date")
- get_ground_truth(test_df=None)#
Load the latest checkpoint and return inverse-scaled ground truth sequences.
Side effect: calls load_checkpoint_and_datapipe(), which calls on_load_checkpoint() and may overwrite self.model and self.data_pipeline from disk. Use this method during evaluation (where the checkpoint state is desired). For pure data extraction without checkpoint side effects - e.g. inside forecast() - use self.data_pipeline.get_ground_truth_sequences() directly instead.
- load_checkpoint_and_datapipe()#
Load the model checkpoint and restore the data pipeline from disk.
- Return type:
- prepare_test_data(test_df=None)#
Prepares and returns a formatted test DataFrame for prediction.
This method checks the consistency of the training DataFrame and test DataFrame, concatenates them if both are provided, and sorts the resulting DataFrame by the date column defined in the data pipeline.
- Parameters:
test_df (pandas.DataFrame, optional) – A DataFrame containing test data. If not provided, the training DataFrame is used as test data.
- Raises:
ValueError – If the training DataFrame length does not match the expected lookback window size plus maximum data drop.
ValueError – If neither training data nor test data is provided.
- Returns:
pandas.DataFrame – A sorted DataFrame ready for prediction.
Forecast Result Types#
- class twiga.forecaster.result.ForecastResult(timestamps, loc, targets, model_name, kind, ground_truth=None, scale=None, quantiles=None, quantile_levels=None, conf_level=None, samples=None, lower=None, upper=None, inference_time=0.0)#
Bases: object
Container for one model's forecast output.
- Variables:
timestamps – shape (n_batch, n_horizon, n_targets)
loc – point predictions (mean/median), shape (n_batch, n_horizon, n_targets)
targets – ordered list of target variable names
model_name – human-readable model identifier
kind – determines which optional arrays are expected and how to convert
ground_truth – optional, same shape as loc
scale – parametric std-dev / scale, same shape as loc
quantiles – shape (n_batch, n_q, n_horizon, n_targets)
quantile_levels – corresponding probability levels (e.g. [0.1, 0.5, 0.9])
samples – shape (n_batch, n_samples, n_horizon, n_targets)
lower – lower bound, same shape as loc
upper – upper bound, same shape as loc
inference_time – inference duration in seconds
conf_level
metric_name
- evaluate(ground_truth=None, **kwargs)#
Evaluate forecast against ground truth using kind-appropriate metrics.
Forwards to twiga.core.metrics.evaluate_forecast().
- Parameters:
- Return type:
- Returns:
DataFrame of per-day, per-target metrics.
- Raises:
ValueError – if no ground truth is available.
- kind: ForecastKind#
- to_dataframe(fmt='long')#
Convert forecast to tidy DataFrame.
Always includes: timestamp, target, model, forecast. Optional: actual (when ground_truth is present).
Additional columns depend on forecast kind:
POINT: no extra columns
PARAMETRIC: scale
INTERVAL: lower, upper
QUANTILE (fmt="wide"): q_0.10, q_0.50, …
QUANTILE (fmt="long"): q_level, quantile_forecast
SAMPLES: q_0.10, q_0.50, q_0.90 (empirical quantiles)
- Parameters:
fmt (
str) – “long” (default) or “wide” - only affects QUANTILE- Return type:
- Returns:
pandas DataFrame in long or wide format
- Raises:
ValueError – if fmt is invalid
- class twiga.forecaster.result.ForecastCollection(results=<factory>)#
Bases: object
Collection of ForecastResult objects from multiple models.
- evaluate(**kwargs)#
Evaluate all models and return a combined metrics DataFrame.
Calls
ForecastResult.evaluate()on each result and concatenates the output, adding a"Model"column derived from each result’smodel_name. Ground truth must be attached to each result (i.e.forecast()must have been called with test data that contains the target column).- Parameters:
**kwargs – Forwarded to each
ForecastResult.evaluate()call (e.g.metric_names,freq).- Return type:
- Returns:
Combined metrics DataFrame with a
"Model"column.- Raises:
ValueError – If the collection is empty or any result lacks ground truth.
- results: dict[str, ForecastResult]#
- to_dataframe(fmt='long')#
Concatenate all model forecasts into one DataFrame.
- Parameters:
fmt (
str) – passed to each ForecastResult.to_dataframe()- Return type:
- Returns:
Combined long-format DataFrame
- Raises:
ValueError – if collection is empty
Registry#
- twiga.forecaster.registry.get_model(name, domain=None)#
Lazily load the model and config classes from models/ml/ or models/nn/.
- Parameters:
- Return type:
- Returns:
tuple[Type, Type] – A tuple of (model_class, config_class).
- Raises:
ValueError – If the model is not found in the specified or default domains.
Ensemble#
- twiga.forecaster.ensemble.compute_ensemble_predictions(predictions, model_names, ensemble_strategy, ensemble_weights=None)#
Generate ensemble predictions by combining predictions from multiple models.
- Parameters:
predictions (list[ndarray]) – List of model predictions, where each prediction is a 3D NumPy array with shape (num_samples, horizon, num_targets).
model_names (list[str]) – List of model names corresponding to the predictions.
ensemble_strategy (EnsembleStrategy) – Strategy for combining predictions, one of EnsembleStrategy.MEAN, EnsembleStrategy.MEDIAN, or EnsembleStrategy.WEIGHTED.
ensemble_weights (dict[str, float] | None) – Dictionary mapping model names to their weights for the weighted ensemble strategy. Required if ensemble_strategy is EnsembleStrategy.WEIGHTED. Defaults to None.
- Return type:
- Returns:
A 3D NumPy array of ensemble predictions with shape (num_samples, horizon, num_targets).
- Raises:
ValueError – If predictions is empty, prediction shapes are inconsistent, weights are required but not provided, the number of weights does not match the number of models, or the ensemble strategy is unknown.
See Also#
Quick Start Guide - end-to-end walkthrough
Configuration System - DataPipelineConfig, ForecasterConfig, ConformalConfig
Data Pipeline - feature engineering and scaling
Metrics - point and interval evaluation functions
Backtesting - cross-validation strategy
Models - available model implementations
Conformal Prediction - prediction intervals
Hyperparameter Tuning - Optuna integration