Backtesting & Time-Based Cross-Validation#

Source Files
  • twiga/core/backtester.py

  • twiga/forecaster/core.py

Standard k-fold cross-validation is unsuitable for time series because its folds ignore temporal ordering: a model could be trained on future data and evaluated on the past. Twiga implements time-based cross-validation through the TimeBasedCV class, which generates chronologically ordered train/test splits.

How It Works#

        graph TD
    subgraph "Rolling Window Strategy"
        A["Fold 1: Train [t0..t3] → Test [t3..t4]"]
        B["Fold 2: Train [t1..t4] → Test [t4..t5]"]
        C["Fold 3: Train [t2..t5] → Test [t5..t6]"]
    end

    subgraph "Expanding Window Strategy"
        D["Fold 1: Train [t0..t3] → Test [t3..t4]"]
        E["Fold 2: Train [t0..t4] → Test [t4..t5]"]
        F["Fold 3: Train [t0..t5] → Test [t5..t6]"]
    end
    
  • Rolling window: Training window has a fixed size and slides forward each fold

  • Expanding window: Training window starts from the beginning and grows each fold
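The difference between the two strategies can be sketched in a few lines of plain Python. This is a minimal illustration, not Twiga's implementation: the function name `fold_boundaries`, the use of day units, half-open intervals, and a stride equal to the test length are all assumptions made for the example.

```python
from datetime import date, timedelta


def fold_boundaries(start, end, train_days, test_days, window="rolling"):
    """Yield (train_start, train_end, test_start, test_end) per fold.

    Hypothetical helper: ranges are treated as half-open [start, end).
    """
    train_delta = timedelta(days=train_days)
    test_delta = timedelta(days=test_days)
    train_end = start + train_delta
    while train_end + test_delta <= end:
        if window == "expanding":
            train_start = start  # anchored at the first date, so the window grows
        else:
            train_start = train_end - train_delta  # fixed length, slides forward
        yield train_start, train_end, train_end, train_end + test_delta
        train_end += test_delta  # assumed stride: one test window per fold


first, last = date(2024, 1, 1), date(2024, 1, 7)
rolling = list(fold_boundaries(first, last, 3, 1, window="rolling"))
expanding = list(fold_boundaries(first, last, 3, 1, window="expanding"))

for r, e in zip(rolling, expanding):
    print("rolling train starts", r[0], "| expanding train starts", e[0])
```

Both strategies produce the same test windows; only the start of the training window differs.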

Class Hierarchy#

        classDiagram
    class TimeBasedSplit {
        <<abstract>>
        +split_freq: str
        +train_size: int
        +test_size: int
        +gap: int
        +stride: int
        +window: str
        +train_delta: relativedelta
        +forecast_delta: relativedelta
        +gap_delta: relativedelta
        +stride_delta: relativedelta
        #_splits_from_period()
        +split()*
    }

    class TimeBasedCV {
        +date_column: str
        +num_splits: int
        +split(data, start_dt, end_dt)
        +get_splits(data)
        +set_split_scheme()
        +get_scheme()
        +plot_split_scheme()
    }

    class SplitState {
        +train_start: Timestamp
        +train_end: Timestamp
        +forecast_start: Timestamp
        +forecast_end: Timestamp
    }

    TimeBasedSplit <|-- TimeBasedCV
    TimeBasedSplit ..> SplitState : creates
    

Configuration#

Backtesting behavior is controlled by ExperimentConfig parameters:

Parameter   Type                                                             Default      Description
split_freq  Literal["days", "minutes", "hours", "weeks", "months", "years"]  "months"     Unit for train/test/gap/stride sizes
train_size  int                                                              1            Training window length (in split_freq units)
test_size   int                                                              1            Test window length (in split_freq units)
gap         int                                                              0            Gap between training end and test start
stride      int | None                                                       None         Step size between folds (defaults to test_size)
window      Literal["expanding", "rolling"]                                  "expanding"  Window strategy
num_splits  int | None                                                       None         Maximum number of splits (None = all possible)

Example Configuration#

from twiga.core.config import ExperimentConfig

# Monthly backtesting: 6 months train, 1 month test, expanding window
config = ExperimentConfig(
    split_freq="months",
    train_size=6,
    test_size=1,
    gap=0,
    window="expanding",
)

# Daily backtesting: 14 days train, 7 days test, rolling window
config = ExperimentConfig(
    split_freq="days",
    train_size=14,
    test_size=7,
    gap=0,
    window="rolling",
    stride=7,  # move 7 days between folds
)
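To get a feel for how stride and gap affect the number of folds, the arithmetic can be sketched as follows. The helper `count_folds` is hypothetical and assumes every fold must fit entirely inside the available range; Twiga's own fold enumeration (including how num_splits is applied) may differ.

```python
def count_folds(total, train_size, test_size, gap=0, stride=None):
    """Number of folds that fit in `total` units of data (all in split_freq units)."""
    stride = test_size if stride is None else stride  # documented default
    first_fold = train_size + gap + test_size  # span consumed by the first fold
    if total < first_fold:
        return 0
    # Each additional fold shifts the test window forward by `stride` units.
    return (total - first_fold) // stride + 1


# 35 days of data with the daily configuration shown above.
weekly_stride = count_folds(total=35, train_size=14, test_size=7, stride=7)  # 3
daily_stride = count_folds(total=35, train_size=14, test_size=7, stride=1)   # 15
with_gap = count_folds(total=35, train_size=14, test_size=7, gap=7, stride=7)  # 2
print(weekly_stride, daily_stride, with_gap)
```

A smaller stride gives more, heavily overlapping folds; a gap shortens the usable range and therefore reduces the fold count.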

Using TimeBasedCV Directly#

The TimeBasedCV class can be used independently for custom splitting logic:

from twiga.core.backtester import TimeBasedCV

cv = TimeBasedCV(
    split_freq="months",
    train_size=6,
    test_size=1,
    gap=0,
    window="expanding",
    date_column="timestamp",
)

for bundle in cv.split(data):
    print(f"Fold {bundle.split_key + 1}:")
    print(f"  Train: {bundle.scheme['train_period'][0]} to {bundle.scheme['train_period'][1]}")
    print(f"  Test:  {bundle.scheme['test_period'][0]} to {bundle.scheme['test_period'][1]}")

The split() method yields SplitBundle named tuples. Each bundle's scheme field is a dict with the following keys:

{
    "train_idx":   np.ndarray,              # fit-window indices into original DataFrame
    "val_idx":     np.ndarray | None,       # None when val_size=0
    "calib_idx":   np.ndarray | None,       # None when calib_size=0
    "test_idx":    np.ndarray,
    "train_period": (start_dt, end_dt),
    "val_period":   (start_dt, end_dt) | None,
    "calib_period": (start_dt, end_dt) | None,
    "test_period":  (start_dt, end_dt),
}
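The relationship between a period tuple and its index array can be illustrated with NumPy. This is a sketch under the assumption that periods are half-open intervals [start, end); the source does not state which boundary convention Twiga uses, and `period_to_idx` is a hypothetical helper.

```python
import numpy as np

# Ten daily timestamps: 2024-01-01 .. 2024-01-10.
timestamps = np.arange("2024-01-01", "2024-01-11", dtype="datetime64[D]")


def period_to_idx(ts, period):
    """Positions of the timestamps falling in the half-open period [start, end)."""
    start, end = period
    return np.flatnonzero((ts >= start) & (ts < end))


train_period = (np.datetime64("2024-01-01"), np.datetime64("2024-01-08"))
test_period = (np.datetime64("2024-01-08"), np.datetime64("2024-01-11"))

train_idx = period_to_idx(timestamps, train_period)
test_idx = period_to_idx(timestamps, test_period)

# With half-open periods the two index sets never overlap.
print(train_idx, test_idx)
```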

Conformal Prediction Splits#

Conformal prediction requires a dedicated calibration set that the model has never seen during training. TimeBasedCV supports this natively via three new parameters:

Parameter     Type  Default       Description
val_size      int   0             Validation window carved from the end of the fit window (early stopping).
calib_size    int   0             Calibration window for conformal score computation.
calib_source  str   "train_tail"  Where to draw the calibration set.
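To show why the calibration set must be held out, here is a minimal sketch of the standard split-conformal quantile computed from calibration residuals. It is a textbook formulation, not Twiga's calibrate() implementation; the function name and the residual values are made up for illustration.

```python
import math


def conformal_halfwidth(calib_residuals, alpha=0.1):
    """Interval half-width with roughly (1 - alpha) coverage, from held-out residuals."""
    scores = sorted(abs(r) for r in calib_residuals)  # absolute-error conformity scores
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


# Hypothetical residuals (y_true - y_pred) on a calibration window.
residuals = [0.5, -1.0, 2.0, -0.2, 1.5, -0.8, 0.3, 1.1, -1.7]
q = conformal_halfwidth(residuals, alpha=0.2)
print(q)  # a point forecast y_hat becomes the interval [y_hat - q, y_hat + q]
```

If the residuals came from data the model was fitted on, they would be optimistically small and the resulting intervals too narrow, which is why the calibration window is kept separate from the fit window.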

calib_source Options#

"train_tail" (default): calib is carved from the end of the training window.

[────── fit ──────][── val ──][── calib ──][── gap ──][── test ──]

"gap": calib occupies the gap between training and test (requires calib_size ≤ gap).

[──────── fit ────────][── calib ──][── gap remainder ──][── test ──]

"test_prefix": calib is the first slice of the test window; evaluation uses the remainder.

[──────── fit ────────][── gap ──][── calib ──][── eval ──]
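The three layouts can be expressed as simple arithmetic on segment lengths. The sketch below is illustrative only: `fold_layout` is a hypothetical helper, and it assumes that val (and, for "train_tail", calib) is subtracted from train_size, which matches the diagrams but is not confirmed for every option.

```python
def fold_layout(train_size, test_size, gap=0, val_size=0, calib_size=0,
                calib_source="train_tail"):
    """Return the ordered (segment, length) pairs of one fold, in split_freq units."""
    if calib_source == "train_tail":
        # Assumption: val and calib are both taken out of train_size.
        fit = train_size - val_size - calib_size
        segments = [("fit", fit), ("val", val_size), ("calib", calib_size),
                    ("gap", gap), ("test", test_size)]
    elif calib_source == "gap":
        if calib_size > gap:
            raise ValueError("calib_source='gap' needs calib_size <= gap")
        segments = [("fit", train_size - val_size), ("val", val_size),
                    ("calib", calib_size), ("gap", gap - calib_size),
                    ("test", test_size)]
    elif calib_source == "test_prefix":
        if calib_size >= test_size:
            raise ValueError("calib must leave part of the test window for evaluation")
        segments = [("fit", train_size - val_size), ("val", val_size), ("gap", gap),
                    ("calib", calib_size), ("eval", test_size - calib_size)]
    else:
        raise ValueError(f"unknown calib_source: {calib_source!r}")
    if segments[0][1] <= 0:
        raise ValueError("val_size and calib_size leave no room for fitting")
    # Drop empty segments so the result mirrors the diagrams.
    return [(name, length) for name, length in segments if length > 0]


tail = fold_layout(8, 1, val_size=1, calib_size=2)
in_gap = fold_layout(8, 1, gap=3, calib_size=2, calib_source="gap")
prefix = fold_layout(8, 3, calib_size=1, calib_source="test_prefix")
print(tail, in_gap, prefix, sep="\n")
```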

Example: Conformal CV with train_tail#

cv = TimeBasedCV(
    split_freq="months",
    train_size=8,
    test_size=1,
    val_size=1,    # 1 month for early stopping
    calib_size=2,  # 2 months for conformal calibration
    calib_source="train_tail",
    date_column="timestamp",
)

for bundle in cv.split(data):
    forecaster.fit(bundle.fit_df, val_df=bundle.val_df)
    forecaster.calibrate(bundle.calib_df)
    predictions, metrics = forecaster.evaluate_interval_forecast(bundle.test_df)

conformal_split() — Single-Fold Helper#

For experiments that use a single pre-defined train/test split rather than full CV:

from twiga.core.backtester import conformal_split

fit_df, val_df, calib_df, eval_df = conformal_split(
    train_df,
    test_df,
    calib_source="train_tail",  # or "test_prefix"
    calib_ratio=0.2,            # 20% of train → calibration
    val_ratio=0.0,              # no early-stopping split
)
forecaster.fit(fit_df)
forecaster.calibrate(calib_df)
predictions, metrics = forecaster.evaluate_interval_forecast(eval_df)

calib_source="gap" is not supported by conformal_split; use TimeBasedCV with calib_source="gap" instead.

Backtesting with TwigaForecaster#

The TwigaForecaster.backtesting() method runs the full train → evaluate cycle over each fold:

from twiga.core.config import DataPipelineConfig, ExperimentConfig
from twiga.forecaster.core import TwigaForecaster
from twiga.models.ml.xgboost_model import XGBOOSTConfig

data_config = DataPipelineConfig(
    target_feature="load_mw",
    period="1h",
    lookback_window_size=168,
    forecast_horizon=48,
)

train_config = ExperimentConfig(
    split_freq="months",
    train_size=3,
    test_size=1,
    window="expanding",
)

forecaster = TwigaForecaster(
    data_params=data_config,
    model_params=[XGBOOSTConfig()],
    cv_params=train_config,
)

predictions_df, metrics_df = forecaster.backtesting(
    data=full_dataset,
    train_ratio=1.0,
    verbose=True,
    ensemble_strategy="mean",
)

What Happens per Fold#

For each fold, backtesting():

  1. Resets the data pipeline so scalers are re-fitted on this fold’s training window only.

  2. Calls self.fit(train_df, val_df=val_df) — fits the pipeline and all models. val_df is supplied when val_size > 0 and is used for early stopping.

  3. Routes evaluation through one of three paths depending on configuration:

    • Conformal path (calib_size > 0 and conformal_params set): calls calibrate(calib_df) then evaluate_interval_forecast(test_df).

    • Native interval path (eval_interval=True, no conformal): calls evaluate_interval_forecast(test_df) using intervals from the model’s own output.

    • Point path (default): calls evaluate_point_forecast(test_df).

  4. Adds a Folds column to track which fold produced each result and concatenates all results.
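The reason for step 1 is easiest to see in a stripped-down loop. The sketch below uses only the standard library, a toy standardising scaler, and a toy "model" that predicts the training mean; none of these names come from Twiga. What it demonstrates is that the scaler statistics are recomputed from each fold's training slice, so test values never leak into preprocessing.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return a standardising function fitted on `values` only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant series
    return lambda x: (x - mu) / sigma


series = [float(v) for v in range(1, 13)]
# Expanding-window folds as (train slice, test slice); hypothetical boundaries.
folds = [(slice(0, 6), slice(6, 8)), (slice(0, 8), slice(8, 10)), (slice(0, 10), slice(10, 12))]

rows = []
for fold, (train_sl, test_sl) in enumerate(folds):
    train, test = series[train_sl], series[test_sl]
    scale = fit_scaler(train)  # re-fitted per fold, never sees `test`
    forecast = mean(scale(v) for v in train)  # toy model: scaled training mean
    mae = mean(abs(scale(v) - forecast) for v in test)
    rows.append({"Folds": fold, "mae": round(mae, 3)})  # tag results with the fold id

print(rows)
```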

        sequenceDiagram
    participant B as backtesting()
    participant CV as TimeBasedCV.split()
    participant F as fit()
    participant C as calibrate()
    participant E as evaluate()

    B->>CV: Generate 4-way splits
    loop For each fold
        CV-->>B: (fit_df, val_df, calib_df, test_df)
        B->>B: Reset data pipeline
        B->>F: fit(fit_df, val_df)
        F-->>B: Models trained
        alt conformal path
            B->>C: calibrate(calib_df)
            C-->>B: Conformal scores computed
            B->>E: evaluate_interval_forecast(test_df)
        else native interval path
            B->>E: evaluate_interval_forecast(test_df)
        else point path
            B->>E: evaluate_point_forecast(test_df)
        end
        E-->>B: (predictions_df, metrics_df)
        B->>B: Append fold results
    end
    B-->>B: pd.concat(all_predictions, all_metrics)
    

Aggregating Results#

# Average metrics across folds
avg_metrics = metrics_df.groupby("Model")[["mae", "rmse", "smape"]].mean().round(3)

# Metrics per fold
fold_metrics = metrics_df.groupby(["Model", "Folds"])[["mae", "rmse"]].mean()
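Reporting the spread across folds alongside the mean is often useful, since a low average can hide one poor fold. The sketch below builds a small stand-in for metrics_df; the model names and metric values are invented, and only the "Model", "Folds", "mae" and "rmse" column names are taken from the snippets above.

```python
import pandas as pd

# Toy stand-in for the metrics_df returned by backtesting().
metrics_df = pd.DataFrame({
    "Model": ["xgb", "xgb", "xgb", "naive", "naive", "naive"],
    "Folds": [0, 1, 2, 0, 1, 2],
    "mae": [1.0, 1.2, 1.1, 2.0, 2.4, 2.2],
    "rmse": [1.5, 1.6, 1.7, 2.5, 2.9, 2.7],
})

# Mean and sample standard deviation of each metric across folds, per model.
summary = metrics_df.groupby("Model")[["mae", "rmse"]].agg(["mean", "std"]).round(3)
print(summary)
```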

SplitState#

The SplitState class holds the time boundaries for a single split:

class SplitState:
    train_start: pd.Timestamp
    train_end: pd.Timestamp
    forecast_start: pd.Timestamp
    forecast_end: pd.Timestamp

The gap between train_end and forecast_start is controlled by the gap parameter.
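How the four boundaries relate to one another can be sketched with a small dataclass. `FoldWindow` and `make_window` are hypothetical stand-ins written for this example, using day units; they are not Twiga's SplitState.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FoldWindow:
    """Illustrative stand-in mirroring the four SplitState fields."""
    train_start: datetime
    train_end: datetime
    forecast_start: datetime
    forecast_end: datetime


def make_window(train_start, train_days, gap_days, test_days):
    train_end = train_start + timedelta(days=train_days)
    forecast_start = train_end + timedelta(days=gap_days)  # the gap separates the two
    forecast_end = forecast_start + timedelta(days=test_days)
    return FoldWindow(train_start, train_end, forecast_start, forecast_end)


w = make_window(datetime(2024, 1, 1), train_days=14, gap_days=2, test_days=7)
print(w.forecast_start - w.train_end)  # 2 days, 0:00:00
```

With gap=0 the forecast window starts exactly where the training window ends.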

Visualizing Splits#

TimeBasedCV provides a built-in visualization method (requires the plots dependency group):

cv.plot_split_scheme(data, train_ratio=1.0)

API Reference#

class twiga.core.backtester.TimeBasedCV(split_freq, test_size, train_size=None, gap=0, stride=None, window='rolling', date_column='timestamp', num_splits=None, *, val_size=0, calib_size=0, calib_source='train_tail')#

Bases: TimeBasedSplit

Concrete time-based cross-validation implementation for pandas DataFrames.

This class creates splits based on a datetime column and returns train/test indices, along with the corresponding time periods.

Variables:
  • date_column (str) – Name of the timestamp column in the DataFrame.

  • num_splits (int) – Optional number of splits to generate.

  • val_size (int) – Validation window size in split_freq units (default: 0).

  • calib_size (int) – Calibration window size in split_freq units (default: 0).

  • calib_source (str) – Where to draw the calibration set from. One of ‘train_tail’, ‘gap’, or ‘test_prefix’ (default: ‘train_tail’).

property calib_delta: relativedelta#

Calibration window duration.

duration_in_units(start, end, split_freq)#

Compute the duration between start and end in the specified split_freq units.

For ‘days’, ‘minutes’, ‘hours’, and ‘weeks’, a simple conversion based on timedelta is used. For ‘months’ and ‘years’, relativedelta is used to account for variable lengths.

Parameters:
  • start (Timestamp) – Start timestamp.

  • end (Timestamp) – End timestamp.

  • split_freq (str) – One of ‘days’, ‘minutes’, ‘hours’, ‘weeks’, ‘months’, or ‘years’.

Return type:

int

Returns:

int – Duration in the specified units.

Raises:

ValueError – If split_freq is unsupported.
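The distinction between fixed-length and calendar units can be sketched as below. This is not the library's code: it replaces relativedelta with manual calendar arithmetic to stay within the standard library, and it assumes end >= start and that partial units are rounded down.

```python
from datetime import datetime

_SECONDS = {"minutes": 60, "hours": 3600, "days": 86400, "weeks": 604800}


def duration_in_units(start, end, split_freq):
    """Whole number of `split_freq` units between start and end (end >= start assumed)."""
    if split_freq in _SECONDS:
        # Fixed-length units: plain timedelta arithmetic is enough.
        return int((end - start).total_seconds() // _SECONDS[split_freq])
    if split_freq in ("months", "years"):
        # Calendar units vary in length, so count calendar months instead of seconds.
        months = (end.year - start.year) * 12 + (end.month - start.month)
        if (end.day, end.time()) < (start.day, start.time()):
            months -= 1  # the final month is not yet complete
        return months if split_freq == "months" else months // 12
    raise ValueError(f"unsupported split_freq: {split_freq!r}")


print(duration_in_units(datetime(2024, 1, 15), datetime(2024, 4, 14), "months"))  # 2
print(duration_in_units(datetime(2024, 1, 1), datetime(2024, 1, 15), "weeks"))    # 2
```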

get_scheme()#

Return the current split configuration.

Return type:

dict

Returns:

dict – A dictionary containing the train/test split indices and periods.

Raises:

ValueError – If the split scheme has not been initialized.

plot_split_scheme(data=None, train_ratio=1.0, start_dt=None, end_dt=None, title='Cross-validation split scheme', colors=None, alpha=0.88, x_ticks=6, font_size=10, line_width=0.8, x_axis_angle=30, legend_pos='top', legend_direction=None, legend_key_size=None, legend_border=False)#

Visualize the time series cross-validation split scheme.

Renders a Gantt-style plot with one horizontal bar per fold, colour-coded by segment (Train / Val / Calib / Test), styled with the Twiga theme.

Parameters:
  • data (DataFrame | None) – Input DataFrame containing temporal data. Used to derive the split scheme when it has not been pre-computed.

  • train_ratio (float) – Proportion of training indices used for training; the remainder becomes a validation segment. Used only when val_idx is not available in the scheme (i.e. val_size=0).

  • start_dt (Timestamp | None) – Optional start timestamp passed to set_split_scheme.

  • end_dt (Timestamp | None) – Optional end timestamp passed to set_split_scheme.

  • title (str) – Plot title.

  • colors (dict[str, str] | None) – Custom colour mapping for segments. Keys must be title-case: "Train", "Val", "Calib", "Test".

  • alpha (float) – Bar transparency (0–1).

  • x_ticks (int) – Number of date ticks on the x-axis.

  • font_size (int) – Base font size in points.

  • line_width (float) – Axis line stroke width.

  • x_axis_angle (int) – Rotation angle for x-axis tick labels.

  • legend_pos (str) – Legend position - "top", "bottom", "left", "right", or "none".

  • legend_direction (str | None) – Arrange legend keys "horizontal" or "vertical". Passed to twiga_theme().

  • legend_key_size (int | None) – Size in pixels of the colour swatch in each legend key. Passed to twiga_theme().

  • legend_border (bool) – If True, draw a thin grey border box around the legend. Passed to twiga_theme().

Returns:

A Lets-Plot ggplot object.

Raises:

ValueError – If train_ratio is outside [0, 1] or the split scheme is not initialised and no data is provided.

Example

>>> splitter = TimeBasedCV(split_freq="days", test_size=5, train_size=20, date_column="date")
>>> splitter.set_split_scheme(data["date"])
>>> splitter.plot_split_scheme(data, train_ratio=0.8, title="CV Scheme")

set_split_scheme(time_values, start_dt=None, end_dt=None)#

Calculate split indices from time series data.

The method sorts the datetime series, determines the time range to use, and computes indices for training and forecast periods based on the provided parameters. If num_splits is set, it adjusts train_size accordingly.

Parameters:
  • time_values (Series) – Datetime series to split.

  • start_dt (Timestamp | None) – Override start time (default: min of series).

  • end_dt (Timestamp | None) – Override end time (default: max of series).

Raises:

ValueError – If time_values is not a datetime series.

Return type:

None

split(data, start_dt=None, end_dt=None)#

Generate validated train/test splits.

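Parameters:
  • data (DataFrame) – Input DataFrame containing the date column.

  • start_dt (Timestamp | None) – Optional start timestamp passed to set_split_scheme.

  • end_dt (Timestamp | None) – Optional end timestamp passed to set_split_scheme.
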
Yields:

SplitBundle

Named tuple with fields (fit_df, val_df, calib_df, test_df, scheme, split_key).

val_df and calib_df are None when the corresponding size is 0.

Raises:

ValueError – If the required date column is missing or if the computed indices exceed data bounds.

property val_delta: relativedelta#

Validation window duration.

class twiga.core.backtester.TimeBasedSplit(split_freq, train_size, test_size, gap=0, stride=None, window='rolling')#

Bases: ABC

Abstract base class implementing core time-based splitting logic.

This class validates split parameters and provides properties to compute time deltas for the training period, forecast period, gap, and stride.

Variables:
  • split_freq (str) – Time unit for splits (e.g., ‘days’, ‘months’).

  • train_size (int) – Training period length in split_freq units.

  • test_size (int) – Forecast period length in split_freq units.

  • gap (int) – Gap between train and forecast periods.

  • stride (int) – Step size between splits.

  • window (str) – Window type (‘rolling’ or ‘expanding’).

__init__(split_freq, train_size, test_size, gap=0, stride=None, window='rolling')#

Initialize time-based split parameters.

Parameters:
  • split_freq (str) – Time unit for splits (e.g., ‘days’, ‘months’).

  • train_size (int) – Training period length in split_freq units.

  • test_size (int) – Forecast period length in split_freq units.

  • gap (int) – Gap between training and forecast periods (default: 0).

  • stride (int | None) – Step size between splits (default: test_size).

  • window (str) – Window type (‘rolling’ or ‘expanding’) (default: ‘rolling’).

Raises:

ValueError – If any parameter is invalid. In particular, train_size must be a positive integer that is greater than or equal to test_size.
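The kind of checks described here might look like the following sketch. Only the rule that train_size must be a positive integer no smaller than test_size is documented above; the remaining checks, the function name `validate_split_params`, and the error messages are assumptions added for illustration.

```python
VALID_FREQS = {"minutes", "hours", "days", "weeks", "months", "years"}
VALID_WINDOWS = {"rolling", "expanding"}


def validate_split_params(split_freq, train_size, test_size, gap=0, stride=None,
                          window="rolling"):
    """Raise ValueError on inconsistent settings; return the effective stride."""
    if split_freq not in VALID_FREQS:
        raise ValueError(f"unsupported split_freq: {split_freq!r}")
    if window not in VALID_WINDOWS:
        raise ValueError(f"unsupported window: {window!r}")
    for name, value in (("train_size", train_size), ("test_size", test_size)):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
    if train_size < test_size:
        raise ValueError("train_size must be >= test_size")  # documented rule
    if gap < 0:
        raise ValueError("gap must be non-negative")  # assumed rule
    stride = test_size if stride is None else stride  # documented default
    if stride <= 0:
        raise ValueError("stride must be a positive integer")  # assumed rule
    return stride


effective_stride = validate_split_params("months", train_size=6, test_size=1)
print(effective_stride)  # 1
```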

property forecast_delta: relativedelta#

Calculate forecast period duration.

property gap_delta: relativedelta#

Calculate gap duration.

abstractmethod split(data)#

Generate train/test splits from data.

property stride_delta: relativedelta#

Calculate stride duration.

property train_delta: relativedelta#

Calculate training period duration.