Backtesting & Time-Based Cross-Validation#

Source Files
  • twiga/core/backtester.py

  • twiga/forecaster/core.py

Standard k-fold cross-validation does not work for time series because it breaks temporal ordering: a model could train on future data and predict the past. Twiga implements time-based cross-validation through the TimeBasedCV class, which generates chronologically ordered train/test splits.
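To see the leakage concretely, consider this purely illustrative sketch, where integer indices stand in for timestamps (none of it is twiga code):

```python
# Illustrative only: why a shuffled k-fold leaks future information.
# Index order == time order in this toy example.

# A shuffled k-fold might hand a model a fold like this:
shuffled_train = [0, 2, 5, 7, 9]   # includes index 9, the far future
shuffled_test = [1, 3, 4, 6, 8]

# Every training index that comes after a test index is temporal leakage:
leaked = [t for t in shuffled_train if t > min(shuffled_test)]
print(leaked)  # [2, 5, 7, 9]

# A time-based split keeps all training points strictly before the test window:
ordered_train = list(range(0, 6))   # [0..5]
ordered_test = list(range(6, 10))   # [6..9]
assert max(ordered_train) < min(ordered_test)  # no leakage
```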

How It Works#

graph TD
    subgraph "Rolling Window Strategy"
        A["Fold 1: Train [t0..t3] → Test [t3..t4]"]
        B["Fold 2: Train [t1..t4] → Test [t4..t5]"]
        C["Fold 3: Train [t2..t5] → Test [t5..t6]"]
    end

    subgraph "Expanding Window Strategy"
        D["Fold 1: Train [t0..t3] → Test [t3..t4]"]
        E["Fold 2: Train [t0..t4] → Test [t4..t5]"]
        F["Fold 3: Train [t0..t5] → Test [t5..t6]"]
    end
    
  • Rolling window: Training window has a fixed size and slides forward each fold

  • Expanding window: Training window starts from the beginning and grows each fold
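The two strategies differ only in where each fold's training window starts. A minimal sketch of the boundary arithmetic, using plain integer time steps rather than twiga's actual timestamp logic:

```python
def fold_bounds(fold, train_size, test_size, window, stride=None):
    """Return (train_start, train_end, test_start, test_end) for one fold.

    Plain integer time steps; stride defaults to test_size, mirroring
    the documented default. Illustrative only, not twiga's implementation.
    """
    stride = test_size if stride is None else stride
    train_end = train_size + fold * stride
    # Rolling keeps a fixed-size window; expanding always starts at t0.
    train_start = train_end - train_size if window == "rolling" else 0
    return (train_start, train_end, train_end, train_end + test_size)

# Rolling: the training window slides forward each fold
print([fold_bounds(f, 3, 1, "rolling") for f in range(3)])
# [(0, 3, 3, 4), (1, 4, 4, 5), (2, 5, 5, 6)]

# Expanding: the training window grows each fold
print([fold_bounds(f, 3, 1, "expanding") for f in range(3)])
# [(0, 3, 3, 4), (0, 4, 4, 5), (0, 5, 5, 6)]
```

With train_size=3 and test_size=1 this reproduces the fold layout shown in the diagrams above.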

Class Hierarchy#

classDiagram
    class TimeBasedSplit {
        <<abstract>>
        +split_freq: str
        +train_size: int
        +test_size: int
        +gap: int
        +stride: int
        +window: str
        +train_delta: relativedelta
        +forecast_delta: relativedelta
        +gap_delta: relativedelta
        +stride_delta: relativedelta
        #_splits_from_period()
        +split()*
    }

    class TimeBasedCV {
        +date_column: str
        +num_splits: int
        +split(data, start_dt, end_dt)
        +get_splits(data)
        +set_split_scheme()
        +get_scheme()
        +plot_split_scheme()
    }

    class SplitState {
        +train_start: Timestamp
        +train_end: Timestamp
        +forecast_start: Timestamp
        +forecast_end: Timestamp
    }

    TimeBasedSplit <|-- TimeBasedCV
    TimeBasedSplit ..> SplitState : creates
    

Configuration#

Backtesting behavior is controlled by ForecasterConfig parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_freq | Literal["days", "minutes", "hours", "weeks", "months", "years"] | "months" | Unit for train/test/gap/stride sizes |
| train_size | int | 1 | Training window length (in split_freq units) |
| test_size | int | 1 | Test window length (in split_freq units) |
| gap | int | 0 | Gap between training end and test start |
| stride | int \| None | None | Step size between folds (defaults to test_size) |
| window | Literal["expanding", "rolling"] | "expanding" | Window strategy |
| num_splits | int \| None | None | Maximum number of splits (None = all possible) |

Example Configuration#

from twiga.core.config import ForecasterConfig

# Monthly backtesting: 6 months train, 1 month test, expanding window
config = ForecasterConfig(
    split_freq="months",
    train_size=6,
    test_size=1,
    gap=0,
    window="expanding",
)

# Daily backtesting: 14 days train, 7 days test, rolling window
config = ForecasterConfig(
    split_freq="days",
    train_size=14,
    test_size=7,
    gap=0,
    window="rolling",
    stride=7,  # move 7 days between folds
)

Using TimeBasedCV Directly#

The TimeBasedCV class can be used independently for custom splitting logic:

from twiga.core.backtester import TimeBasedCV

cv = TimeBasedCV(
    split_freq="months",
    train_size=6,
    test_size=1,
    gap=0,
    window="expanding",
    date_column="timestamp",
)

for train_df, test_df, scheme, fold_idx in cv.split(data):
    print(f"Fold {fold_idx + 1}:")
    print(f"  Train: {scheme['train_period'][0]} to {scheme['train_period'][1]}")
    print(f"  Test:  {scheme['test_period'][0]} to {scheme['test_period'][1]}")

The split() method yields tuples of (train_df, test_df, scheme_dict, fold_index) where scheme_dict contains:

{
    "train_idx": np.ndarray,        # indices into original DataFrame
    "test_idx": np.ndarray,
    "train_period": (start_dt, end_dt),
    "test_period": (start_dt, end_dt),
}

Backtesting with TwigaForecaster#

The TwigaForecaster.backtesting() method runs the full train → evaluate cycle over each fold:

from twiga.core.config import DataPipelineConfig, ForecasterConfig
from twiga.forecaster.core import TwigaForecaster
from twiga.models.ml.xgboost_model import XGBOOSTConfig

data_config = DataPipelineConfig(
    target_feature="load_mw",
    period="1h",
    lookback_window_size=168,
    forecast_horizon=48,
)

train_config = ForecasterConfig(
    split_freq="months",
    train_size=3,
    test_size=1,
    window="expanding",
)

forecaster = TwigaForecaster(
    data_params=data_config,
    model_params=[XGBOOSTConfig()],
    train_params=train_config,
)

predictions_df, metrics_df = forecaster.backtesting(
    data=full_dataset,
    train_ratio=1.0,
    verbose=True,
    ensemble_strategy="mean",
)

What Happens per Fold#

For each fold, backtesting():

  1. Calls self.fit(train_df) - fits the data pipeline and all models

  2. Calls self.evaluate_point_forecast(test_df) - generates predictions and computes metrics

  3. Adds a Folds column to track which fold produced each result

  4. Concatenates all results across folds
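The four steps can be sketched as the following loop; the fit and evaluate stubs stand in for TwigaForecaster's real methods, and plain dicts stand in for pandas DataFrames (this is not twiga's source):

```python
def run_backtest(folds, fit, evaluate):
    """Sketch of the per-fold loop: fit on train, evaluate on test,
    tag each result with its fold index, then collect everything."""
    all_predictions, all_metrics = [], []
    for train_df, test_df, scheme, fold_idx in folds:
        fit(train_df)                         # step 1: fit pipeline + models
        preds, metrics = evaluate(test_df)    # step 2: predict + score
        preds["Folds"] = fold_idx             # step 3: tag the fold
        metrics["Folds"] = fold_idx
        all_predictions.append(preds)
        all_metrics.append(metrics)
    # step 4: concatenate (pd.concat in the real implementation)
    return all_predictions, all_metrics

# Tiny smoke test with dict stand-ins for DataFrames
folds = [({}, {}, {}, 0), ({}, {}, {}, 1)]
preds, metrics = run_backtest(
    folds,
    fit=lambda train: None,
    evaluate=lambda test: ({"y_hat": []}, {"mae": []}),
)
print([p["Folds"] for p in preds])  # [0, 1]
```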

sequenceDiagram
    participant B as backtesting()
    participant CV as TimeBasedCV.split()
    participant F as fit()
    participant E as evaluate_point_forecast()

    B->>CV: Generate train/test splits
    loop For each fold
        CV-->>B: (train_df, test_df, scheme, fold_idx)
        B->>F: fit(train_df)
        F-->>B: Models trained
        B->>E: evaluate_point_forecast(test_df)
        E-->>B: (predictions_df, metrics_df)
        B->>B: Append fold results
    end
    B-->>B: pd.concat(all_predictions), pd.concat(all_metrics)
    

Aggregating Results#

# Average metrics across folds
avg_metrics = metrics_df.groupby("Model")[["mae", "rmse", "smape"]].mean().round(3)

# Metrics per fold
fold_metrics = metrics_df.groupby(["Model", "Folds"])[["mae", "rmse"]].mean()

SplitState#

The SplitState class holds the time boundaries for a single split:

class SplitState:
    train_start: pd.Timestamp
    train_end: pd.Timestamp
    forecast_start: pd.Timestamp
    forecast_end: pd.Timestamp

The gap between train_end and forecast_start is controlled by the gap parameter.
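The arithmetic can be sketched with integer time steps (illustrative only, not twiga's implementation):

```python
def split_bounds(train_start, train_size, gap, test_size):
    """Illustrative boundary arithmetic for a single split with a gap."""
    train_end = train_start + train_size
    forecast_start = train_end + gap          # gap pushes the test window out
    forecast_end = forecast_start + test_size
    return train_end, forecast_start, forecast_end

print(split_bounds(0, 6, 0, 1))  # (6, 6, 7)  gap=0: test starts at train_end
print(split_bounds(0, 6, 2, 1))  # (6, 8, 9)  gap=2: two units are skipped
```

A nonzero gap is useful when the most recent observations would not be available at prediction time, e.g. due to data-delivery latency.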

Visualizing Splits#

TimeBasedCV provides a built-in visualization method (requires the plots dependency group):

cv.plot_split_scheme(data, train_ratio=1.0)

API Reference#

class twiga.core.backtester.TimeBasedCV(split_freq, test_size, train_size=None, gap=0, stride=None, window='rolling', date_column='timestamp', num_splits=None)#

Bases: TimeBasedSplit

Concrete time-based cross-validation implementation for pandas DataFrames.

This class creates splits based on a datetime column and returns train/test indices, along with the corresponding time periods.

Variables:
  • date_column (str) – Name of the timestamp column in the DataFrame.

  • num_splits (int) – Optional number of splits to generate.

duration_in_units(start, end, split_freq)#

Compute the duration between start and end in the specified split_freq units.

For ‘days’, ‘minutes’, ‘hours’, and ‘weeks’, a simple conversion based on timedelta is used. For ‘months’ and ‘years’, relativedelta is used to account for variable lengths.

Parameters:
  • start (Timestamp) – Start timestamp.

  • end (Timestamp) – End timestamp.

  • split_freq (str) – One of ‘days’, ‘minutes’, ‘hours’, ‘weeks’, ‘months’, or ‘years’.

Return type:

int

Returns:

int – Duration in the specified units.

Raises:

ValueError – If split_freq is unsupported.
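The separate month/year branch matters because calendar months vary in length. A standard-library illustration of the distinction (the actual implementation uses relativedelta, which is not shown here):

```python
from datetime import datetime

start = datetime(2024, 1, 1)
end = datetime(2024, 3, 1)

# 'days': a plain timedelta conversion is exact
print((end - start).days)  # 60 (31 days in January + 29 in February 2024)

# 'months': no fixed day count works, because months vary in length
# (31 days in January vs 29 in February 2024). Counting calendar
# months directly sidesteps the problem:
months = (end.year - start.year) * 12 + (end.month - start.month)
print(months)  # 2
```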

get_scheme()#

Return the current split configuration.

Return type:

dict

Returns:

dict – A dictionary containing the train/test split indices and periods.

Raises:

ValueError – If the split scheme has not been initialized.

plot_split_scheme(data=None, train_ratio=1.0, start_dt=None, end_dt=None, title='Cross-validation split scheme', colors=None, alpha=0.88, x_ticks=6, font_size=10, line_width=0.8, x_axis_angle=30, legend_pos='top')#

Visualize the time series cross-validation split scheme.

Renders a Gantt-style plot with one horizontal bar per fold, colour-coded by segment (Train / Val / Test), styled with the Twiga theme.

Parameters:
  • data (DataFrame | None) – Input DataFrame containing temporal data. Used to derive the split scheme when it has not been pre-computed.

  • train_ratio (float) – Proportion of training indices used for training; the remainder becomes a validation segment.

  • start_dt (Timestamp | None) – Optional start timestamp passed to set_split_scheme.

  • end_dt (Timestamp | None) – Optional end timestamp passed to set_split_scheme.

  • title (str) – Plot title.

  • colors (dict[str, str] | None) – Custom colour mapping for segments. Keys must be title-case: "Train", "Val", "Test".

  • alpha (float) – Bar transparency (0–1).

  • x_ticks (int) – Number of date ticks on the x-axis.

  • font_size (int) – Base font size in points.

  • line_width (float) – Axis line stroke width.

  • x_axis_angle (int) – Rotation angle for x-axis tick labels.

  • legend_pos (str) – Legend position: "top", "bottom", "left", "right", or "none".

Returns:

A Lets-Plot ggplot object.

Raises:

ValueError – If train_ratio is outside [0, 1] or the split scheme is not initialised and no data is provided.

Example

>>> splitter = TimeBasedCV(split_freq="days", test_size=5, train_size=20, date_column="date")
>>> splitter.set_split_scheme(data["date"])
>>> splitter.plot_split_scheme(data, train_ratio=0.8, title="CV Scheme")

set_split_scheme(time_values, start_dt=None, end_dt=None)#

Calculate split indices from time series data.

The method sorts the datetime series, determines the time range to use, and computes indices for training and forecast periods based on the provided parameters. If num_splits is set, it adjusts train_size accordingly.

Parameters:
  • time_values (Series) – Datetime series to split.

  • start_dt (Timestamp | None) – Override start time (default: min of series).

  • end_dt (Timestamp | None) – Override end time (default: max of series).

Raises:

ValueError – If time_values is not a datetime series.

Return type:

None

split(data, start_dt=None, end_dt=None)#

Generate validated train/test splits.

Parameters:
  • data (DataFrame) – Input DataFrame containing the configured date column.

  • start_dt (Timestamp | None) – Optional start timestamp passed to set_split_scheme.

  • end_dt (Timestamp | None) – Optional end timestamp passed to set_split_scheme.

Yields:

tuple – (train_df, test_df, scheme, split_key).

Raises:

ValueError – If the required date column is missing or if the computed indices exceed data bounds.

class twiga.core.backtester.TimeBasedSplit(split_freq, train_size, test_size, gap=0, stride=None, window='rolling')#

Bases: ABC

Abstract base class implementing core time-based splitting logic.

This class validates split parameters and provides properties to compute time deltas for the training period, forecast period, gap, and stride.

Variables:
  • split_freq (str) – Time unit for splits (e.g., ‘days’, ‘months’).

  • train_size (int) – Training period length in split_freq units.

  • test_size (int) – Forecast period length in split_freq units.

  • gap (int) – Gap between train and forecast periods.

  • stride (int) – Step size between splits.

  • window (str) – Window type (‘rolling’ or ‘expanding’).

__init__(split_freq, train_size, test_size, gap=0, stride=None, window='rolling')#

Initialize time-based split parameters.

Parameters:
  • split_freq (str) – Time unit for splits (e.g., ‘days’, ‘months’).

  • train_size (int) – Training period length in split_freq units.

  • test_size (int) – Forecast period length in split_freq units.

  • gap (int) – Gap between training and forecast periods (default: 0).

  • stride (int | None) – Step size between splits (default: test_size).

  • window (str) – Window type (‘rolling’ or ‘expanding’) (default: ‘rolling’).

Raises:

ValueError – If any parameter is invalid. In particular, train_size must be a positive integer that is greater than or equal to test_size.

property forecast_delta: relativedelta#

Calculate forecast period duration.

property gap_delta: relativedelta#

Calculate gap duration.

abstractmethod split(data)#

Generate train/test splits from data.

property stride_delta: relativedelta#

Calculate stride duration.

property train_delta: relativedelta#

Calculate training period duration.