Backtesting & Time-Based Cross-Validation#
Source Files
twiga/core/backtester.py
twiga/forecaster/core.py
Standard k-fold cross-validation is unsuitable for time series because it ignores temporal ordering: a model could train on future data and then be scored on the past. Twiga implements time-based cross-validation through the TimeBasedCV class, which generates chronologically ordered train/test splits.
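The leakage problem can be illustrated with a small, library-free sketch. The helpers `kfold_indices` and `has_future_leakage` are hypothetical names introduced here for illustration only; they are not part of Twiga. Integer positions stand in for timestamps.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) for plain contiguous k-fold, ignoring time."""
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if k == n_folds - 1 else start + fold_size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx


def has_future_leakage(train_idx, test_idx):
    """True if any training position is later than the earliest test position."""
    return max(train_idx) > min(test_idx)


# With 12 time-ordered samples and 3 folds, every fold except the last one
# trains on observations that come after the start of its test block.
leaks = [has_future_leakage(tr, te) for tr, te in kfold_indices(12, 3)]
print(leaks)
```

Only the final fold happens to be chronologically valid; time-based splitting makes that property hold for every fold.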
How It Works#
graph TD
subgraph "Rolling Window Strategy"
A["Fold 1: Train [t0..t3] → Test [t3..t4]"]
B["Fold 2: Train [t1..t4] → Test [t4..t5]"]
C["Fold 3: Train [t2..t5] → Test [t5..t6]"]
end
subgraph "Expanding Window Strategy"
D["Fold 1: Train [t0..t3] → Test [t3..t4]"]
E["Fold 2: Train [t0..t4] → Test [t4..t5]"]
F["Fold 3: Train [t0..t5] → Test [t5..t6]"]
end
Rolling window: Training window has a fixed size and slides forward each fold
Expanding window: Training window starts from the beginning and grows each fold
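As a rough sketch of the two strategies, the function below generates half-open `(start, end)` boundaries over integer time steps, which stand in for split_freq units. `window_bounds` is a hypothetical helper written for this page, not the TimeBasedCV implementation, and the half-open convention is an assumption.

```python
def window_bounds(n_steps, train_size, test_size, window="rolling", stride=None, gap=0):
    """Return a list of ((train_start, train_end), (test_start, test_end)) per fold."""
    if window not in ("rolling", "expanding"):
        raise ValueError(f"Unsupported window: {window!r}")
    stride = test_size if stride is None else stride

    folds = []
    train_start, train_end = 0, train_size
    # Keep adding folds while the test window still fits inside the data.
    while train_end + gap + test_size <= n_steps:
        test_start = train_end + gap
        folds.append(((train_start, train_end), (test_start, test_start + test_size)))
        train_end += stride
        if window == "rolling":
            # Rolling: the start moves too, so the training length stays fixed.
            train_start += stride
        # Expanding: train_start stays at 0, so the training window grows.
    return folds


rolling = window_bounds(6, train_size=3, test_size=1, window="rolling")
expanding = window_bounds(6, train_size=3, test_size=1, window="expanding")
print(rolling)
print(expanding)
```

With six time steps, a training size of 3, and a test size of 1, both calls produce the three folds drawn in the diagram above.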
Class Hierarchy#
classDiagram
class TimeBasedSplit {
<<abstract>>
+split_freq: str
+train_size: int
+test_size: int
+gap: int
+stride: int
+window: str
+train_delta: relativedelta
+forecast_delta: relativedelta
+gap_delta: relativedelta
+stride_delta: relativedelta
#_splits_from_period()
+split()*
}
class TimeBasedCV {
+date_column: str
+num_splits: int
+split(data, start_dt, end_dt)
+get_splits(data)
+set_split_scheme()
+get_scheme()
+plot_split_scheme()
}
class SplitState {
+train_start: Timestamp
+train_end: Timestamp
+forecast_start: Timestamp
+forecast_end: Timestamp
}
TimeBasedSplit <|-- TimeBasedCV
TimeBasedSplit ..> SplitState : creates
Configuration#
Backtesting behavior is controlled by ExperimentConfig parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_freq | str | | Unit for train/test/gap/stride sizes |
| train_size | int | | Training window length (in split_freq units) |
| test_size | int | | Test window length (in split_freq units) |
| gap | int | | Gap between training end and test start |
| stride | int | | Step size between folds (defaults to test_size) |
| window | str | | Window strategy |
| num_splits | int | | Maximum number of splits (None = all possible) |
Example Configuration#
from twiga.core.config import ExperimentConfig
# Monthly backtesting: 6 months train, 1 month test, expanding window
config = ExperimentConfig(
split_freq="months",
train_size=6,
test_size=1,
gap=0,
window="expanding",
)
# Daily backtesting: 14 days train, 7 days test, rolling window
config = ExperimentConfig(
split_freq="days",
train_size=14,
test_size=7,
gap=0,
window="rolling",
stride=7, # move 7 days between folds
)
Using TimeBasedCV Directly#
The TimeBasedCV class can be used independently for custom splitting logic:
from twiga.core.backtester import TimeBasedCV
cv = TimeBasedCV(
split_freq="months",
train_size=6,
test_size=1,
gap=0,
window="expanding",
date_column="timestamp",
)
for bundle in cv.split(data):
    print(f"Fold {bundle.split_key + 1}:")
    print(f" Train: {bundle.scheme['train_period'][0]} to {bundle.scheme['train_period'][1]}")
    print(f" Test: {bundle.scheme['test_period'][0]} to {bundle.scheme['test_period'][1]}")
The split() method yields SplitBundle named tuples where scheme contains:
{
"train_idx": np.ndarray, # fit-window indices into original DataFrame
"val_idx": np.ndarray | None, # None when val_size=0
"calib_idx": np.ndarray | None, # None when calib_size=0
"test_idx": np.ndarray,
"train_period": (start_dt, end_dt),
"val_period": (start_dt, end_dt) | None,
"calib_period": (start_dt, end_dt) | None,
"test_period": (start_dt, end_dt),
}
Conformal Prediction Splits#
Conformal prediction requires a dedicated calibration set that the model has never seen during training. TimeBasedCV supports this natively via three new parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| val_size | int | 0 | Validation window carved from the end of the fit window (early stopping). |
| calib_size | int | 0 | Calibration window for conformal score computation. |
| calib_source | str | 'train_tail' | Where to draw the calibration set. |
calib_source Options#
"train_tail" (default) — calib is carved from the end of the training window:
[────── fit ──────][── val ──][── calib ──][── gap ──][── test ──]
"gap" — calib occupies the gap between training and test (requires calib_size ≤ gap):
[──────── fit ────────][── calib ──][── gap remainder ──][── test ──]
"test_prefix" — calib is the first slice of the test window; evaluation uses the remainder:
[──────── fit ────────][── gap ──][── calib ──][── eval ──]
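The three layouts can be sketched as offset arithmetic in split_freq units. `layout_segments` is a hypothetical helper, not Twiga code. It assumes half-open `(start, end)` offsets measured from the start of the training window, that val is always carved from the end of the training window, and that the evaluation segment is reported under the key `"test"`.

```python
def layout_segments(train_size, test_size, gap=0, val_size=0, calib_size=0,
                    calib_source="train_tail"):
    """Return {segment_name: (start, end)} offsets for one fold."""
    if calib_source not in ("train_tail", "gap", "test_prefix"):
        raise ValueError(f"Unsupported calib_source: {calib_source!r}")
    if calib_source == "gap" and calib_size > gap:
        raise ValueError("calib_size must not exceed gap when calib_source='gap'")
    if calib_source == "test_prefix" and calib_size >= test_size:
        raise ValueError("calib_size must leave room for evaluation in the test window")

    # Only "train_tail" takes the calibration window out of the training window.
    calib_in_train = calib_size if calib_source == "train_tail" else 0
    fit_end = train_size - val_size - calib_in_train
    if fit_end <= 0:
        raise ValueError("val_size and calib_size leave no room for fitting")
    val_end = fit_end + val_size
    test_start = train_size + gap

    if calib_source == "train_tail":
        calib_start = val_end
    elif calib_source == "gap":
        calib_start = train_size
    else:  # "test_prefix"
        calib_start = test_start
    calib_end = calib_start + calib_size
    # With "test_prefix", evaluation starts after the calibration slice.
    eval_start = calib_end if calib_source == "test_prefix" else test_start

    segments = {"fit": (0, fit_end)}
    if val_size:
        segments["val"] = (fit_end, val_end)
    if calib_size:
        segments["calib"] = (calib_start, calib_end)
    segments["test"] = (eval_start, test_start + test_size)
    return segments


print(layout_segments(8, 1, val_size=1, calib_size=2, calib_source="train_tail"))
print(layout_segments(6, 1, gap=2, calib_size=1, calib_source="gap"))
print(layout_segments(6, 3, gap=1, calib_size=1, calib_source="test_prefix"))
```

Under these assumptions, a `train_size` of 8 with `val_size=1` and `calib_size=2` leaves 5 units for fitting in the train_tail layout.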
Example: Conformal CV with train_tail#
cv = TimeBasedCV(
split_freq="months",
train_size=8,
test_size=1,
val_size=1, # 1 month for early stopping
calib_size=2, # 2 months for conformal calibration
calib_source="train_tail",
date_column="timestamp",
)
for bundle in cv.split(data):
    forecaster.fit(bundle.fit_df, val_df=bundle.val_df)
    forecaster.calibrate(bundle.calib_df)
    predictions, metrics = forecaster.evaluate_interval_forecast(bundle.test_df)
conformal_split() — Single-Fold Helper#
For experiments that use a single pre-defined train/test split rather than full CV:
from twiga.core.backtester import conformal_split
fit_df, val_df, calib_df, eval_df = conformal_split(
train_df,
test_df,
calib_source="train_tail", # or "test_prefix"
calib_ratio=0.2, # 20% of train → calibration
val_ratio=0.0, # no early-stopping split
)
forecaster.fit(fit_df)
forecaster.calibrate(calib_df)
predictions, metrics = forecaster.evaluate_interval_forecast(eval_df)
calib_source="gap" is not supported by conformal_split; use TimeBasedCV with calib_source="gap" instead.
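To make the ratio arithmetic concrete, here is a minimal sketch that operates on plain Python lists rather than DataFrames. `ratio_split` is a hypothetical stand-in, not the conformal_split implementation. The rounding rule (`int()` truncation), the fit/val/calib ordering, and applying `calib_ratio` to the test rows under "test_prefix" are all assumptions.

```python
def ratio_split(train_rows, test_rows, calib_source="train_tail",
                calib_ratio=0.2, val_ratio=0.0):
    """Return (fit, val, calib, eval_rows); val is None when val_ratio yields 0 rows."""
    if calib_source not in ("train_tail", "test_prefix"):
        # Mirrors the note above: the "gap" source is not available here.
        raise ValueError(f"Unsupported calib_source: {calib_source!r}")

    n_train = len(train_rows)
    n_val = int(n_train * val_ratio)
    if calib_source == "train_tail":
        n_calib = int(n_train * calib_ratio)
        fit_end = n_train - n_val - n_calib
        val_end = fit_end + n_val
        calib = train_rows[val_end:]
        eval_rows = test_rows
    else:  # "test_prefix": calibration comes from the first rows of the test set
        n_calib = int(len(test_rows) * calib_ratio)
        fit_end = n_train - n_val
        val_end = n_train
        calib = test_rows[:n_calib]
        eval_rows = test_rows[n_calib:]

    fit = train_rows[:fit_end]
    val = train_rows[fit_end:val_end] if n_val else None
    return fit, val, calib, eval_rows


train = list(range(10))
test = list(range(10, 15))
fit, val, calib, eval_rows = ratio_split(train, test, calib_ratio=0.2, val_ratio=0.1)
print(fit, val, calib, eval_rows)
```

Because the rows stay in chronological order, every calibration row comes after every fit row in both supported layouts.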
Backtesting with TwigaForecaster#
The TwigaForecaster.backtesting() method runs the full train → evaluate cycle over each fold:
from twiga.core.config import DataPipelineConfig, ExperimentConfig
from twiga.forecaster.core import TwigaForecaster
from twiga.models.ml.xgboost_model import XGBOOSTConfig
data_config = DataPipelineConfig(
target_feature="load_mw",
period="1h",
lookback_window_size=168,
forecast_horizon=48,
)
train_config = ExperimentConfig(
split_freq="months",
train_size=3,
test_size=1,
window="expanding",
)
forecaster = TwigaForecaster(
data_params=data_config,
model_params=[XGBOOSTConfig()],
cv_params=train_config,
)
predictions_df, metrics_df = forecaster.backtesting(
data=full_dataset,
train_ratio=1.0,
verbose=True,
ensemble_strategy="mean",
)
What Happens per Fold#
For each fold, backtesting():
1. Resets the data pipeline so scalers are re-fitted on this fold’s training window only.
2. Calls self.fit(train_df, val_df=val_df), which fits the pipeline and all models. val_df is supplied when val_size > 0 and is used for early stopping.
3. Routes evaluation through one of three paths depending on configuration:
   - Conformal path (calib_size > 0 and conformal_params set): calls calibrate(calib_df) then evaluate_interval_forecast(test_df).
   - Native interval path (eval_interval=True, no conformal): calls evaluate_interval_forecast(test_df) using intervals from the model’s own output.
   - Point path (default): calls evaluate_point_forecast(test_df).
4. Adds a Folds column to track which fold produced each result and concatenates all results.
sequenceDiagram
participant B as backtesting()
participant CV as TimeBasedCV.split()
participant F as fit()
participant C as calibrate()
participant E as evaluate()
B->>CV: Generate 4-way splits
loop For each fold
CV-->>B: (fit_df, val_df, calib_df, test_df)
B->>B: Reset data pipeline
B->>F: fit(fit_df, val_df)
F-->>B: Models trained
alt conformal path
B->>C: calibrate(calib_df)
C-->>B: Conformal scores computed
B->>E: evaluate_interval_forecast(test_df)
else native interval path
B->>E: evaluate_interval_forecast(test_df)
else point path
B->>E: evaluate_point_forecast(test_df)
end
E-->>B: (predictions_df, metrics_df)
B->>B: Append fold results
end
B-->>B: pd.concat(all_predictions, all_metrics)
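The per-fold routing shown above can be sketched as a small decision function. `choose_eval_path` and `fold_call_sequence` are hypothetical names used only for illustration. The real backtesting() method may order or guard these checks differently.

```python
def choose_eval_path(calib_size=0, conformal_params=None, eval_interval=False):
    """Pick the evaluation path for a fold from the configuration flags."""
    if calib_size > 0 and conformal_params is not None:
        return "conformal"
    if eval_interval:
        return "native_interval"
    return "point"


def fold_call_sequence(path):
    """List the method calls one fold would make, in order, for a given path."""
    calls = ["reset_pipeline", "fit"]
    if path == "conformal":
        calls += ["calibrate", "evaluate_interval_forecast"]
    elif path == "native_interval":
        calls.append("evaluate_interval_forecast")
    else:
        calls.append("evaluate_point_forecast")
    return calls


path = choose_eval_path(calib_size=2, conformal_params={"alpha": 0.1})
print(path, fold_call_sequence(path))
```

The `{"alpha": 0.1}` dictionary is a placeholder for whatever conformal settings are configured; only its presence matters in this sketch.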
Aggregating Results#
# Average metrics across folds
avg_metrics = metrics_df.groupby("Model")[["mae", "rmse", "smape"]].mean().round(3)
# Metrics per fold
fold_metrics = metrics_df.groupby(["Model", "Folds"])[["mae", "rmse"]].mean()
SplitState#
The SplitState class holds the time boundaries for a single split:
class SplitState:
    train_start: pd.Timestamp
    train_end: pd.Timestamp
    forecast_start: pd.Timestamp
    forecast_end: pd.Timestamp
The gap between train_end and forecast_start is controlled by the gap parameter.
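A minimal sketch of how the four boundaries relate, using the standard library's datetime in place of pd.Timestamp. `SplitStateSketch` and `make_state` are hypothetical names, and day-based sizes with half-open boundaries are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SplitStateSketch:
    """Time boundaries for one split (illustrative stand-in for SplitState)."""
    train_start: datetime
    train_end: datetime
    forecast_start: datetime
    forecast_end: datetime


def make_state(train_start, train_days, test_days, gap_days=0):
    """Derive all four boundaries from a start date and day-based sizes."""
    train_end = train_start + timedelta(days=train_days)
    # The gap pushes the forecast window away from the end of training.
    forecast_start = train_end + timedelta(days=gap_days)
    forecast_end = forecast_start + timedelta(days=test_days)
    return SplitStateSketch(train_start, train_end, forecast_start, forecast_end)


state = make_state(datetime(2024, 1, 1), train_days=14, test_days=7, gap_days=2)
print(state.train_end, state.forecast_start, state.forecast_end)
```

With `gap_days=0`, `forecast_start` coincides with `train_end`.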
Visualizing Splits#
TimeBasedCV provides a built-in visualization method (requires the plots dependency group):
cv.plot_split_scheme(data, train_ratio=1.0)
API Reference#
- class twiga.core.backtester.TimeBasedCV(split_freq, test_size, train_size=None, gap=0, stride=None, window='rolling', date_column='timestamp', num_splits=None, *, val_size=0, calib_size=0, calib_source='train_tail')#
Bases: TimeBasedSplit
Concrete time-based cross-validation implementation for pandas DataFrames.
This class creates splits based on a datetime column and returns train/test indices, along with the corresponding time periods.
- Variables:
date_column (str) – Name of the timestamp column in the DataFrame.
num_splits (int) – Optional number of splits to generate.
val_size (int) – Validation window size in split_freq units (default: 0).
calib_size (int) – Calibration window size in split_freq units (default: 0).
calib_source (str) – Where to draw the calibration set from. One of ‘train_tail’, ‘gap’, or ‘test_prefix’ (default: ‘train_tail’).
- property calib_delta: relativedelta#
Calibration window duration.
- duration_in_units(start, end, split_freq)#
Compute the duration between start and end in the specified split_freq units.
For ‘days’, ‘minutes’, ‘hours’, and ‘weeks’, a simple conversion based on timedelta is used. For ‘months’ and ‘years’, relativedelta is used to account for variable lengths.
- Parameters:
- Return type: int
- Returns:
Duration in the specified units.
- Raises:
ValueError – If split_freq is unsupported.
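The distinction between fixed-length and calendar units can be sketched with the standard library alone. `duration_in_units_sketch` is a hypothetical approximation: it counts whole elapsed units, assumes naive datetimes with end not earlier than start, and uses manual calendar arithmetic where the documented method uses relativedelta, so month-end edge cases may differ.

```python
from datetime import datetime, timedelta

# Units with a constant length can be handled by timedelta division.
_FIXED_UNITS = {
    "minutes": timedelta(minutes=1),
    "hours": timedelta(hours=1),
    "days": timedelta(days=1),
    "weeks": timedelta(weeks=1),
}


def duration_in_units_sketch(start, end, split_freq):
    """Return the number of whole split_freq units between start and end."""
    if split_freq in _FIXED_UNITS:
        return (end - start) // _FIXED_UNITS[split_freq]
    if split_freq in ("months", "years"):
        # Months and years vary in length, so count calendar months instead.
        months = (end.year - start.year) * 12 + (end.month - start.month)
        if (end.day, end.time()) < (start.day, start.time()):
            months -= 1  # the last month is not yet complete
        return months if split_freq == "months" else months // 12
    raise ValueError(f"Unsupported split_freq: {split_freq!r}")


start, end = datetime(2024, 1, 15), datetime(2024, 4, 14)
print(duration_in_units_sketch(start, end, "days"))
print(duration_in_units_sketch(start, end, "months"))
```

The same 90-day span counts as only two whole months because the third month would complete on 15 April.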
- get_scheme()#
Return the current split configuration.
- Return type: dict
- Returns:
A dictionary containing the train/test split indices and periods.
- Raises:
ValueError – If the split scheme has not been initialized.
- plot_split_scheme(data=None, train_ratio=1.0, start_dt=None, end_dt=None, title='Cross-validation split scheme', colors=None, alpha=0.88, x_ticks=6, font_size=10, line_width=0.8, x_axis_angle=30, legend_pos='top', legend_direction=None, legend_key_size=None, legend_border=False)#
Visualize the time series cross-validation split scheme.
Renders a Gantt-style plot with one horizontal bar per fold, colour-coded by segment (Train / Val / Calib / Test), styled with the Twiga theme.
- Parameters:
data (DataFrame | None) – Input DataFrame containing temporal data. Used to derive the split scheme when it has not been pre-computed.
train_ratio (float) – Proportion of training indices used for training; the remainder becomes a validation segment. Used only when val_idx is not available in the scheme (i.e. val_size=0).
start_dt (Timestamp | None) – Optional start timestamp passed to set_split_scheme.
end_dt (Timestamp | None) – Optional end timestamp passed to set_split_scheme.
title (str) – Plot title.
colors (dict[str, str] | None) – Custom colour mapping for segments. Keys must be title-case: "Train", "Val", "Calib", "Test".
alpha (float) – Bar transparency (0–1).
x_ticks (int) – Number of date ticks on the x-axis.
font_size (int) – Base font size in points.
line_width (float) – Axis line stroke width.
x_axis_angle (int) – Rotation angle for x-axis tick labels.
legend_pos (str) – Legend position: "top", "bottom", "left", "right", or "none".
legend_direction (str | None) – Arrange legend keys "horizontal" or "vertical". Passed to twiga_theme().
legend_key_size (int | None) – Size in pixels of the colour swatch in each legend key. Passed to twiga_theme().
legend_border (bool) – If True, draw a thin grey border box around the legend. Passed to twiga_theme().
- Returns:
A Lets-Plot ggplot object.
- Raises:
ValueError – If train_ratio is outside [0, 1] or the split scheme is not initialised and no data is provided.
Example
>>> splitter = TimeBasedCV(split_freq="days", test_size=5, train_size=20, date_column="date")
>>> splitter.set_split_scheme(data["date"])
>>> splitter.plot_split_scheme(data, train_ratio=0.8, title="CV Scheme")
- set_split_scheme(time_values, start_dt=None, end_dt=None)#
Calculate split indices from time series data.
The method sorts the datetime series, determines the time range to use, and computes indices for training and forecast periods based on the provided parameters. If num_splits is set, it adjusts train_size accordingly.
- split(data, start_dt=None, end_dt=None)#
Generate validated train/test splits.
- Parameters:
- Yields:
SplitBundle – Named tuple with fields (fit_df, val_df, calib_df, test_df, scheme, split_key). val_df and calib_df are None when the corresponding size is 0.
- Raises:
ValueError – If the required date column is missing or if the computed indices exceed data bounds.
- property val_delta: relativedelta#
Validation window duration.
- class twiga.core.backtester.TimeBasedSplit(split_freq, train_size, test_size, gap=0, stride=None, window='rolling')#
Bases: ABC
Abstract base class implementing core time-based splitting logic.
This class validates split parameters and provides properties to compute time deltas for the training period, forecast period, gap, and stride.
- Variables:
split_freq (str) – Time unit for splits (e.g., ‘days’, ‘months’).
train_size (int) – Training period length in split_freq units.
test_size (int) – Forecast period length in split_freq units.
gap (int) – Gap between train and forecast periods.
stride (int) – Step size between splits.
window (str) – Window type (‘rolling’ or ‘expanding’).
- __init__(split_freq, train_size, test_size, gap=0, stride=None, window='rolling')#
Initialize time-based split parameters.
- Parameters:
split_freq (str) – Time unit for splits (e.g., ‘days’, ‘months’).
train_size (int) – Training period length in split_freq units.
test_size (int) – Forecast period length in split_freq units.
gap (int) – Gap between training and forecast periods (default: 0).
stride (int | None) – Step size between splits (default: test_size).
window (str) – Window type (‘rolling’ or ‘expanding’) (default: ‘rolling’).
- Raises:
ValueError – If any parameter is invalid. In particular, train_size must be a positive integer that is greater than or equal to test_size.
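The documented checks can be sketched as a standalone function. `validate_split_params` is a hypothetical helper, not the actual __init__ body. The allowed unit names are taken from the duration_in_units description above, while the non-negative gap and positive stride checks are assumptions.

```python
ALLOWED_FREQS = {"minutes", "hours", "days", "weeks", "months", "years"}
ALLOWED_WINDOWS = {"rolling", "expanding"}


def validate_split_params(split_freq, train_size, test_size,
                          gap=0, stride=None, window="rolling"):
    """Validate split parameters and return them with defaults resolved."""
    if split_freq not in ALLOWED_FREQS:
        raise ValueError(f"Unsupported split_freq: {split_freq!r}")
    if not isinstance(test_size, int) or test_size <= 0:
        raise ValueError("test_size must be a positive integer")
    if not isinstance(train_size, int) or train_size < test_size:
        raise ValueError("train_size must be a positive integer >= test_size")
    if gap < 0:
        raise ValueError("gap must be non-negative")  # assumed constraint
    if stride is None:
        stride = test_size  # documented default: step by one test window
    if stride <= 0:
        raise ValueError("stride must be positive")  # assumed constraint
    if window not in ALLOWED_WINDOWS:
        raise ValueError(f"Unsupported window: {window!r}")
    return {"split_freq": split_freq, "train_size": train_size,
            "test_size": test_size, "gap": gap, "stride": stride, "window": window}


params = validate_split_params("days", train_size=14, test_size=7)
print(params["stride"])
```

Resolving `stride` to `test_size` by default means consecutive test windows tile the timeline without overlapping.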
- property forecast_delta: relativedelta#
Calculate forecast period duration.
- property gap_delta: relativedelta#
Calculate gap duration.
- abstractmethod split(data)#
Generate train/test splits from data.
- property stride_delta: relativedelta#
Calculate stride duration.
- property train_delta: relativedelta#
Calculate training period duration.