Backtesting & Robust Evaluation#
What you’ll build
A multi-fold backtesting pipeline for LightGBM using both expanding-window and rolling-window strategies, with per-fold metric visualisation and a strategy comparison table that shows which approach is more appropriate for your problem.
Prerequisites
01 - Getting Started (DataPipelineConfig, ForecasterConfig, TwigaForecaster.fit)
05 - ML Point Forecasting (multi-model training and metric interpretation)
Python: pandas basics, list comprehensions
Learning objectives
By the end of this notebook you will be able to:
Explain why a single train/test split is insufficient for time series evaluation
Configure expanding-window and rolling-window cross-validation via ForecasterConfig
Visualise the split scheme with plot_split_scheme() to confirm no data leakage
Interpret per-fold metric distributions and identify unstable models
Decide between expanding and rolling strategies based on dataset characteristics
1 Setup#
import warnings
warnings.filterwarnings("ignore")
from great_tables import GT
from lets_plot import LetsPlot
import pandas as pd
LetsPlot.setup_html()
from twiga.core.plot import (
plot_forecast,
plot_forecast_grid,
plot_metrics_bar,
)
from twiga.core.utils import configure, get_logger
configure()
log = get_logger("tutorials")
data = pd.read_parquet("../data/MLVS-PT.parquet", columns=["timestamp", "NetLoad(kW)", "Ghi", "Temperature"])
data["timestamp"] = pd.to_datetime(data["timestamp"])
# Restrict to 2019-2020 to keep tutorial execution fast
data = data[(data["timestamp"] >= "2019-01-01") & (data["timestamp"] <= "2020-12-31")].reset_index(drop=True)
log.info("Dataset shape : %s", data.shape)
log.info("Date range : %s -> %s", data["timestamp"].min(), data["timestamp"].max())
GT(data.round(2).head())
2 Why a single train/test split is insufficient#
Key concept - temporal leakage
In standard ML, shuffling data before splitting is common practice. For time series this is catastrophic: if any training sample is drawn from after the test period, the model has effectively “seen the future” during training. Twiga’s TimeBasedCV prevents this by construction - every training fold ends strictly before its corresponding test fold starts.
Key concept - why backtesting matters
A single train/test split gives one performance number. That number depends entirely on which period you happened to pick - a mild winter month looks very different from a heat-wave month for net-load forecasting. Backtesting replaces that single estimate with a distribution of performance scores across many non-overlapping test windows. The mean is more reliable; the variance tells you whether the model is consistent or seasonal-brittle.
Time series data also drifts: seasons change, demand patterns evolve, solar irradiance profiles shift. A model trained on data from January may generalise poorly to August.
Key concept - expanding-window cross-validation
Instead of one fixed test window, Twiga evaluates across multiple consecutive folds. With window="expanding", the training set grows with each fold - with train_size=3, Fold 1 trains on 3 months, Fold 2 on 4 months, and so on. Each fold tests on a fresh period the model never saw. The final reported metric is the mean across all folds, which is far more reliable than a single-split estimate and reveals whether performance is stable over time.
Time-based cross-validation addresses both problems:
We evaluate across multiple non-overlapping test windows - this reduces the variance of our performance estimate and reveals how consistent the model is across time.
We always respect the arrow of time: training data always precede test data, so there is no data leakage.
Twiga implements this via TimeBasedCV, configured through ForecasterConfig. Two window
strategies are available - the table below summarises the trade-offs.
from great_tables import GT, md
from twiga.core.plot.gt import twiga_gt
strategy_df = pd.DataFrame(
{
"Strategy": ["Expanding", "Rolling (sliding)"],
"Description": [
"Training set grows with each fold — Fold 1: N months, Fold 2: N+1 months, ...",
"Training window slides forward at fixed size — always trains on exactly N months",
],
"Best for": [
"Stable distributions — older data is still informative",
"Drifting distributions — recent data matters more than old data",
],
"Trade-off": [
"More data per fold → lower variance, but may include stale patterns",
"Fixed data volume → consistent training conditions, but discards early data",
],
}
)
twiga_gt(
GT(strategy_df)
.tab_header(
title=md("**Cross-Validation Window Strategies**"),
subtitle="Expanding vs. Rolling — when to use which",
)
.cols_label(
Strategy=md("**Strategy**"),
Description=md("**Description**"),
**{"Best for": md("**Best for**")},
**{"Trade-off": md("**Trade-off**")},
)
.tab_source_note("Twiga Forecast"),
n_rows=len(strategy_df),
)
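To make the trade-offs concrete before plotting them, here is a minimal, self-contained sketch - plain pandas, not Twiga's TimeBasedCV - that prints the fold boundaries each strategy would produce on a monthly grid (dates and sizes are illustrative):
import pandas as pd
# Illustrative fold boundaries on a monthly grid (not Twiga's actual splitter).
months = pd.period_range("2019-01", periods=8, freq="M")
train_size, test_size, n_folds = 3, 1, 3
for strategy in ("expanding", "rolling"):
    print(f"--- {strategy} ---")
    for fold in range(n_folds):
        start = 0 if strategy == "expanding" else fold  # rolling slides the train start forward
        train = months[start : train_size + fold]  # expanding grows; rolling stays at train_size
        test = months[train_size + fold : train_size + fold + test_size]
        print(f"Fold {fold + 1}: train {train[0]}..{train[-1]} -> test {test[0]}")
Note how every test month starts strictly after its training window ends - the leakage-free property TimeBasedCV guarantees by construction.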
from lets_plot import (
aes,
facet_wrap,
geom_rect,
ggplot,
ggsize,
labs,
scale_fill_manual,
scale_y_continuous,
)
import pandas as pd
from twiga.core.plot.theme import TWIGA_PALETTE, twiga_theme
_N_FOLDS = 5
_TRAIN_SIZE = 3 # relative time units
_TEST_SIZE = 1
_BAR_H = 0.55
_FILL = {"Train": TWIGA_PALETTE[0], "Test": TWIGA_PALETTE[1]}
def _make_cv_df(strategy: str) -> pd.DataFrame:
rows = []
for fold in range(_N_FOLDS):
y = fold + 1
t_start = 0 if strategy == "Expanding" else fold
t_end = _TRAIN_SIZE + fold if strategy == "Expanding" else fold + _TRAIN_SIZE
rows += [
{
"xmin": t_start,
"xmax": t_end,
"ymin": y - _BAR_H / 2,
"ymax": y + _BAR_H / 2,
"segment": "Train",
"strategy": strategy,
},
{
"xmin": t_end,
"xmax": t_end + _TEST_SIZE,
"ymin": y - _BAR_H / 2,
"ymax": y + _BAR_H / 2,
"segment": "Test",
"strategy": strategy,
},
]
return pd.DataFrame(rows)
_df = pd.concat([_make_cv_df("Expanding"), _make_cv_df("Rolling")], ignore_index=True)
_df["strategy"] = _df["strategy"].map(
{
"Expanding": "Expanding window (train grows each fold)",
"Rolling": "Rolling window (fixed train size, slides forward)",
}
)
(
ggplot(_df)
+ geom_rect(
aes(xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax", fill="segment"),
size=0,
alpha=0.88,
)
+ facet_wrap("strategy", ncol=1)
+ scale_fill_manual(values=_FILL)
+ scale_y_continuous(
breaks=list(range(1, _N_FOLDS + 1)),
labels=[f"Fold {i}" for i in range(1, _N_FOLDS + 1)],
)
+ labs(x="Time (months)", y="", fill="", title="Cross-validation strategies")
+ twiga_theme(grid=False, legend_pos="top")
+ ggsize(980, 600)
)
3 ForecasterConfig: the split parameters#
All cross-validation behaviour is controlled by ForecasterConfig. The table below documents each parameter.
from great_tables import GT, md
import pandas as pd
from twiga.core.plot.gt import twiga_gt
params_df = pd.DataFrame(
{
"Parameter": ["split_freq", "train_size", "test_size", "window", "stride", "num_splits", "gap", "project_name"],
"Type": ["str", "int", "int", "str", "int", "int", "int", "str"],
"Meaning": [
'Unit for all size parameters ("months", "days", "hours", …)',
"How many split_freq units each training fold covers",
"How many split_freq units each test fold covers",
'"expanding" (train grows) or "rolling" (fixed window slides)',
"How far the test window moves forward between folds",
"Maximum number of folds to generate",
"Unused periods between train end and test start (prevents leakage on slow-moving features)",
"Identifier used for logging and artefact paths",
],
}
)
twiga_gt(
GT(params_df)
.tab_header(
title=md("**ForecasterConfig — Cross-Validation Parameters**"),
subtitle="All temporal splits are controlled by these fields",
)
.cols_label(
Parameter=md("**Parameter**"),
Type=md("**Type**"),
Meaning=md("**Meaning**"),
)
.tab_source_note("Twiga Forecast"),
n_rows=len(params_df),
)
from sklearn.preprocessing import RobustScaler, StandardScaler
from twiga.core.config import DataPipelineConfig, ForecasterConfig
data_config = DataPipelineConfig(
target_feature="NetLoad(kW)",
period="30min",
latitude=32.371666,
longitude=-16.274998,
calendar_features=["hour", "day_night"],
exogenous_features=["Ghi"],
forecast_horizon=48,
lookback_window_size=96,
input_scaler=StandardScaler(),
target_scaler=RobustScaler(),
)
train_config = ForecasterConfig(
split_freq="months",
train_size=3,
test_size=1,
window="expanding",
stride=1,
num_splits=3,
project_name="backtesting-tutorial",
)
log.info("DataPipelineConfig:\n%s", data_config.model_dump(exclude_none=True))
log.info("ForecasterConfig:\n%s", train_config.model_dump(exclude_none=True))
4 Instantiate the forecaster#
from twiga import TwigaForecaster
from twiga.models.ml import LIGHTGBMConfig
forecaster = TwigaForecaster(
data_params=data_config,
model_params=[LIGHTGBMConfig()],
train_params=train_config,
)
log.info("Forecaster ready.")
log.info(" window : %s", forecaster.window)
log.info(" splits : %s", forecaster.num_splits)
log.info(" train_sz : %s %s", forecaster.train_size, forecaster.split_freq)
log.info(" test_sz : %s %s", forecaster.test_size, forecaster.split_freq)
5 Visualising the split scheme#
Before training a single model it is good practice to inspect the split layout. Twiga provides
plot_split_scheme(), which returns a lets_plot chart. Calling it with data=data also
initialises the internal split state, so the same scheme is reused during backtesting.
plot = forecaster.plot_split_scheme(data=data)
plot
The chart shows the train (blue) and test (orange) windows for each fold.
Because we chose window="expanding", each blue bar is longer than the previous one -
the model sees progressively more history as we move forward in time.
6 Running backtesting#
from IPython.display import clear_output
forecast_df, metrics_df = forecaster.backtesting(data=data)
clear_output()
log.info("forecast_df shape: %s", forecast_df.shape)
log.info("metrics_df shape : %s", metrics_df.shape)
log.info("metrics_df columns: %s", metrics_df.columns.tolist())
log.info("=== Forecast DataFrame (first 5 rows) ===\n%s", forecast_df.head().to_string())
log.info("=== Metrics DataFrame (first 5 rows) ===\n%s", metrics_df.head().to_string())
What the DataFrames contain#
backtesting() returns two DataFrames. The tables below document each column.
from great_tables import GT, md
import pandas as pd
from twiga.core.plot.gt import twiga_gt
forecast_schema = pd.DataFrame(
{
"Column": ["Model", "target", "forecast", "Actual", "Folds"],
"Description": [
"Model name in upper-case (e.g. LIGHTGBM)",
"Target variable name from DataPipelineConfig",
"Model point prediction",
"Observed value from the test set",
"Fold index (1-based) indicating which CV fold this row belongs to",
],
}
)
twiga_gt(
GT(forecast_schema)
.tab_header(
title=md("**`forecast_df` — one row per timestep per model**"),
subtitle="Indexed by UTC timestamp",
)
.cols_label(
Column=md("**Column**"),
Description=md("**Description**"),
)
.tab_source_note("Twiga Forecast"),
n_rows=len(forecast_schema),
)
metrics_schema = pd.DataFrame(
{
"Column": ["target", "Model", "mae", "rmse", "corr", "wmape", "smape", "nbias", "Folds"],
"Description": [
"Target variable name",
"Model name in upper-case",
"Mean absolute error (same units as target)",
"Root-mean-squared error (penalises large spikes)",
"Pearson correlation between forecast and actual",
"Weighted mean absolute percentage error",
"Symmetric mean absolute percentage error",
"Normalised bias (positive = systematic over-forecasting)",
"Fold index (1-based)",
],
}
)
twiga_gt(
GT(metrics_schema)
.tab_header(
title=md("**`metrics_df` — one row per day per model**"),
subtitle="Indexed by daily timestamp",
)
.cols_label(
Column=md("**Column**"),
Description=md("**Description**"),
)
.tab_source_note("Twiga Forecast"),
n_rows=len(metrics_schema),
)
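As a sanity check on the schemas above, we can recompute a fold-level MAE directly from forecast_df and compare it against metrics_df (agreement will be approximate, since metrics_df averages daily scores):
# Recompute Fold 1 MAE from the raw forecasts; it should closely match the reported value.
fold1 = forecast_df[forecast_df["Folds"] == 1]
manual_mae = (fold1["forecast"] - fold1["Actual"]).abs().mean()
reported_mae = metrics_df.loc[metrics_df["Folds"] == 1, "mae"].mean()
log.info("Fold 1 MAE - recomputed: %.3f | reported (mean of daily): %.3f", manual_mae, reported_mae)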
7 Per-fold results#
fold_col = "Folds" if "Folds" in metrics_df.columns else ("fold" if "fold" in metrics_df.columns else None)
if fold_col is not None:
fold_metrics = metrics_df.groupby(["Model", fold_col])[["mae", "rmse", "corr"]].mean().round(3)
log.info("Per-fold metrics (MAE / RMSE / Corr):\n%s", fold_metrics.to_string())
else:
log.info("No fold column found. Available columns: %s", metrics_df.columns.tolist())
agg_metrics = metrics_df.groupby("Model")[["mae", "rmse", "corr"]].mean().round(3)
log.info("Aggregate metrics (mean over all folds):\n%s", agg_metrics.to_string())
Interpreting per-fold variance#
Interpretation - Scan the per-fold MAE column. A stable spread (e.g., all folds within ±0.5 kW of the mean) indicates the model generalises well across seasons. A large spike in one fold (e.g., Fold 2 MAE = 3.9 vs. the 3.3 - 3.7 range for others) points to a specific time window where the model struggles - investigate what was different in that period (unusual weather, holiday effects, data quality issues).
High variance across folds (e.g., Fold 1 MAE = 50 but Fold 3 MAE = 120) indicates that performance degrades as conditions drift away from the training period. If you observe this pattern, consider the remedies below; the sketch after the list quantifies the fold-to-fold spread directly.
Adding more seasonal calendar features
Retraining more frequently in production
Switching from expanding to rolling window to prevent stale data from dominating
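A short sketch that computes that spread, reusing fold_col from section 7; the 10-15% coefficient-of-variation threshold is a rule of thumb, not a Twiga convention:
# Dispersion of fold-level MAE per model; a high std relative to the mean flags instability.
if fold_col is not None:
    fold_mae_stats = (
        metrics_df.groupby(["Model", fold_col])["mae"]
        .mean()
        .groupby(level="Model")
        .agg(["mean", "std", "min", "max"])
    )
    fold_mae_stats["cv_%"] = (100 * fold_mae_stats["std"] / fold_mae_stats["mean"]).round(1)
    log.info("Fold stability (investigate cv_%% above ~10-15):\n%s", fold_mae_stats.round(3).to_string())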
8 Reading the performance table#
twiga_report wraps Great Tables to produce a colour-highlighted summary.
Teal-highlighted cells mark the best value for each metric (lower is better for error metrics, higher is better for correlation).
from twiga.core.plot.gt import twiga_report
res = metrics_df.groupby("Model")[["mae", "corr", "nbias", "rmse", "wmape", "smape"]].mean().round(2).reset_index()
res = res.rename(
columns={
"mae": "MAE",
"corr": "Corr",
"wmape": "WMAPE",
"smape": "SMAPE",
"nbias": "NBIAS",
"rmse": "RMSE",
}
)
twiga_report(
res,
metrics=["MAE", "Corr", "SMAPE", "RMSE"],
minimize_cols=["MAE", "SMAPE", "RMSE"],
maximize_cols=["Corr"],
)
Interpretation - The teal-highlighted cells mark the best value per column. For a single-model backtesting run, all highlights will be in the LIGHTGBM row by definition. The value of this table grows when you add multiple models (see NB04): compare MAE, RMSE, and Corr together - a model that wins on MAE but loses on Corr may be fitting noise rather than signal. An RMSE significantly larger than MAE indicates sporadic large errors; investigate those forecast timesteps directly in
forecast_df.
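Following that advice, a small sketch that surfaces the timesteps with the largest absolute errors - the rows that inflate RMSE relative to MAE:
# The handful of worst timesteps usually explains an RMSE >> MAE gap.
errors = forecast_df.assign(abs_err=(forecast_df["forecast"] - forecast_df["Actual"]).abs())
worst = errors.nlargest(5, "abs_err")[["Model", "forecast", "Actual", "Folds", "abs_err"]]
log.info("Five largest absolute errors:\n%s", worst.round(2).to_string())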
9 Expanding vs sliding window#
A sliding (rolling) window keeps the training set at a fixed size. This means the model is always trained on the same amount of data, which can reduce bias when older data is no longer representative of current conditions.
Let us run the same experiment with window="rolling" and compare aggregate MAE.
from sklearn.preprocessing import RobustScaler, StandardScaler
from twiga import TwigaForecaster
from twiga.core.config import DataPipelineConfig, ForecasterConfig
from twiga.models.ml import LIGHTGBMConfig
sliding_config = ForecasterConfig(
split_freq="months",
train_size=3,
test_size=1,
window="rolling", # fixed-size training window
stride=1,
num_splits=3,
project_name="backtesting-sliding",
)
# Fresh data config (same parameters)
data_config_sliding = DataPipelineConfig(
target_feature="NetLoad(kW)",
period="30min",
latitude=32.371666,
longitude=-16.274998,
calendar_features=["hour", "day_night"],
exogenous_features=["Ghi"],
forecast_horizon=48,
lookback_window_size=96,
input_scaler=StandardScaler(),
target_scaler=RobustScaler(),
)
forecaster_sliding = TwigaForecaster(
data_params=data_config_sliding,
model_params=[LIGHTGBMConfig()],
train_params=sliding_config,
)
# Visualise sliding scheme
plot_sliding = forecaster_sliding.plot_split_scheme(data=data)
plot_sliding
from IPython.display import clear_output
forecast_sliding_df, metrics_sliding_df = forecaster_sliding.backtesting(data=data)
clear_output()
mae_expanding = metrics_df.groupby("Model")["mae"].mean().round(2)
mae_sliding = metrics_sliding_df.groupby("Model")["mae"].mean().round(2)
comparison = pd.DataFrame(
{
"Expanding MAE": mae_expanding,
"Sliding MAE": mae_sliding,
}
)
comparison["Difference"] = (comparison["Sliding MAE"] - comparison["Expanding MAE"]).round(2)
log.info("Strategy comparison — aggregate MAE across 3 folds:\n%s", comparison.to_string())
Interpretation - A negative “Difference” (Sliding MAE < Expanding MAE) means rolling window outperforms expanding on this dataset, suggesting that recent data is more predictive than older history. A difference of < 0.1 kW is within noise - prefer the simpler expanding strategy unless you have a strong domain reason to believe the distribution is drifting. If the difference is large (> 0.5 kW), investigate when the regime changed by plotting fold-level MAE over time.
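To see where the two strategies diverge, compare fold-level MAE side by side - a sketch using the two metrics DataFrames produced above:
# Fold-by-fold comparison: divergence in later folds hints at distribution drift.
per_fold_mae = pd.concat(
    {
        "Expanding": metrics_df.groupby("Folds")["mae"].mean(),
        "Rolling": metrics_sliding_df.groupby("Folds")["mae"].mean(),
    },
    axis=1,
).round(2)
log.info("Per-fold MAE by strategy:\n%s", per_fold_mae.to_string())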
Which strategy to choose?#
The right strategy depends on how your data distribution evolves over time.
from great_tables import GT, md
from twiga.core.plot.gt import twiga_gt
decision_df = pd.DataFrame(
{
"Situation": [
"Distribution is stable over time",
"Concept drift / seasonal regime changes",
"Limited data (< 1 year)",
],
"Recommended strategy": ["Expanding", "Rolling", "Expanding"],
"Reason": [
"More data per fold → lower variance, older data is still informative",
"Model stays focused on recent patterns, discards stale distributions",
"Avoids throwing away useful early observations; every row counts",
],
}
)
twiga_gt(
GT(decision_df)
.tab_header(
title=md("**Strategy Selection Guide**"),
subtitle="When to use expanding vs. rolling window",
)
.cols_label(
Situation=md("**Situation**"),
**{"Recommended strategy": md("**Recommended**")},
Reason=md("**Reason**"),
)
.tab_source_note("Twiga Forecast"),
n_rows=len(decision_df),
)
10 Plotting the forecast over time#
A visual sanity-check: overlay the model forecast against actuals for the first week of backtesting results (7 days × 48 half-hour steps = 336 rows per target).
p = plot_forecast(
forecast_df.reset_index(),
date_col="timestamp",
actual_col="Actual",
forecast_col="forecast",
n_samples=7 * 48,
title="Backtesting forecast — first 7 days",
y_label="Net Load (kW)",
)
p # noqa: B018
# Per-fold MAE bar chart
if fold_col is not None:
fold_mae = metrics_df.groupby(["Model", fold_col])["mae"].mean().reset_index()
fold_mae_renamed = fold_mae.rename(columns={"mae": "MAE", fold_col: "Fold"})
fold_mae_renamed["Model"] = fold_mae_renamed["Model"] + " (Fold " + fold_mae_renamed["Fold"].astype(str) + ")"
p = plot_metrics_bar(
fold_mae_renamed,
metric_col="MAE",
model_col="Model",
lower_is_better=True,
title="MAE per fold — expanding window",
x_label="MAE",
horizontal=True,
)
p # noqa: B018
Wrapping up#
What you did
Loaded the MLVS-PT dataset and understood why a single split is unreliable
Configured ForecasterConfig with expanding window, 3 folds, 1-month test periods
Visualised the CV split scheme with plot_split_scheme() before training
Ran forecaster.backtesting() and inspected both forecast_df and metrics_df
Analysed per-fold MAE variance and diagnosed fold-level instability
Compared expanding vs. rolling window strategies on aggregate MAE
Read the
twiga_reportoutput and interpreted teal-highlighted cells
Key takeaways
A single train/test split is a point estimate - backtesting gives you a distribution of scores and reveals seasonal instability.
Always visualise splits before training; plot_split_scheme() catches misconfigured train_size / test_size before you waste compute.
Per-fold variance is as important as the mean - a model with low mean MAE but high variance is unreliable in production.
Expanding window suits stable distributions; rolling window suits drifting ones - let the fold-level variance guide your choice.
Save forecast_df and metrics_df to parquet after a long backtesting run so you can re-analyse without re-training (a minimal sketch follows).
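That save step, with illustrative file names:
# Persist backtesting outputs; re-load later with pd.read_parquet without re-training.
forecast_df.to_parquet("backtesting_forecasts.parquet")
metrics_df.to_parquet("backtesting_metrics.parquet")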
What’s next?#
Learn how to train deep learning architectures (MLPF, MLPGAM, MLPGAF, N-HiTS) using the same TwigaForecaster API, and compare NN models against the ML baselines you evaluated here.
If you want to jump ahead to automated hyperparameter search, see NB10 - Hyperparameter Tuning, which covers
forecaster.tune(), Bayesian optimisation, and resumable SQLite studies.
# ruff: noqa: E501, E701, E702
from IPython.display import HTML
_TEAL = "#107591"
_TEAL_MID = "#069fac"
_TEAL_LIGHT = "#e8f5f8"
_TEAL_BEST = "#d0ecf1"
_TEXT_DARK = "#2d3748"
_TEXT_MUTED = "#718096"
_WHITE = "#ffffff"
steps = [
{
"num": "04",
"title": "Time Series Differencing",
"desc": "Stationarity · differencing · inversion",
"tags": ["differencing", "stationarity"],
"active": False,
},
{
"num": "05",
"title": "ML Point Forecasting",
"desc": "CatBoost · XGBoost · LightGBM · model comparison",
"tags": ["catboost", "xgboost", "lightgbm"],
"active": False,
},
{
"num": "06",
"title": "Backtesting & Evaluation",
"desc": "Rolling-window backtesting · fold-level metrics · strategy comparison",
"tags": ["backtesting", "evaluation", "cross-validation"],
"active": True,
},
{
"num": "07",
"title": "Neural Networks",
"desc": "MLPF · N-HiTS · Lightning training loop",
"tags": ["neural network", "pytorch", "lightning"],
"active": False,
},
{
"num": "08",
"title": "Quantile Regression",
"desc": "First probabilistic step — prediction intervals",
"tags": ["probabilistic", "quantile", "intervals"],
"active": False,
},
]
track_name = "Point Forecasting Track"
footer = 'Next: explore <span style="color:#107591;font-weight:600;">neural networks</span> (07) or jump to <span style="color:#107591;font-weight:600;">probabilistic forecasting</span> (08–10).'
def _b(t, bg, fg):
return f'<span style="display:inline-block;background:{bg};color:{fg};font-size:10px;font-weight:600;padding:2px 7px;border-radius:10px;margin:2px 2px 0 0;">{t}</span>'
ch = ""
for i, s in enumerate(steps):
a = s["active"]
cb = _TEAL if a else _WHITE
cbo = _TEAL if a else "#d1ecf1"
nb = _TEAL_MID if a else _TEAL_LIGHT
nf = _WHITE if a else _TEAL
tf = _WHITE if a else _TEXT_DARK
df = "#cce8ef" if a else _TEXT_MUTED
bb = "#0d5f75" if a else _TEAL_BEST
bf = "#b8e4ed" if a else _TEAL
yh = (
f'<span style="float:right;background:{_TEAL_MID};color:{_WHITE};font-size:10px;font-weight:700;padding:2px 10px;border-radius:12px;">★ you are here</span>'
if a
else ""
)
bdg = "".join(_b(t, bb, bf) for t in s["tags"])
ch += f'<div style="background:{cb};border:2px solid {cbo};border-radius:12px;padding:16px 20px;display:flex;align-items:flex-start;gap:16px;box-shadow:{"0 4px 14px rgba(16,117,145,.25)" if a else "0 1px 4px rgba(0,0,0,.06)"};"><div style="min-width:44px;height:44px;background:{nb};color:{nf};border-radius:50%;display:flex;align-items:center;justify-content:center;font-size:15px;font-weight:800;flex-shrink:0;">{s["num"]}</div><div style="flex:1;"><div style="font-size:15px;font-weight:700;color:{tf};margin-bottom:4px;">{s["title"]}{yh}</div><div style="font-size:12.5px;color:{df};margin-bottom:8px;line-height:1.5;">{s["desc"]}</div><div>{bdg}</div></div></div>'
if i < len(steps) - 1:
ch += f'<div style="display:flex;justify-content:center;height:32px;"><svg width="24" height="32" viewBox="0 0 24 32" fill="none"><line x1="12" y1="0" x2="12" y2="24" stroke="{_TEAL_MID}" stroke-width="2" stroke-dasharray="4 3"/><polygon points="6,20 18,20 12,30" fill="{_TEAL_MID}"/></svg></div>'
HTML(
f'<div style="font-family:Inter,\'Segoe UI\',sans-serif;max-width:640px;margin:8px 0;"><div style="background:linear-gradient(135deg,{_TEAL} 0%,{_TEAL_MID} 100%);border-radius:12px 12px 0 0;padding:14px 20px;display:flex;align-items:center;gap:10px;"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="{_WHITE}" stroke-width="2"><path d="M12 2L2 7l10 5 10-5-10-5z"/><path d="M2 17l10 5 10-5"/><path d="M2 12l10 5 10-5"/></svg><span style="color:{_WHITE};font-size:14px;font-weight:700;">Twiga Learning Path — {track_name}</span></div><div style="border:2px solid {_TEAL_LIGHT};border-top:none;border-radius:0 0 12px 12px;padding:20px 20px 16px;background:#f9fdfe;display:flex;flex-direction:column;">{ch}<div style="margin-top:16px;font-size:11.5px;color:{_TEXT_MUTED};text-align:center;border-top:1px solid {_TEAL_LIGHT};padding-top:12px;">{footer}</div></div></div>'
)