Backtesting & Robust Evaluation#



What you’ll build

A multi-fold backtesting pipeline for LightGBM using both expanding-window and rolling-window strategies, with per-fold metric visualisation and a strategy comparison table that shows which approach is more appropriate for your problem.

Prerequisites

  • 01 - Getting Started (DataPipelineConfig, ForecasterConfig, TwigaForecaster.fit)

  • 05 - ML Point Forecasting (multi-model training and metric interpretation)

  • Python: pandas basics, list comprehensions

Learning objectives

By the end of this notebook you will be able to:

  1. Explain why a single train/test split is insufficient for time series evaluation

  2. Configure expanding-window and rolling-window cross-validation via ForecasterConfig

  3. Visualise the split scheme with plot_split_scheme() to confirm no data leakage

  4. Interpret per-fold metric distributions and identify unstable models

  5. Decide between expanding and rolling strategies based on dataset characteristics

1 Setup#

import warnings

warnings.filterwarnings("ignore")

from great_tables import GT
from lets_plot import LetsPlot
import pandas as pd

LetsPlot.setup_html()

from twiga.core.plot import (
    plot_forecast,
    plot_forecast_grid,
    plot_metrics_bar,
)
from twiga.core.utils import configure, get_logger

configure()
log = get_logger("tutorials")
data = pd.read_parquet("../data/MLVS-PT.parquet", columns=["timestamp", "NetLoad(kW)", "Ghi", "Temperature"])
data["timestamp"] = pd.to_datetime(data["timestamp"])

# Restrict to 2019-2020 to keep tutorial execution fast
data = data[(data["timestamp"] >= "2019-01-01") & (data["timestamp"] <= "2020-12-31")].reset_index(drop=True)

log.info("Dataset shape : %s", data.shape)
log.info("Date range    : %s -> %s", data["timestamp"].min(), data["timestamp"].max())
GT(data.round(2).head())

2 Why a single train/test split is insufficient#

Key concept - temporal leakage

In standard ML, shuffling data before splitting is common practice. For time series this is catastrophic: if any training sample is drawn from after the test period, the model has effectively “seen the future” during training. Twiga’s TimeBasedCV prevents this by construction - every training fold ends strictly before its corresponding test fold starts.
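To make the leakage failure mode concrete, here is a minimal, library-free sketch (not Twiga code) that checks the invariant TimeBasedCV guarantees by construction: every training timestamp must precede the first test timestamp.

```python
import pandas as pd

# Toy half-hourly series spanning roughly three weeks
idx = pd.date_range("2019-01-01", periods=1000, freq="30min")
df = pd.DataFrame({"timestamp": idx, "y": range(1000)})


def leaks(train: pd.DataFrame, test: pd.DataFrame) -> bool:
    """True if any training row is from on/after the start of the test period."""
    return bool(train["timestamp"].max() >= test["timestamp"].min())


# Shuffled split (standard ML practice) -> the model "sees the future"
shuffled = df.sample(frac=1.0, random_state=0)
train_s, test_s = shuffled.iloc[:800], shuffled.iloc[800:]
print(leaks(train_s, test_s))  # True -> leakage

# Time-ordered split -> training strictly precedes testing
train_t, test_t = df.iloc[:800], df.iloc[800:]
print(leaks(train_t, test_t))  # False -> no leakage
```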

Key concept - why backtesting matters

A single train/test split gives one performance number. That number depends entirely on which period you happened to pick - a mild winter month looks very different from a heat-wave month for net-load forecasting. Backtesting replaces that single estimate with a distribution of performance scores across many non-overlapping test windows. The mean is more reliable; the variance tells you whether the model is consistent or seasonally brittle.

Time series data also drifts: seasons change, demand patterns evolve, solar irradiance profiles shift. A model trained on data from January may generalise poorly to August.

Key concept - expanding-window cross-validation

Instead of one fixed test window, Twiga evaluates across multiple consecutive folds. With window="expanding", the training set grows with each fold - Fold 1 trains on 5 months, Fold 2 on 6 months, and so on. Each fold tests on a fresh period the model never saw. The final reported metric is the mean across all folds, which is far more reliable than a single-split estimate and reveals whether performance is stable over time.

Time-based cross-validation addresses both problems:

  1. We evaluate across multiple non-overlapping test windows - this reduces the variance of our performance estimate and reveals how consistent the model is across time.

  2. We always respect the arrow of time: training data always precede test data, so there is no data leakage.

Twiga implements this via TimeBasedCV, configured through ForecasterConfig. Two window strategies are available - the table below summarises the trade-offs.
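The fold geometry itself can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not Twiga's TimeBasedCV; it works on integer period indices rather than timestamps, but it produces the same expanding and rolling layouts.

```python
def split_bounds(n_periods, train_size, test_size,
                 window="expanding", stride=1, num_splits=None):
    """Return (train_start, train_end, test_start, test_end) per fold.

    Integer-index sketch of the two window strategies; the real
    TimeBasedCV operates on timestamps, but the geometry is identical.
    """
    folds, fold = [], 0
    while True:
        test_start = train_size + fold * stride
        test_end = test_start + test_size
        if test_end > n_periods or (num_splits is not None and fold >= num_splits):
            break
        train_start = 0 if window == "expanding" else fold * stride
        train_end = test_start  # no gap in this sketch
        folds.append((train_start, train_end, test_start, test_end))
        fold += 1
    return folds


# Expanding: the training window grows by `stride` each fold
print(split_bounds(8, train_size=3, test_size=1, window="expanding", num_splits=3))
# [(0, 3, 3, 4), (0, 4, 4, 5), (0, 5, 5, 6)]

# Rolling: the training window keeps a fixed size and slides forward
print(split_bounds(8, train_size=3, test_size=1, window="rolling", num_splits=3))
# [(0, 3, 3, 4), (1, 4, 4, 5), (2, 5, 5, 6)]
```

Comparing the two outputs shows the only difference: the expanding variant pins `train_start` at 0, while the rolling variant slides it forward with each fold.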

from great_tables import GT, md

from twiga.core.plot.gt import twiga_gt

strategy_df = pd.DataFrame(
    {
        "Strategy": ["Expanding", "Rolling (sliding)"],
        "Description": [
            "Training set grows with each fold — Fold 1: N months, Fold 2: N+1 months, ...",
            "Training window slides forward at fixed size — always trains on exactly N months",
        ],
        "Best for": [
            "Stable distributions — older data is still informative",
            "Drifting distributions — recent data matters more than old data",
        ],
        "Trade-off": [
            "More data per fold → lower variance, but may include stale patterns",
            "Fixed data volume → consistent training conditions, but discards early data",
        ],
    }
)

twiga_gt(
    GT(strategy_df)
    .tab_header(
        title=md("**Cross-Validation Window Strategies**"),
        subtitle="Expanding vs. Rolling — when to use which",
    )
    .cols_label(
        Strategy=md("**Strategy**"),
        Description=md("**Description**"),
        **{"Best for": md("**Best for**")},
        **{"Trade-off": md("**Trade-off**")},
    )
    .tab_source_note("Twiga Forecast"),
    n_rows=len(strategy_df),
)
from lets_plot import (
    aes,
    facet_wrap,
    geom_rect,
    ggplot,
    ggsize,
    labs,
    scale_fill_manual,
    scale_y_continuous,
)
import pandas as pd

from twiga.core.plot.theme import TWIGA_PALETTE, twiga_theme

_N_FOLDS = 5
_TRAIN_SIZE = 3  # relative time units
_TEST_SIZE = 1
_BAR_H = 0.55

_FILL = {"Train": TWIGA_PALETTE[0], "Test": TWIGA_PALETTE[1]}


def _make_cv_df(strategy: str) -> pd.DataFrame:
    rows = []
    for fold in range(_N_FOLDS):
        y = fold + 1
        t_start = 0 if strategy == "Expanding" else fold
        t_end = _TRAIN_SIZE + fold if strategy == "Expanding" else fold + _TRAIN_SIZE
        rows += [
            {
                "xmin": t_start,
                "xmax": t_end,
                "ymin": y - _BAR_H / 2,
                "ymax": y + _BAR_H / 2,
                "segment": "Train",
                "strategy": strategy,
            },
            {
                "xmin": t_end,
                "xmax": t_end + _TEST_SIZE,
                "ymin": y - _BAR_H / 2,
                "ymax": y + _BAR_H / 2,
                "segment": "Test",
                "strategy": strategy,
            },
        ]
    return pd.DataFrame(rows)


_df = pd.concat([_make_cv_df("Expanding"), _make_cv_df("Rolling")], ignore_index=True)
_df["strategy"] = _df["strategy"].map(
    {
        "Expanding": "Expanding window  (train grows each fold)",
        "Rolling": "Rolling window  (fixed train size, slides forward)",
    }
)

(
    ggplot(_df)
    + geom_rect(
        aes(xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax", fill="segment"),
        size=0,
        alpha=0.88,
    )
    + facet_wrap("strategy", ncol=1)
    + scale_fill_manual(values=_FILL)
    + scale_y_continuous(
        breaks=list(range(1, _N_FOLDS + 1)),
        labels=[f"Fold {i}" for i in range(1, _N_FOLDS + 1)],
    )
    + labs(x="Time (months)", y="", fill="", title="Cross-validation strategies")
    + twiga_theme(grid=False, legend_pos="top")
    + ggsize(980, 600)
)

3 ForecasterConfig: the split parameters#

All cross-validation behaviour is controlled by ForecasterConfig. The table below documents each parameter.

from great_tables import GT, md
import pandas as pd

from twiga.core.plot.gt import twiga_gt

params_df = pd.DataFrame(
    {
        "Parameter": ["split_freq", "train_size", "test_size", "window", "stride", "num_splits", "gap", "project_name"],
        "Type": ["str", "int", "int", "str", "int", "int", "int", "str"],
        "Meaning": [
            'Unit for all size parameters ("months", "days", "hours", …)',
            "How many split_freq units each training fold covers",
            "How many split_freq units each test fold covers",
            '"expanding" (train grows) or "rolling" (fixed window slides)',
            "How far the test window moves forward between folds",
            "Maximum number of folds to generate",
            "Unused periods between train end and test start (prevents leakage on slow-moving features)",
            "Identifier used for logging and artefact paths",
        ],
    }
)

twiga_gt(
    GT(params_df)
    .tab_header(
        title=md("**ForecasterConfig — Cross-Validation Parameters**"),
        subtitle="All temporal splits are controlled by these fields",
    )
    .cols_label(
        Parameter=md("**Parameter**"),
        Type=md("**Type**"),
        Meaning=md("**Meaning**"),
    )
    .tab_source_note("Twiga Forecast"),
    n_rows=len(params_df),
)
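The gap parameter deserves a concrete illustration. The helper below is a hypothetical sketch, not Twiga's implementation: with a gap of g, g periods between the end of training and the start of testing are discarded, so features built from trailing windows (a rolling mean, for example) cannot straddle the fold boundary.

```python
def fold_with_gap(train_end, test_size, gap):
    """Hypothetical helper: the test window starts `gap` periods after training ends."""
    test_start = train_end + gap
    return train_end, test_start, test_start + test_size


# Without a gap, a 2-period rolling-mean feature computed at the first
# test step would blend in the last training observations; gap=2 keeps
# the folds fully independent.
print(fold_with_gap(train_end=90, test_size=30, gap=0))  # (90, 90, 120)
print(fold_with_gap(train_end=90, test_size=30, gap=2))  # (90, 92, 122)
```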
from sklearn.preprocessing import RobustScaler, StandardScaler

from twiga.core.config import DataPipelineConfig, ForecasterConfig

data_config = DataPipelineConfig(
    target_feature="NetLoad(kW)",
    period="30min",
    latitude=32.371666,
    longitude=-16.274998,
    calendar_features=["hour", "day_night"],
    exogenous_features=["Ghi"],
    forecast_horizon=48,
    lookback_window_size=96,
    input_scaler=StandardScaler(),
    target_scaler=RobustScaler(),
)

train_config = ForecasterConfig(
    split_freq="months",
    train_size=3,
    test_size=1,
    window="expanding",
    stride=1,
    num_splits=3,
    project_name="backtesting-tutorial",
)

log.info("DataPipelineConfig:\n%s", data_config.model_dump(exclude_none=True))
log.info("ForecasterConfig:\n%s", train_config.model_dump(exclude_none=True))

4 Instantiate the forecaster#

from twiga import TwigaForecaster
from twiga.models.ml import LIGHTGBMConfig

forecaster = TwigaForecaster(
    data_params=data_config,
    model_params=[LIGHTGBMConfig()],
    train_params=train_config,
)

log.info("Forecaster ready.")
log.info("  window   : %s", forecaster.window)
log.info("  splits   : %s", forecaster.num_splits)
log.info("  train_sz : %s %s", forecaster.train_size, forecaster.split_freq)
log.info("  test_sz  : %s %s", forecaster.test_size, forecaster.split_freq)

5 Visualising the split scheme#

Before training a single model it is good practice to inspect the split layout. Twiga provides plot_split_scheme(), which returns a lets_plot chart. Calling it with data=data also initialises the internal split state, so the same scheme is reused during backtesting.

plot = forecaster.plot_split_scheme(data=data)
plot

The chart shows the train (blue) and test (orange) windows for each fold. Because we chose window="expanding", each blue bar is longer than the previous one - the model sees progressively more history as we move forward in time.

6 Running backtesting#

from IPython.display import clear_output

forecast_df, metrics_df = forecaster.backtesting(data=data)
clear_output()

log.info("forecast_df shape: %s", forecast_df.shape)
log.info("metrics_df shape : %s", metrics_df.shape)
log.info("metrics_df columns: %s", metrics_df.columns.tolist())
log.info("=== Forecast DataFrame (first 5 rows) ===\n%s", forecast_df.head().to_string())
log.info("=== Metrics DataFrame (first 5 rows) ===\n%s", metrics_df.head().to_string())

What the DataFrames contain#

backtesting() returns two DataFrames. The tables below document each column.

from great_tables import GT, md
import pandas as pd

from twiga.core.plot.gt import twiga_gt

forecast_schema = pd.DataFrame(
    {
        "Column": ["Model", "target", "forecast", "Actual", "Folds"],
        "Description": [
            "Model name in upper-case (e.g. LIGHTGBM)",
            "Target variable name from DataPipelineConfig",
            "Model point prediction",
            "Observed value from the test set",
            "Fold index (1-based) indicating which CV fold this row belongs to",
        ],
    }
)

twiga_gt(
    GT(forecast_schema)
    .tab_header(
        title=md("**`forecast_df` — one row per timestep per model**"),
        subtitle="Indexed by UTC timestamp",
    )
    .cols_label(
        Column=md("**Column**"),
        Description=md("**Description**"),
    )
    .tab_source_note("Twiga Forecast"),
    n_rows=len(forecast_schema),
)
metrics_schema = pd.DataFrame(
    {
        "Column": ["target", "Model", "mae", "rmse", "corr", "wmape", "smape", "nbias", "Folds"],
        "Description": [
            "Target variable name",
            "Model name in upper-case",
            "Mean absolute error (same units as target)",
            "Root-mean-squared error (penalises large spikes)",
            "Pearson correlation between forecast and actual",
            "Weighted mean absolute percentage error",
            "Symmetric mean absolute percentage error",
            "Normalised bias (positive = systematic over-forecasting)",
            "Fold index (1-based)",
        ],
    }
)

twiga_gt(
    GT(metrics_schema)
    .tab_header(
        title=md("**`metrics_df` — one row per day per model**"),
        subtitle="Indexed by daily timestamp",
    )
    .cols_label(
        Column=md("**Column**"),
        Description=md("**Description**"),
    )
    .tab_source_note("Twiga Forecast"),
    n_rows=len(metrics_schema),
)

7 Per-fold results#

fold_col = "Folds" if "Folds" in metrics_df.columns else ("fold" if "fold" in metrics_df.columns else None)

if fold_col is not None:
    fold_metrics = metrics_df.groupby(["Model", fold_col])[["mae", "rmse", "corr"]].mean().round(3)
    log.info("Per-fold metrics (MAE / RMSE / Corr):\n%s", fold_metrics.to_string())
else:
    log.info("No fold column found. Available columns: %s", metrics_df.columns.tolist())
agg_metrics = metrics_df.groupby("Model")[["mae", "rmse", "corr"]].mean().round(3)
log.info("Aggregate metrics (mean over all folds):\n%s", agg_metrics.to_string())

Interpreting per-fold variance#

Interpretation - Scan the per-fold MAE column. A stable spread (e.g., all folds within ±0.5 kW of the mean) indicates the model generalises well across seasons. A large spike in one fold (e.g., Fold 2 MAE = 3.9 vs. the 3.3 - 3.7 range for others) points to a specific time window where the model struggles - investigate what was different in that period (unusual weather, holiday effects, data quality issues).

High variance across folds (e.g., fold 1 MAE = 50 but fold 6 MAE = 120) indicates that model performance degrades as the dataset grows older or seasonal conditions change. If you observe this pattern consider:

  • Adding more seasonal calendar features

  • Retraining more frequently in production

  • Switching from expanding to rolling window to prevent stale data dominating
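One simple way to quantify "stable spread" is the coefficient of variation of per-fold MAE. The snippet below uses hypothetical metrics in the Model / Folds / mae shape documented earlier; the 0.15 threshold is an illustrative heuristic, not a Twiga default.

```python
import pandas as pd

# Hypothetical per-fold metrics (same shape as metrics_df aggregates)
metrics = pd.DataFrame({
    "Model": ["LIGHTGBM"] * 3 + ["XGBOOST"] * 3,
    "Folds": [1, 2, 3] * 2,
    "mae":   [3.4, 3.5, 3.6, 2.9, 5.8, 3.0],
})

stats = metrics.groupby("Model")["mae"].agg(["mean", "std"])
stats["cv"] = (stats["std"] / stats["mean"]).round(2)  # relative spread
stats["stable"] = stats["cv"] < 0.15                   # illustrative threshold
print(stats)
# LIGHTGBM is tightly clustered (cv ~ 0.03); XGBOOST's Fold-2 spike
# inflates its cv to ~ 0.42, flagging it for investigation.
```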

8 Reading the performance table#

twiga_report wraps Great Tables to produce a colour-highlighted summary. Teal cells mark the best value for each metric (lower is better for error metrics, higher is better for correlation).

from twiga.core.plot.gt import twiga_report

res = metrics_df.groupby("Model")[["mae", "corr", "nbias", "rmse", "wmape", "smape"]].mean().round(2).reset_index()

res = res.rename(
    columns={
        "mae": "MAE",
        "corr": "Corr",
        "wmape": "WMAPE",
        "smape": "SMAPE",
        "nbias": "NBIAS",
        "rmse": "RMSE",
    }
)

twiga_report(
    res,
    metrics=["MAE", "Corr", "SMAPE", "RMSE"],
    minimize_cols=["MAE", "SMAPE", "RMSE"],
    maximize_cols=["Corr"],
)

Interpretation - The teal-highlighted cells mark the best value per column. For a single-model backtesting run, all highlights will be in the LIGHTGBM row by definition. The value of this table grows when you add multiple models (see NB04): compare MAE, RMSE, and Corr together - a model that wins on MAE but loses on Corr may be fitting noise rather than signal. An RMSE significantly larger than MAE indicates sporadic large errors; investigate those forecast timesteps directly in forecast_df.
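The RMSE-vs-MAE diagnostic is easy to demonstrate on synthetic errors: two error profiles with identical MAE can have wildly different RMSE when one concentrates its error in a few large spikes.

```python
import numpy as np

steady = np.full(100, 2.0)   # every timestep off by 2 kW
spiky = np.zeros(100)
spiky[:2] = 100.0            # two 100 kW blowouts, perfect elsewhere

for name, err in [("steady", steady), ("spiky", spiky)]:
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    print(f"{name}: MAE={mae:.1f}  RMSE={rmse:.1f}")
# steady: MAE=2.0  RMSE=2.0
# spiky:  MAE=2.0  RMSE=14.1   <- same MAE; RMSE exposes the spikes
```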

9 Expanding vs. rolling (sliding) window#

A sliding (rolling) window keeps the training set at a fixed size. This means the model is always trained on the same amount of data, which can reduce bias when older data is no longer representative of current conditions.

Let us run the same experiment with window="rolling" and compare aggregate MAE.

from sklearn.preprocessing import RobustScaler, StandardScaler

from twiga import TwigaForecaster
from twiga.core.config import DataPipelineConfig, ForecasterConfig
from twiga.models.ml import LIGHTGBMConfig

sliding_config = ForecasterConfig(
    split_freq="months",
    train_size=3,
    test_size=1,
    window="rolling",  # fixed-size training window
    stride=1,
    num_splits=3,
    project_name="backtesting-sliding",
)

# Fresh data config (same parameters)
data_config_sliding = DataPipelineConfig(
    target_feature="NetLoad(kW)",
    period="30min",
    latitude=32.371666,
    longitude=-16.274998,
    calendar_features=["hour", "day_night"],
    exogenous_features=["Ghi"],
    forecast_horizon=48,
    lookback_window_size=96,
    input_scaler=StandardScaler(),
    target_scaler=RobustScaler(),
)

forecaster_sliding = TwigaForecaster(
    data_params=data_config_sliding,
    model_params=[LIGHTGBMConfig()],
    train_params=sliding_config,
)

# Visualise sliding scheme
plot_sliding = forecaster_sliding.plot_split_scheme(data=data)
plot_sliding
from IPython.display import clear_output

forecast_sliding_df, metrics_sliding_df = forecaster_sliding.backtesting(data=data)
clear_output()

mae_expanding = metrics_df.groupby("Model")["mae"].mean().round(2)
mae_sliding = metrics_sliding_df.groupby("Model")["mae"].mean().round(2)

comparison = pd.DataFrame(
    {
        "Expanding MAE": mae_expanding,
        "Sliding MAE": mae_sliding,
    }
)
comparison["Difference"] = (comparison["Sliding MAE"] - comparison["Expanding MAE"]).round(2)

log.info("Strategy comparison — aggregate MAE across 3 folds:\n%s", comparison.to_string())

Interpretation - A negative “Difference” (Sliding MAE < Expanding MAE) means rolling window outperforms expanding on this dataset, suggesting that recent data is more predictive than older history. A difference of < 0.1 kW is within noise - prefer the simpler expanding strategy unless you have a strong domain reason to believe the distribution is drifting. If the difference is large (> 0.5 kW), investigate when the regime changed by plotting fold-level MAE over time.
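The decision rule above can be written down as a small helper. This is an illustrative heuristic only - the 0.1 kW and 0.5 kW thresholds come from this tutorial's dataset and should be re-derived for yours.

```python
def pick_strategy(mae_expanding, mae_sliding, noise=0.1, strong=0.5):
    """Illustrative heuristic; thresholds are dataset-dependent (kW)."""
    diff = mae_sliding - mae_expanding
    if abs(diff) < noise:
        return "expanding (difference within noise - prefer the simpler strategy)"
    if diff <= -strong:
        return "rolling (large win - investigate when the regime changed)"
    return "rolling" if diff < 0 else "expanding"


print(pick_strategy(3.52, 3.48))  # within noise -> keep expanding
print(pick_strategy(4.10, 3.40))  # rolling wins by 0.7 kW
```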

Which strategy to choose?#

The right strategy depends on how your data distribution evolves over time.

from great_tables import GT, md
import pandas as pd

from twiga.core.plot.gt import twiga_gt

choice_df = pd.DataFrame(
    {
        "Situation": [
            "Distribution is stable over time",
            "Concept drift / seasonal regime changes",
            "Limited data available",
        ],
        "Recommended strategy": [
            "Expanding — more data per fold, lower variance",
            "Rolling — model stays focused on recent patterns",
            "Expanding — avoids throwing away useful early observations",
        ],
    }
)

twiga_gt(
    GT(choice_df)
    .tab_header(
        title=md("**Window Strategy Decision Guide**"),
        subtitle="Expanding vs. Rolling — choose based on your data",
    )
    .cols_label(
        Situation=md("**Situation**"),
        **{"Recommended strategy": md("**Recommended strategy**")},
    )
    .tab_source_note("Twiga Forecast"),
    n_rows=len(choice_df),
)

10 Plotting the forecast over time#

A visual sanity-check: overlay the model forecast against actuals for the first week of backtesting results (7 days × 48 half-hour steps = 336 rows per target).

p = plot_forecast(
    forecast_df.reset_index(),
    date_col="timestamp",
    actual_col="Actual",
    forecast_col="forecast",
    n_samples=7 * 48,
    title="Backtesting forecast — first 7 days",
    y_label="Net Load (kW)",
)
p  # noqa: B018
# Per-fold MAE bar chart
if fold_col is not None:
    fold_mae = metrics_df.groupby(["Model", fold_col])["mae"].mean().reset_index()
    fold_mae_renamed = fold_mae.rename(columns={"mae": "MAE", fold_col: "Fold"})
    fold_mae_renamed["Model"] = fold_mae_renamed["Model"] + " (Fold " + fold_mae_renamed["Fold"].astype(str) + ")"

    p = plot_metrics_bar(
        fold_mae_renamed,
        metric_col="MAE",
        model_col="Model",
        lower_is_better=True,
        title="MAE per fold — expanding window",
        x_label="MAE",
        horizontal=True,
    )
    p  # noqa: B018

Wrapping up#

What you did

  • Loaded the MLVS-PT dataset and understood why a single split is unreliable

  • Configured ForecasterConfig with expanding window, 3 folds, 1-month test periods

  • Visualised the CV split scheme with plot_split_scheme() before training

  • Ran forecaster.backtesting() and inspected both forecast_df and metrics_df

  • Analysed per-fold MAE variance and diagnosed fold-level instability

  • Compared expanding vs. rolling window strategies on aggregate MAE

  • Read the twiga_report output and interpreted teal-highlighted cells

Key takeaways

  1. A single train/test split is a point estimate - backtesting gives you a distribution of scores and reveals seasonal instability.

  2. Always visualise splits before training; plot_split_scheme() catches misconfigured train_size / test_size before you waste compute.

  3. Per-fold variance is as important as the mean - a model with low mean MAE but high variance is unreliable in production.

  4. Expanding window suits stable distributions; rolling window suits drifting ones - let the fold-level variance guide your choice.

  5. Save forecast_df and metrics_df to parquet after a long backtesting run so you can re-analyse without re-training.


What’s next?#

NB06 - Neural Network Models

Learn how to train deep learning architectures (MLPF, MLPGAM, MLPGAF, N-HiTS) using the same TwigaForecaster API, and compare NN models against the ML baselines you evaluated here.

If you want to jump ahead to automated hyperparameter search, see NB10 - Hyperparameter Tuning, which covers forecaster.tune(), Bayesian optimisation, and resumable SQLite studies.

# ruff: noqa: E501, E701, E702
from IPython.display import HTML

_TEAL = "#107591"
_TEAL_MID = "#069fac"
_TEAL_LIGHT = "#e8f5f8"
_TEAL_BEST = "#d0ecf1"
_TEXT_DARK = "#2d3748"
_TEXT_MUTED = "#718096"
_WHITE = "#ffffff"

steps = [
    {
        "num": "04",
        "title": "Time Series Differencing",
        "desc": "Stationarity · differencing · inversion",
        "tags": ["differencing", "stationarity"],
        "active": False,
    },
    {
        "num": "05",
        "title": "ML Point Forecasting",
        "desc": "CatBoost · XGBoost · LightGBM · model comparison",
        "tags": ["catboost", "xgboost", "lightgbm"],
        "active": False,
    },
    {
        "num": "06",
        "title": "Backtesting & Evaluation",
        "desc": "Rolling-window backtesting · fold-level metrics · strategy comparison",
        "tags": ["backtesting", "evaluation", "cross-validation"],
        "active": True,
    },
    {
        "num": "07",
        "title": "Neural Networks",
        "desc": "MLPF · N-HiTS · Lightning training loop",
        "tags": ["neural network", "pytorch", "lightning"],
        "active": False,
    },
    {
        "num": "08",
        "title": "Quantile Regression",
        "desc": "First probabilistic step — prediction intervals",
        "tags": ["probabilistic", "quantile", "intervals"],
        "active": False,
    },
]
track_name = "Point Forecasting Track"
footer = 'Next: explore <span style="color:#107591;font-weight:600;">neural networks</span> (07) or jump to <span style="color:#107591;font-weight:600;">probabilistic forecasting</span> (08–10).'


def _b(t, bg, fg):
    return f'<span style="display:inline-block;background:{bg};color:{fg};font-size:10px;font-weight:600;padding:2px 7px;border-radius:10px;margin:2px 2px 0 0;">{t}</span>'


ch = ""
for i, s in enumerate(steps):
    a = s["active"]
    cb = _TEAL if a else _WHITE
    cbo = _TEAL if a else "#d1ecf1"
    nb = _TEAL_MID if a else _TEAL_LIGHT
    nf = _WHITE if a else _TEAL
    tf = _WHITE if a else _TEXT_DARK
    df = "#cce8ef" if a else _TEXT_MUTED
    bb = "#0d5f75" if a else _TEAL_BEST
    bf = "#b8e4ed" if a else _TEAL
    yh = (
        f'<span style="float:right;background:{_TEAL_MID};color:{_WHITE};font-size:10px;font-weight:700;padding:2px 10px;border-radius:12px;">★ you are here</span>'
        if a
        else ""
    )
    bdg = "".join(_b(t, bb, bf) for t in s["tags"])
    ch += f'<div style="background:{cb};border:2px solid {cbo};border-radius:12px;padding:16px 20px;display:flex;align-items:flex-start;gap:16px;box-shadow:{"0 4px 14px rgba(16,117,145,.25)" if a else "0 1px 4px rgba(0,0,0,.06)"};"><div style="min-width:44px;height:44px;background:{nb};color:{nf};border-radius:50%;display:flex;align-items:center;justify-content:center;font-size:15px;font-weight:800;flex-shrink:0;">{s["num"]}</div><div style="flex:1;"><div style="font-size:15px;font-weight:700;color:{tf};margin-bottom:4px;">{s["title"]}{yh}</div><div style="font-size:12.5px;color:{df};margin-bottom:8px;line-height:1.5;">{s["desc"]}</div><div>{bdg}</div></div></div>'
    if i < len(steps) - 1:
        ch += f'<div style="display:flex;justify-content:center;height:32px;"><svg width="24" height="32" viewBox="0 0 24 32" fill="none"><line x1="12" y1="0" x2="12" y2="24" stroke="{_TEAL_MID}" stroke-width="2" stroke-dasharray="4 3"/><polygon points="6,20 18,20 12,30" fill="{_TEAL_MID}"/></svg></div>'

HTML(
    f'<div style="font-family:Inter,\'Segoe UI\',sans-serif;max-width:640px;margin:8px 0;"><div style="background:linear-gradient(135deg,{_TEAL} 0%,{_TEAL_MID} 100%);border-radius:12px 12px 0 0;padding:14px 20px;display:flex;align-items:center;gap:10px;"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="{_WHITE}" stroke-width="2"><path d="M12 2L2 7l10 5 10-5-10-5z"/><path d="M2 17l10 5 10-5"/><path d="M2 12l10 5 10-5"/></svg><span style="color:{_WHITE};font-size:14px;font-weight:700;">Twiga Learning Path — {track_name}</span></div><div style="border:2px solid {_TEAL_LIGHT};border-top:none;border-radius:0 0 12px 12px;padding:20px 20px 16px;background:#f9fdfe;display:flex;flex-direction:column;">{ch}<div style="margin-top:16px;font-size:11.5px;color:{_TEXT_MUTED};text-align:center;border-top:1px solid {_TEAL_LIGHT};padding-top:12px;">{footer}</div></div></div>'
)