Ensemble Strategies#
What you’ll build
A 6-model ensemble (LinearReg, LightGBM, XGBoost, CatBoost, MLPF, MLPGAM) evaluated with mean, median, and inverse-error weighted-mean strategies, compared on MAE and RMSE to identify when ensembling beats individual models.
Prerequisites
05 - ML Point Forecasting (multi-model setup)
07 - Neural Networks (MLPF, NHiTS)
11 - Hyperparameter Tuning (optional, for tuned component models)
Python: basic statistics
Learning objectives
By the end of this notebook you will be able to:
Explain bias-variance decomposition and why diverse models ensemble better than similar ones
Configure a multi-model TwigaForecaster with ML and NN components together
Apply mean, median, and weighted-mean ensemble strategies via ensemble_strategy
Derive the inverse-error weighting formula and implement it manually
Identify when an ensemble adds value and when it is not worth the inference cost
1. Motivation#
A single model can fail for many reasons: a poorly tuned learning rate, a bad random seed, a blind spot in its inductive bias. When several models make uncorrelated errors, averaging them cancels noise and reduces variance - a classic result from the bias-variance decomposition.
The key condition is diversity: if two models fail on exactly the same days, averaging does nothing. When a tree-based model and a neural network fail on different days, their average is better than either alone.
Rule of thumb: if the pairwise correlation of individual model errors is below ~0.9, an ensemble will likely improve over the best single model.
Key concept - ensemble learning
The bias-variance decomposition tells us that total prediction error = bias² + variance + irreducible noise. A single high-capacity model often has low bias but high variance (it overfits specific training patterns). Averaging multiple models keeps bias low while reducing variance - provided the models make different errors.
This is the “wisdom of crowds” effect: a crowd of diverse, independently reasoning individuals is collectively more accurate than any single expert, even if that expert is slightly better on average. In machine learning terms:
Similar models (same family, same features) share correlated errors → averaging them reduces variance only marginally.
Diverse models (different families, different inductive biases) have lower error correlation → averaging them achieves larger variance reduction.
Mixing tree-based models (which partition feature space with axis-aligned splits) with neural networks (which learn smooth, distributed representations) is one of the most reliable diversity strategies available.
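This variance-reduction effect is easy to demonstrate with synthetic data. The sketch below is not part of the tutorial pipeline — all numbers are simulated — and compares two five-model ensembles with identical per-model error variance but different pairwise error correlation:

```python
import numpy as np

rng = np.random.default_rng(42)
truth = np.zeros(10_000)  # the "actual" series; errors are what matter here


def simulate_ensemble_rmse(n_models: int, corr: float, rng) -> float:
    """RMSE of an equal-weight ensemble whose per-model errors have unit
    variance and pairwise correlation `corr` (shared + independent noise)."""
    shared = rng.normal(size=truth.shape) * np.sqrt(corr)
    errors = [
        shared + rng.normal(size=truth.shape) * np.sqrt(1 - corr)
        for _ in range(n_models)
    ]
    ensemble_pred = truth + np.mean(errors, axis=0)
    return float(np.sqrt(np.mean((ensemble_pred - truth) ** 2)))


rmse_correlated = simulate_ensemble_rmse(5, corr=0.9, rng=rng)
rmse_diverse = simulate_ensemble_rmse(5, corr=0.1, rng=rng)
print(f"5 correlated models (r≈0.9): RMSE {rmse_correlated:.3f}")
print(f"5 diverse models    (r≈0.1): RMSE {rmse_diverse:.3f}")
```

Averaging n unit-variance models with pairwise error correlation ρ leaves error variance ρ + (1 − ρ)/n, so the correlated ensemble barely improves on a single model while the diverse one roughly halves the RMSE.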
2. Setup#
import warnings
from great_tables import GT
from lets_plot import LetsPlot
import pandas as pd
LetsPlot.setup_html()
from twiga.core.plot import (
plot_forecast_grid,
plot_metrics_bar,
)
from twiga.core.utils import configure, get_logger
warnings.filterwarnings("ignore")
configure()
log = get_logger("tutorials")
Load data#
The dataset covers Madeira, Portugal (32.37°N, 16.27°W) at 30-minute resolution. We load only the columns we need.
data = pd.read_parquet("../data/MLVS-PT.parquet")
data = data[["timestamp", "NetLoad(kW)", "Ghi", "Temperature"]]
data["timestamp"] = pd.to_datetime(data["timestamp"])
data = data.drop_duplicates(subset="timestamp").reset_index(drop=True)
# Restrict to 2019-2020 to keep tutorial execution fast
data = data[(data["timestamp"] >= "2019-01-01") & (data["timestamp"] <= "2020-12-31")].reset_index(drop=True)
log.info("Shape: %s", data.shape)
GT(data.head())
Train / val / test splits#
from great_tables import GT, md
import pandas as pd
from twiga.core.plot.gt import twiga_gt
splits_df = pd.DataFrame(
{
"Split": ["Train", "Validation", "Test"],
"Period": ["before 2020-01-01", "2020-01-01 – 2020-06-30", "2020-07-01 onwards"],
"Purpose": [
"Model learning across CV folds",
"Weighted ensemble weight calibration",
"Final honest evaluation",
],
}
)
twiga_gt(
GT(splits_df)
.tab_header(
title=md("**Dataset Splits**"),
subtitle="Chronological — no shuffling, no overlap",
)
.cols_label(**{c: md(f"**{c}**") for c in splits_df.columns})
.tab_source_note("MLVS-PT dataset · Madeira, Portugal · 30-min resolution"),
n_rows=len(splits_df),
)
train_df = data[data["timestamp"] < "2020-01-01"].reset_index(drop=True)
val_df = data[(data["timestamp"] >= "2020-01-01") & (data["timestamp"] < "2020-07-01")].reset_index(drop=True)
test_df = data[data["timestamp"] >= "2020-07-01"].reset_index(drop=True)
log.info(
f"train : {train_df.shape[0]:,} rows ({train_df['timestamp'].min().date()} → {train_df['timestamp'].max().date()})"
)
log.info(f"val : {val_df.shape[0]:,} rows ({val_df['timestamp'].min().date()} → {val_df['timestamp'].max().date()})")
log.info(
f"test : {test_df.shape[0]:,} rows ({test_df['timestamp'].min().date()} → {test_df['timestamp'].max().date()})"
)
3. Building a multi-model forecaster#
Passing multiple model configs to TwigaForecaster trains all of them in a single fit() call. Here we combine four ML models (one linear, three tree-based) with two NN models.
Why mix ML and NN? Tree models (LightGBM, XGBoost, CatBoost) are strong on tabular feature interactions; NN models (MLPF, MLPGAM) learn sequential patterns and temporal structure. Their error patterns tend to be complementary - ideal for ensembling.
from IPython.display import clear_output
from twiga import TwigaForecaster
from twiga.models.ml import CATBOOSTConfig, LIGHTGBMConfig, LINEAREGConfig, XGBOOSTConfig
from twiga.models.nn import MLPFConfig, MLPGAMConfig
# data_config and train_config are assumed to be defined earlier (e.g. in the
# multi-model setup from notebook 05); they are not created in this section.
mlpf_config = MLPFConfig.from_data_config(data_config)
mlpf_config.max_epochs = 2
mlpf_config.rich_progress_bar = False
mlpgam_config = MLPGAMConfig.from_data_config(data_config)
mlpgam_config.max_epochs = 2
mlpgam_config.rich_progress_bar = False
forecaster = TwigaForecaster(
data_params=data_config,
model_params=[
LINEAREGConfig(),
LIGHTGBMConfig(),
XGBOOSTConfig(device="cpu"),
CATBOOSTConfig(task_type="CPU"),
mlpf_config,
mlpgam_config,
],
train_params=train_config,
)
forecaster.fit(train_df=train_df, val_df=val_df)
clear_output()
log.info("All models trained.")
4. Strategy 1: Mean#
The simplest strategy: average predictions from all models equally. Works well when models are roughly equally accurate and errors are independent.
pred_mean, metric_mean = forecaster.evaluate_point_forecast(test_df=test_df, ensemble_strategy="mean")
clear_output()
agg_mean = metric_mean.groupby("Model")[["mae", "rmse", "corr"]].mean().round(3)
log.info("Mean ensemble — metrics by model:")
log.info("\n%s", agg_mean.to_string())
5. Strategy 2: Median#
The median is robust to outlier models. If one model produces a wildly inaccurate forecast on a given day, the median ignores it - unlike the mean, which is pulled toward extreme values.
Use median when you suspect that one or more models may occasionally produce large errors. Use mean when all models are well-calibrated and you want to exploit all available information.
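The difference is easy to see with a toy example (the prediction values below are made up for illustration):

```python
import numpy as np

# Five hypothetical model predictions for one timestep where the
# actual value is 100 kW; the last model spikes badly.
preds = np.array([98.0, 101.0, 99.5, 102.0, 250.0])

print(f"mean   : {preds.mean():.1f} kW")      # → 130.1, pulled toward the outlier
print(f"median : {np.median(preds):.1f} kW")  # → 101.0, ignores the outlier
```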
pred_median, metric_median = forecaster.evaluate_point_forecast(test_df=test_df, ensemble_strategy="median")
clear_output()
agg_median = metric_median.groupby("Model")[["mae", "rmse", "corr"]].mean().round(3)
log.info("Median ensemble — metrics by model:")
log.info("\n%s", agg_median.to_string())
6. Strategy 3: Weighted#
In the weighted ensemble, each model’s contribution is scaled by its validation performance. Models that achieved lower validation error receive a higher weight - in effect, the ensemble automatically up-weights the best models.
Weights are computed internally from the validation data passed to fit() and are inversely proportional to each model’s validation MAE. You do not need to specify weights manually.
Key concept - weighted ensemble
In the weighted strategy, each model’s contribution is scaled inversely to its validation MAE: a model with half the MAE of another gets twice the weight. Concretely, if model i has validation MAE = eᵢ, its weight is:
wᵢ = (1 / eᵢ) / Σⱼ (1 / eⱼ)

When to prefer weighted over simple mean:
When models differ significantly in quality (one model is clearly better than others)
When you want the ensemble to adapt automatically without manual tuning
When to prefer mean:
When all models have similar validation errors - weighting adds noise without benefit
When the validation set is small and MAE estimates are noisy (weighting amplifies noise)
Important caveat: the weights are derived from the validation set passed to
fit(). If the validation period is not representative of the test period, the weights may be miscalibrated. Always verify that the weighted ensemble outperforms the mean on a hold-out test set before committing to it in production.
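To make the formula concrete, here is a hand-rolled sketch of the weight computation. The model names match this notebook, but the MAE values are invented for illustration — Twiga computes the real weights internally from the validation data.

```python
# Illustrative validation MAEs per model (kW); these numbers are made up.
val_mae = {
    "LINEAREG": 52.0,
    "LIGHTGBM": 31.0,
    "XGBOOST": 33.0,
    "CATBOOST": 30.0,
    "MLPF": 45.0,
    "MLPGAM": 48.0,
}

# wᵢ = (1 / eᵢ) / Σⱼ (1 / eⱼ): invert each MAE, then normalise to sum to 1.
inv = {m: 1.0 / e for m, e in val_mae.items()}
total = sum(inv.values())
weights = {m: v / total for m, v in inv.items()}

for m, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{m:<9} w = {w:.3f}")
```

The weighted forecast at each timestep is then Σᵢ wᵢ · forecastᵢ; note that the lowest-MAE model (CatBoost here) gets the largest weight.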
pred_weighted, metric_weighted = forecaster.evaluate_point_forecast(test_df=test_df, ensemble_strategy="weighted")
clear_output()
agg_weighted = metric_weighted.groupby("Model")[["mae", "rmse", "corr"]].mean().round(3)
log.info("Weighted ensemble — metrics by model:")
log.info("\n%s", agg_weighted.to_string())
7. Per-model breakdown#
The metrics DataFrame contains a row for every model - including the ENSEMBLE row. Let’s isolate individual model performance and compare it directly to the ensemble.
# Find the ensemble label (it is stored as the uppercased string "ENSEMBLE")
log.info("Unique Model labels in metric_mean: %s", metric_mean["Model"].unique().tolist())
ensemble_label = "ENSEMBLE"
individual = (
metric_mean[metric_mean["Model"] != ensemble_label].groupby("Model")[["mae", "rmse", "corr"]].mean().round(3)
)
ensemble_row = metric_mean[metric_mean["Model"] == ensemble_label][["mae", "rmse", "corr"]].mean().round(3)
log.info("Individual models (mean across folds):")
log.info("\n%s", individual.to_string())
log.info("\nEnsemble — Mean strategy:")
log.info("\n%s", ensemble_row.to_frame().T.to_string(index=False))
Bar chart: individual models vs ensemble#
bar_data = individual["mae"].copy()
bar_data[ensemble_label] = ensemble_row["mae"]
bar_df = bar_data.sort_values().reset_index()
bar_df.columns = ["Model", "MAE"]
p = plot_metrics_bar(
bar_df,
metric_col="MAE",
model_col="Model",
lower_is_better=True,
title="Individual models vs Ensemble (Mean strategy) — MAE (lower is better)",
x_label="MAE (mean across folds)",
horizontal=True,
)
p
8. Strategy comparison table#
Aggregate the ensemble row from each strategy into a single comparison table.
from twiga.core.plot.gt import twiga_report
def ensemble_summary(metric_df, strategy_label):
row = metric_df[metric_df["Model"] == ensemble_label][["mae", "rmse", "corr", "wmape", "smape"]].mean().round(3)
return {
"Strategy": strategy_label,
"MAE": row["mae"],
"RMSE": row["rmse"],
"Corr": row["corr"],
"WMAPE": row["wmape"],
"SMAPE": row["smape"],
}
comparison = pd.DataFrame(
[
ensemble_summary(metric_mean, "Mean"),
ensemble_summary(metric_median, "Median"),
ensemble_summary(metric_weighted, "Weighted"),
]
)
log.info("Ensemble strategy comparison:")
log.info("\n%s", comparison.to_string(index=False))
twiga_report(
comparison.rename(columns={"Strategy": "Model"}),
["MAE", "RMSE", "Corr", "SMAPE"],
minimize_cols=["MAE", "RMSE", "SMAPE"],
maximize_cols=["Corr"],
)
9. Forecast plot: ensemble vs best individual model#
Visual comparison over the first 7 days of the test set.
n_steps = 7 * 48 # 7 days at 30-min resolution
# Best individual model by MAE
best_model = individual["mae"].idxmin()
df_ensemble = pred_mean[pred_mean["Model"] == ensemble_label].iloc[:n_steps].copy()
df_ensemble["Model"] = "ENSEMBLE"
df_best = pred_mean[pred_mean["Model"] == best_model].iloc[:n_steps].copy()
df_best["Model"] = f"Best: {best_model}"
combined_plot = pd.concat([df_ensemble, df_best], ignore_index=True)
p = plot_forecast_grid(
combined_plot,
actual_col="Actual",
forecast_col="forecast",
model_col="Model",
n_samples_per_model=n_steps,
title="Ensemble vs best individual model — first 7 days of test set",
y_label="Net Load (kW)",
)
p
10. When ensembles help vs don’t#
Ensembles help when:
Models have diverse error patterns (e.g. ML + NN combined)
Individual models are accurate but not on the same days
You want robustness without hyperparameter tuning a single model
Ensembles don’t help much when:
All models overfit in the same way (same family, same features)
All models are underfitting - averaging underfit models does not help
Models make identical errors (pairwise error correlation > 0.9)
Quick error correlation check#
Pivot per-model errors and compute pairwise correlation. High correlation (> 0.9) between two models means combining them adds little.
# Compute per-model absolute errors relative to actuals
indiv_preds = pred_mean[pred_mean["Model"] != ensemble_label].copy()
indiv_preds["error"] = (indiv_preds["forecast"] - indiv_preds["Actual"]).abs()
# Reset index to get timestamp as column, then pivot
indiv_reset = indiv_preds.reset_index()
ts_col = indiv_reset.columns[0] # first column is the timestamp index
pivot = indiv_reset.pivot_table(index=ts_col, columns="Model", values="error")
log.info("Pairwise error correlation (individual models):")
log.info("\n%s", pivot.corr().round(2).to_string())
Error correlation heatmap#
from twiga.core.plot import correlation_heatmap
corr_matrix = pivot.corr()
p = correlation_heatmap(
corr_matrix,
title="Pairwise absolute error correlation\n(lower = more diverse = more ensemble benefit)",
)
p
11. Mixing ML and NN#
Tree-based ML models and NN models have complementary inductive biases - a key reason this combination produces lower-correlated errors than within-family ensembles.
from great_tables import GT, md
import pandas as pd
from twiga.core.plot.gt import twiga_gt
model_family_df = pd.DataFrame(
{
"Model family": ["LightGBM / XGBoost / CatBoost", "MLPF / MLPGAM", "Linear Regression"],
"Inductive bias": [
"Axis-aligned tree splits, tabular feature interactions",
"Smooth distributed representations, sequential temporal patterns",
"Global linear relationship, interpretable coefficients",
],
"Strength": [
"Non-monotonic splits, fast training, handles missing values natively",
"Learns temporal structure and latent embeddings across horizons",
"Fast, interpretable baseline; regularises ensemble toward linear fit",
],
"Diversity vs. others": ["High (vs NN)", "High (vs ML)", "Medium"],
}
)
twiga_gt(
GT(model_family_df)
.tab_header(
title=md("**ML vs NN — Inductive Bias & Ensemble Diversity**"),
subtitle="Different learning mechanisms → lower error correlation → more ensemble benefit",
)
.cols_label(**{c: md(f"**{c}**") for c in model_family_df.columns})
.tab_source_note("With max_epochs=2, NN models may underfit; increase to 50–100 for production"),
n_rows=len(model_family_df),
)
12. Strategy comparison: twiga_gt summary#
from great_tables import GT, md
import pandas as pd
from twiga.core.plot.gt import twiga_gt
strategy_guide_df = pd.DataFrame(
{
"Strategy": ["`mean`", "`median`", "`weighted`"],
"Behaviour": [
"Equal-weight average of all model predictions",
"Median across models — robust to outlier predictions",
"Validation-MAE-proportional weights; better models contribute more",
],
"Use when": [
"All models are well-calibrated and errors are roughly independent",
"One or more models may occasionally spike — robustness is more important than optimality",
"Models differ significantly in quality and validation set is representative",
],
"Pitfall": [
"One bad model drags the average down",
"Ignores all available quality information",
"Miscalibrated if validation period is unrepresentative of test",
],
}
)
twiga_gt(
GT(strategy_guide_df)
.tab_header(
title=md("**Ensemble Strategy Selection Guide**"),
subtitle="Choose based on model quality spread and robustness requirements",
)
.cols_label(**{c: md(f"**{c}**") for c in strategy_guide_df.columns})
.tab_source_note("Twiga Forecast · evaluate_point_forecast(ensemble_strategy=...)"),
n_rows=len(strategy_guide_df),
)
Wrapping up#
What you did
Registered 6 models (LinearReg, LightGBM, XGBoost, CatBoost, MLPF, MLPGAM) in one TwigaForecaster
Applied mean, median, and weighted ensemble strategies in a single evaluate_point_forecast() call
Compared ensemble vs. best individual model with a styled metric table and bar chart
Visualised forecast plots to see where the ensemble outperforms individual models
Diagnosed ensemble benefit by computing pairwise error correlations and a heatmap
Reviewed inductive-bias differences between ML and NN model families
Key takeaways
Diversity beats quantity - two uncorrelated models ensemble better than five correlated ones. Check pairwise error correlation before adding more models.
Mixing ML and NN is the most reliable diversity strategy: tree splits vs. smooth learned representations produce complementary errors.
Mean is the safest default; switch to weighted only when models differ substantially in quality and your validation set is representative.
Median is preferable when you suspect occasional catastrophic failures in one model - it ignores extreme values.
If pairwise error correlation is > 0.9 for all pairs, tuning a single model (NB10) will outperform any ensemble strategy.
What’s next?#
Notebook 12 shows how to register a completely custom model class - including any sklearn-compatible estimator - in the Twiga model registry, giving you full control over the model architecture while keeping the rest of the pipeline unchanged.
# ruff: noqa: E501, E701, E702
from IPython.display import HTML
_TEAL = "#107591"
_TEAL_MID = "#069fac"
_TEAL_LIGHT = "#e8f5f8"
_TEAL_BEST = "#d0ecf1"
_TEXT_DARK = "#2d3748"
_TEXT_MUTED = "#718096"
_WHITE = "#ffffff"
steps = [
{
"num": "06",
"title": "Backtesting & Evaluation",
"desc": "Rolling-window backtesting · fold-level metrics",
"tags": ["backtesting", "evaluation"],
"active": False,
},
{
"num": "10",
"title": "Conformal Prediction",
"desc": "CQR · CRC — coverage-guaranteed intervals",
"tags": ["conformal", "CQR"],
"active": False,
},
{
"num": "11",
"title": "Hyperparameter Tuning",
"desc": "Optuna TPE · search spaces · resumable SQLite",
"tags": ["optuna", "HPO", "tuning"],
"active": False,
},
{
"num": "12",
"title": "Ensemble Strategies",
"desc": "Mean · median · inverse-error weighted — reduce forecast variance",
"tags": ["ensemble", "weighted", "aggregation"],
"active": True,
},
{
"num": "13",
"title": "Custom Models",
"desc": "Register your own model class in the Twiga registry",
"tags": ["custom", "registry", "sklearn"],
"active": False,
},
]
track_name = "Advanced Track"
footer = 'Next: extend the library further with <span style="color:#107591;font-weight:600;">Custom Models</span> (13) or add explainability with <span style="color:#107591;font-weight:600;">SHAP Attribution</span> (16).'
def _b(t, bg, fg):
return f'<span style="display:inline-block;background:{bg};color:{fg};font-size:10px;font-weight:600;padding:2px 7px;border-radius:10px;margin:2px 2px 0 0;">{t}</span>'
ch = ""
for i, s in enumerate(steps):
a = s["active"]
cb = _TEAL if a else _WHITE
cbo = _TEAL if a else "#d1ecf1"
nb = _TEAL_MID if a else _TEAL_LIGHT
nf = _WHITE if a else _TEAL
tf = _WHITE if a else _TEXT_DARK
df = "#cce8ef" if a else _TEXT_MUTED
bb = "#0d5f75" if a else _TEAL_BEST
bf = "#b8e4ed" if a else _TEAL
yh = (
f'<span style="float:right;background:{_TEAL_MID};color:{_WHITE};font-size:10px;font-weight:700;padding:2px 10px;border-radius:12px;">★ you are here</span>'
if a
else ""
)
bdg = "".join(_b(t, bb, bf) for t in s["tags"])
ch += f'<div style="background:{cb};border:2px solid {cbo};border-radius:12px;padding:16px 20px;display:flex;align-items:flex-start;gap:16px;box-shadow:{"0 4px 14px rgba(16,117,145,.25)" if a else "0 1px 4px rgba(0,0,0,.06)"};"><div style="min-width:44px;height:44px;background:{nb};color:{nf};border-radius:50%;display:flex;align-items:center;justify-content:center;font-size:15px;font-weight:800;flex-shrink:0;">{s["num"]}</div><div style="flex:1;"><div style="font-size:15px;font-weight:700;color:{tf};margin-bottom:4px;">{s["title"]}{yh}</div><div style="font-size:12.5px;color:{df};margin-bottom:8px;line-height:1.5;">{s["desc"]}</div><div>{bdg}</div></div></div>'
if i < len(steps) - 1:
ch += f'<div style="display:flex;justify-content:center;height:32px;"><svg width="24" height="32" viewBox="0 0 24 32" fill="none"><line x1="12" y1="0" x2="12" y2="24" stroke="{_TEAL_MID}" stroke-width="2" stroke-dasharray="4 3"/><polygon points="6,20 18,20 12,30" fill="{_TEAL_MID}"/></svg></div>'
HTML(
f'<div style="font-family:Inter,\'Segoe UI\',sans-serif;max-width:640px;margin:8px 0;"><div style="background:linear-gradient(135deg,{_TEAL} 0%,{_TEAL_MID} 100%);border-radius:12px 12px 0 0;padding:14px 20px;display:flex;align-items:center;gap:10px;"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="{_WHITE}" stroke-width="2"><path d="M12 2L2 7l10 5 10-5-10-5z"/><path d="M2 17l10 5 10-5"/><path d="M2 12l10 5 10-5"/></svg><span style="color:{_WHITE};font-size:14px;font-weight:700;">Twiga Learning Path — {track_name}</span></div><div style="border:2px solid {_TEAL_LIGHT};border-top:none;border-radius:0 0 12px 12px;padding:20px 20px 16px;background:#f9fdfe;display:flex;flex-direction:column;">{ch}<div style="margin-top:16px;font-size:11.5px;color:{_TEXT_MUTED};text-align:center;border-top:1px solid {_TEAL_LIGHT};padding-top:12px;">{footer}</div></div></div>'
)