Ensemble Strategies#
Source Files
twiga/forecaster/ensemble.py - EnsembleStrategy, compute_ensemble_predictions()
When you register multiple models with TwigaForecaster, you can combine their predictions using ensemble strategies. This page explains how each strategy works, when to use it, and how to apply it.
Overview#
An ensemble combines predictions from multiple models into a single blended forecast. Twiga supports three strategies:
Mean - Average of all model predictions
Median - Middle value across models
Weighted - Custom weighted sum
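Each strategy reduces to a simple element-wise operation across models. The sketch below illustrates this in plain Python. It is not Twiga's implementation: the helper name `blend` and the dictionary layout are assumptions made for the example.

```python
from statistics import mean, median

def blend(predictions, strategy="mean", weights=None):
    """Blend per-model forecasts element-wise (illustrative helper, not part of Twiga).

    predictions: dict mapping model name -> list of forecasts (equal lengths).
    """
    names = list(predictions)
    # One tuple of model forecasts per time step.
    steps = zip(*(predictions[n] for n in names))
    if strategy == "mean":
        return [mean(step) for step in steps]
    if strategy == "median":
        return [median(step) for step in steps]
    if strategy == "weighted":
        if weights is None:
            raise ValueError("weighted strategy needs weights")
        return [sum(weights[n] * v for n, v in zip(names, step)) for step in steps]
    raise ValueError(f"unknown strategy: {strategy}")

# Made-up forecasts for two time steps.
preds = {
    "xgboost": [1234.0, 1300.0],
    "lightgbm": [1245.0, 1310.0],
    "catboost": [1240.0, 1290.0],
}
mean_blend = blend(preds, "mean")
median_blend = blend(preds, "median")
weighted_blend = blend(
    preds, "weighted", {"xgboost": 0.5, "lightgbm": 0.3, "catboost": 0.2}
)
```

For the first time step the three blends come out at roughly 1239.67 (mean), 1240.0 (median), and 1238.5 (weighted).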
Using Ensembles#
Pass ensemble_strategy to any prediction or evaluation method:
from twiga.forecaster.core import TwigaForecaster

forecaster = TwigaForecaster(
    data_params=data_config,
    model_params=[model_1, model_2, model_3],  # 3 models
)
forecaster.fit(train_df)

# Get ensemble predictions alongside individual model predictions
results, metrics = forecaster.evaluate_point_forecast(
    test_df,
    ensemble_strategy="mean",
)

# Results now include an "Ensemble" row in addition to individual models
print(results)
# timestamp  | target  | model    | forecast | actual
# ...
# 2024-01-01 | load_mw | xgboost  | 1234     | 1250
# 2024-01-01 | load_mw | lightgbm | 1245     | 1250
# 2024-01-01 | load_mw | catboost | 1240     | 1250
# 2024-01-01 | load_mw | Ensemble | 1240     | 1250  <- mean of the three (1239.67, rounded)
Available in these methods:
predict(), predict_interval(), forecast()
evaluate_point_forecast(), evaluate_interval_forecast(), evaluate_quantile_forecast()
evaluate(), calibrate(), backtesting()
Mean Ensemble#
Computes the arithmetic mean of all model predictions.
Formula#
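\[\hat{y}_{\text{ensemble}} = \frac{1}{M} \sum_{m=1}^{M} \hat{y}_m\]
Where \(M\) is the number of models and \(\hat{y}_m\) is the prediction of model \(m\).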
When to Use#
All models are roughly equally accurate
You want a simple, interpretable blend
You expect random errors to cancel out
Example#
results, metrics = forecaster.evaluate_point_forecast(
    test_df,
    ensemble_strategy="mean",
)
Advantages#
No hyperparameter tuning needed
Interpretable: each model contributes equally
Reduces variance if models make independent errors
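The variance claim can be checked with a small simulation. The sketch assumes each model's error is an independent Gaussian draw with the same standard deviation. Real models trained on the same data rarely satisfy this, and correlated errors shrink the gain.

```python
import random
from statistics import mean, pstdev

random.seed(0)
n_models, n_samples, sigma = 4, 20000, 10.0

# Errors of a single model: independent draws from N(0, sigma^2).
single_errors = [random.gauss(0.0, sigma) for _ in range(n_samples)]

# Errors of a mean ensemble: average of n_models independent draws per sample.
ensemble_errors = [
    mean(random.gauss(0.0, sigma) for _ in range(n_models))
    for _ in range(n_samples)
]

single_std = pstdev(single_errors)
ensemble_std = pstdev(ensemble_errors)

# Theory for independent errors: std shrinks by sqrt(n_models), here a factor of 2.
ratio = single_std / ensemble_std
print(f"single: {single_std:.2f}, ensemble: {ensemble_std:.2f}, ratio: {ratio:.2f}")
```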
Disadvantages#
Equally weights good and bad models
Can be skewed by a single outlier model
No adaptation per sample or time period
Median Ensemble#
Computes the median value across all model predictions.
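Formula#
\[\hat{y}_{\text{ensemble}} = \operatorname{median}(\hat{y}_1, \ldots, \hat{y}_M)\]
The median is taken element-wise across the \(M\) model predictions.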
When to Use#
Models have outlier predictions that skew the mean
You want robustness to one or two bad models
Forecast error distributions are heavy-tailed
Example#
results, metrics = forecaster.evaluate_point_forecast(
    test_df,
    ensemble_strategy="median",
)
Advantages#
Robust to outlier predictions from one model
No hyperparameter tuning
Works well with odd numbers of models (3, 5, 7)
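A quick numeric check shows the robustness and the odd/even distinction. The forecasts below are made up. With an even number of models, Python's `statistics.median` averages the two middle values; whether Twiga handles even counts the same way is an assumption here.

```python
from statistics import mean, median

# Three models agree; a fourth (hypothetical) model produces a wild forecast.
forecasts = [1234.0, 1245.0, 1240.0]
with_outlier = forecasts + [5000.0]

mean_clean, median_clean = mean(forecasts), median(forecasts)
mean_bad, median_bad = mean(with_outlier), median(with_outlier)

# The mean jumps from about 1239.67 to 2179.75; the median only moves
# from 1240.0 to 1242.5 (the average of the two middle values, 1240 and 1245).
print(mean_clean, median_clean, mean_bad, median_bad)
```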
Disadvantages#
Less interpretable than mean
Can be less accurate than the mean when all models are good, since it uses only the middle value(s)
Ignores magnitude of disagreement between models
Weighted Ensemble#
Assigns custom weights to each model and computes the weighted sum.
Formula#
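\[\hat{y}_{\text{ensemble}} = \sum_{m=1}^{M} w_m \hat{y}_m\]
Where \(w_m\) is the weight of model \(m\) and \(\sum_{m=1}^{M} w_m = 1.0\) (weights must sum to 1).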
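Since the weights must sum to 1, it can help to validate them before passing them on. The helpers below are a sketch, not part of Twiga: `check_weights` and `normalize` are hypothetical names, and the non-negativity check is an added assumption rather than a documented requirement.

```python
import math

def check_weights(weights, model_names, tol=1e-9):
    """Validate ensemble weights before use (illustrative helper, not part of Twiga)."""
    missing = set(model_names) - set(weights)
    extra = set(weights) - set(model_names)
    if missing or extra:
        raise ValueError(
            f"weight keys mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    # Assumption: negative weights are treated as an error.
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return weights

def normalize(raw):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

# Raw scores 5:3:2 become weights 0.5, 0.3, 0.2.
weights = check_weights(
    normalize({"xgboost": 5, "lightgbm": 3, "catboost": 2}),
    ["xgboost", "lightgbm", "catboost"],
)
```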
When to Use#
Models have different accuracy levels
You have domain knowledge about relative reliability
You’ve tuned weights via backtesting or validation
Example#
ensemble_weights = {
    "xgboost": 0.5,   # 50%
    "lightgbm": 0.3,  # 30%
    "catboost": 0.2,  # 20%
}
results, metrics = forecaster.evaluate_point_forecast(
    test_df,
    ensemble_strategy="weighted",
    ensemble_weights=ensemble_weights,
)
Weight Selection#
Option 1: Equal confidence. If all models are similarly accurate, use equal weights (this is equivalent to the mean ensemble):
model_names = ["xgboost", "lightgbm", "catboost"]
weights = {name: 1 / len(model_names) for name in model_names}
Option 2: Inverse MAE. Weight models inversely by their validation error:
# After evaluating on validation set, compute MAE per model
mae_dict = {"xgboost": 50, "lightgbm": 60, "catboost": 55}
inv_mae = {m: 1/e for m, e in mae_dict.items()}
total = sum(inv_mae.values())
weights = {m: w/total for m, w in inv_mae.items()}
# Result: {xgboost: 0.365, lightgbm: 0.304, catboost: 0.331}
Option 3: Ranking. Rank models by validation error and weight each by the inverse of its rank, so the best model (rank 1) gets the largest weight:
# xgboost is best (rank 1), lightgbm second (rank 2), catboost third (rank 3)
ranks = {"xgboost": 1, "lightgbm": 2, "catboost": 3}
inv_rank = {m: 1/r for m, r in ranks.items()}
total = sum(inv_rank.values())
weights = {m: w/total for m, w in inv_rank.items()}
# Result: {xgboost: 0.545, lightgbm: 0.273, catboost: 0.182}
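Options 2 and 3 can be wrapped in small reusable helpers. This is a sketch with hypothetical function names, not part of Twiga. Here the ranks are derived from the MAE values themselves, so with the MAEs from Option 2 catboost (55) ranks ahead of lightgbm (60). Ties are broken by dictionary order, which may not be what you want.

```python
def inverse_error_weights(errors):
    """Weights proportional to 1/error, normalized to sum to 1."""
    inv = {m: 1.0 / e for m, e in errors.items()}
    total = sum(inv.values())
    return {m: v / total for m, v in inv.items()}

def rank_weights(errors):
    """Weights proportional to 1/rank, where rank 1 is the lowest error."""
    ordered = sorted(errors, key=errors.get)
    ranks = {m: i for i, m in enumerate(ordered, start=1)}
    # Same arithmetic as above, applied to ranks instead of errors.
    return inverse_error_weights(ranks)

mae = {"xgboost": 50, "lightgbm": 60, "catboost": 55}
w_inv = inverse_error_weights(mae)   # approx. xgboost 0.365, lightgbm 0.304, catboost 0.331
w_rank = rank_weights(mae)           # approx. xgboost 0.545, catboost 0.273, lightgbm 0.182
```

Inverse-error weights stay close to equal when the errors are similar, while rank weights separate the models sharply regardless of how small the gap is.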
Advantages#
Leverages accuracy differences between models
Can improve accuracy noticeably when model quality differs substantially
Allows domain knowledge to influence the blend
Disadvantages#
Requires weight tuning (backtesting or validation)
Weights that overfit on one period may not transfer
More complex than mean/median
Comparing Strategies#
| Strategy | Computation | Robustness | Tuning | Best For |
|---|---|---|---|---|
| Mean | Sum all, divide by M | Medium | None | Baseline; equally good models |
| Median | Sort, take middle | High | None | Heavy outliers; robust blend |
| Weighted | w1·m1 + w2·m2 + … | Medium | Weight selection | Heterogeneous accuracy |
Working with Different Forecast Kinds#
Ensemble strategies work consistently across forecast kinds:
Point Forecasts#
results, metrics = forecaster.evaluate_point_forecast(
    test_df,
    ensemble_strategy="weighted",
    ensemble_weights=weights,
)
# Combines loc (point) predictions
Interval Forecasts#
results, metrics = forecaster.evaluate_interval_forecast(
    test_df,
    ensemble_strategy="mean",
)
# Combines lower, point, and upper independently
# Result: [mean(lower), mean(point), mean(upper)]
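Combining each component independently can be pictured as follows. The numbers and the tuple layout `(lower, point, upper)` are made up for illustration and do not reflect Twiga's internal data structures.

```python
from statistics import mean

# Hypothetical per-model intervals for one time step: (lower, point, upper).
intervals = {
    "xgboost": (1200.0, 1234.0, 1270.0),
    "lightgbm": (1210.0, 1245.0, 1285.0),
    "catboost": (1190.0, 1240.0, 1280.0),
}

# Average each component across models, independently of the other two.
lower, point, upper = (mean(component) for component in zip(*intervals.values()))

# Averaging preserves ordering: if lower <= point <= upper holds for every
# model, it also holds for the component-wise means.
assert lower <= point <= upper
print(lower, point, upper)
```

Preserved ordering does not imply preserved coverage: a blended interval is not guaranteed to keep the nominal coverage level of its members, so it is worth checking on held-out data.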
Quantile Forecasts#
results, metrics = forecaster.evaluate_quantile_forecast(
    test_df,
    ensemble_strategy="median",
)
# Combines quantile predictions per quantile level
Parametric (Mean + Scale)#
results, metrics = forecaster.evaluate_parametric_forecast(
    test_df,
    ensemble_strategy="weighted",
    ensemble_weights=weights,
)
# Combines loc and scale independently
Validating Ensemble Quality#
After creating an ensemble, check on held-out data that it actually improves on its member models:
# On a validation or test set, compare individual models to the ensemble
results_df, metrics_df = forecaster.evaluate_point_forecast(
    test_df,
    ensemble_strategy="weighted",
    ensemble_weights=weights,
)
# metrics_df has one row per model plus one for "Ensemble"
print(metrics_df)
#       model   mae  rmse    r2
# 0   xgboost  45.2  58.3  0.92
# 1  lightgbm  48.1  61.5  0.91
# 2  catboost  46.5  59.8  0.92
# 3  Ensemble  43.1  55.2  0.93  <- ideally lower error than every individual model
# Check if ensemble is better than best individual
best_model_mae = metrics_df[metrics_df["model"] != "Ensemble"]["mae"].min()
ensemble_mae = metrics_df[metrics_df["model"] == "Ensemble"]["mae"].values[0]
print(f"Ensemble MAE: {ensemble_mae}, Best model: {best_model_mae}")
Tuning Weights via Backtesting#
Optimize weights on validation data:
from scipy.optimize import minimize

def evaluate_weights(weights_array, model_names, forecaster, val_df):
    """Objective: ensemble MAE on the validation set (to be minimized)."""
    weights = {name: w for name, w in zip(model_names, weights_array)}
    _, metrics = forecaster.evaluate_point_forecast(
        val_df,
        ensemble_strategy="weighted",
        ensemble_weights=weights,
    )
    ensemble_mae = metrics[metrics["model"] == "Ensemble"]["mae"].values[0]
    return ensemble_mae

model_names = ["xgboost", "lightgbm", "catboost"]
initial_weights = [1/3, 1/3, 1/3]

# SLSQP supports both bounds and equality constraints.
# Nelder-Mead cannot handle constraints, so the sum-to-one condition would be ignored.
result = minimize(
    evaluate_weights,
    x0=initial_weights,
    args=(model_names, forecaster, val_df),
    method="SLSQP",
    constraints=[{"type": "eq", "fun": lambda w: sum(w) - 1.0}],  # sum = 1
    bounds=[(0, 1) for _ in model_names],  # all in [0, 1]
)
optimal_weights = {name: float(w) for name, w in zip(model_names, result.x)}
print(f"Optimal weights: {optimal_weights}")

# Evaluate on test set with optimized weights
test_results, test_metrics = forecaster.evaluate_point_forecast(
    test_df,
    ensemble_strategy="weighted",
    ensemble_weights=optimal_weights,
)
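Because the weighted blend is linear in the model predictions, weights can also be tuned on cached validation forecasts without calling the forecaster on every iteration. The sketch below does a coarse grid search over the weight simplex in plain Python. It assumes you have already extracted per-model validation forecasts into lists; the helper names and the toy data are hypothetical.

```python
from itertools import product

def mae(forecast, actual):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)

def grid_search_weights(preds, actual, step=0.1):
    """Exhaustively search weights on a simplex grid (illustrative, not part of Twiga).

    preds: dict model name -> list of validation forecasts; actual: list of targets.
    """
    names = list(preds)
    n_ticks = round(1 / step)
    best_w, best_mae = None, float("inf")
    # Enumerate tick counts for all but the last model; the last gets the remainder,
    # so every candidate sums to 1 by construction.
    for ticks in product(range(n_ticks + 1), repeat=len(names) - 1):
        if sum(ticks) > n_ticks:
            continue
        all_ticks = (*ticks, n_ticks - sum(ticks))
        w = [t / n_ticks for t in all_ticks]
        blend = [
            sum(wi * p for wi, p in zip(w, step_preds))
            for step_preds in zip(*(preds[n] for n in names))
        ]
        score = mae(blend, actual)
        if score < best_mae:
            best_w, best_mae = dict(zip(names, w)), score
    return best_w, best_mae

# Toy validation data (made up for the example).
actual = [100.0, 110.0, 120.0, 130.0]
preds = {
    "xgboost": [102.0, 111.0, 118.0, 133.0],
    "lightgbm": [96.0, 108.0, 123.0, 126.0],
    "catboost": [105.0, 115.0, 125.0, 135.0],
}
best_w, best_mae = grid_search_weights(preds, actual)
print(best_w, best_mae)
```

The grid includes the corners where one model gets all the weight, so the result is never worse on the validation set than the best single model. That guarantee does not carry over to unseen data, which is why the final check on the test set still matters.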
See Also#
Model Registry - Register multiple models
Backtesting - Rolling validation for weight tuning
Forecast Results - Understanding the output
Metrics - Evaluating ensembles