Zero-Shot Foundation Models#
Author: Anthony Faustine, sambaiga@gmail.com
What you’ll build
Five zero-shot probabilistic forecasts on the same dataset, using five different foundation models:
Chronos-2 - Amazon’s 120M-parameter encoder-only transformer (21 quantiles).
TabICLv2 - A tabular foundation model that performs forecasting via in-context learning on engineered temporal features (9 quantiles).
Moirai 2.0 - Salesforce’s universal time-series transformer; quantile output via the uni2ts package (9 quantiles).
TimesFM 2.5 - Google’s 200M decoder-only transformer with a 16k context and a continuous quantile head (9 quantiles).
Lag-Llama - A decoder-only transformer using lag features; the first open-source TSFM (sample-based → 9 quantiles).
All five download pre-trained weights and produce quantile predictions without training on your data. We then benchmark them against a seasonal-naive baseline.
Prerequisites
01 - Getting Started (DataPipelineConfig, TwigaForecaster)
08 - Quantile Regression (quantile concepts, prediction intervals)
A GPU is helpful but not required (inference on CPU works, just slower)
Learning objectives
By the end of this notebook you will be able to:
Explain the paradigm shift: pre-trained, task-agnostic models vs. data-specific learning
Load and evaluate five zero-shot quantile forecasters side-by-side
Compare different foundation-model architectures (autoregressive, in-context tabular, universal transformer, lag-feature decoder)
Interpret when foundation models excel vs. when domain-specific fine-tuning wins
Visualise quantile fan charts and coverage plots from foundation models
1. The Foundation Model Paradigm#
Traditional approach (Notebooks 05-10):
Your data → Feature engineering → Train model → Evaluate
Model learns from your dataset only
Hyperparameter tuning required (5–100× slower)
Strong performance on YOUR domain IF you have enough data
Weak on new, unfamiliar patterns
Foundation model approach:
Pre-trained on 2M+ time series → Download weights → Forecast immediately
Model learned from internet-scale diversity (energy, traffic, retail, weather, …)
Zero training time - deploy in seconds
Works reasonably on most domains WITHOUT fine-tuning
May underperform on highly specialised data
Chronos-2 Architecture#
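Chronos-2 is an encoder-only transformer. Unlike the original Chronos, which decoded tokens one at a time, it produces the whole forecast in a single forward pass:
Input: scale the lookback window and split it into patches, each embedded as one token
Encoder: transformer layers attend over the context patches to build a representation of the series
Output: 21 quantile values per forecast step, emitted in parallel (τ = 0.01, 0.05, 0.1, …, 0.9, 0.95, 0.99)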
The 21 quantile levels are fixed by Chronos-2’s architecture - you cannot request arbitrary quantiles, but you get a rich probabilistic forecast “for free”.
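Even with a fixed grid, each quantile level τ can be paired with its mirror 1 − τ to form a central prediction interval with nominal coverage 1 − 2τ. The sketch below enumerates those pairs for any grid. The helper name `central_intervals` is illustrative and not part of Twiga or Chronos-2, and the 9-level grid is used only as a small example input.

```python
def central_intervals(levels, tol=1e-9):
    """Return (lower, upper, nominal_coverage) for every symmetric pair in a quantile grid."""
    levels = sorted(levels)
    pairs = []
    for lo in levels:
        if lo >= 0.5:
            break  # the median and everything above it has no lower partner left
        hi = 1.0 - lo
        # Keep the pair only if the mirrored level is actually on the grid
        if any(abs(hi - q) < tol for q in levels):
            pairs.append((lo, hi, round(hi - lo, 6)))
    return pairs


grid = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
for lo, hi, cov in central_intervals(grid):
    print(f"[{lo:.1f}, {hi:.1f}] -> nominal coverage {cov:.0%}")
```

A denser grid simply yields more intervals to choose from; it does not let you pick a coverage level that falls between grid points.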
When to use Chronos-2#
| Use Case | Recommendation |
|---|---|
| Rapid prototyping | ✓ Use Chronos-2 first - validate the problem in minutes |
| Small datasets (<1000 samples) | ✓ Chronos-2 may outperform underfitted QR models |
| New domains (unfamiliar patterns) | ✓ Chronos-2’s pre-training covers many scenarios |
| Production on limited data | ✓ No fine-tuning risk, stable predictions |
| Large, domain-specific datasets | ✗ Fine-tuned QR or NN models often win |
| Extreme latency constraints | ✗ Chronos-2 inference is ~seconds per window (batch > 1 helps) |
| Custom quantile levels | ✗ You are locked to the 21 fixed quantiles |
2. Setup#
import warnings
from great_tables import GT, md
from IPython.display import clear_output
from lets_plot import LetsPlot
import pandas as pd
from twiga import TwigaForecaster
from twiga.core.config import DataPipelineConfig, ExperimentConfig
from twiga.core.plot import plot_forecast_grid, plot_metrics_bar
from twiga.core.plot.gt import twiga_gt, twiga_report
from twiga.core.utils import configure, get_logger
LetsPlot.setup_html()
warnings.filterwarnings("ignore")
configure()
log = get_logger("tutorials")
Load data#
data = pd.read_parquet("../data/MLVS-PT.parquet")
data = data[["timestamp", "NetLoad(kW)"]].copy()
data["timestamp"] = pd.to_datetime(data["timestamp"])
data = data.drop_duplicates(subset="timestamp").reset_index(drop=True)
# Restrict to 2019-2020 for fast tutorial execution
data = data[(data["timestamp"] >= "2019-01-01") & (data["timestamp"] <= "2020-12-31")].reset_index(drop=True)
log.info("Shape: %s", data.shape)
log.info("Period: %s -> %s", data["timestamp"].min().date(), data["timestamp"].max().date())
2026-06-14 20:31:10 | INFO | twiga.tutorials | Shape: (33553, 2)
2026-06-14 20:31:10 | INFO | twiga.tutorials | Period: 2019-02-01 -> 2020-12-31
Train / test split#
train_df = data[data["timestamp"] < "2020-07-01"].reset_index(drop=True)
test_df = data[data["timestamp"] >= "2020-07-01"].reset_index(drop=True)
log.info(
"train : %d rows (%s -> %s)",
len(train_df),
train_df["timestamp"].min().date(),
train_df["timestamp"].max().date(),
)
log.info(
"test : %d rows (%s -> %s)",
len(test_df),
test_df["timestamp"].min().date(),
test_df["timestamp"].max().date(),
)
2026-06-14 20:31:10 | INFO | twiga.tutorials | train : 24766 rows (2019-02-01 -> 2020-06-30)
2026-06-14 20:31:10 | INFO | twiga.tutorials | test : 8787 rows (2020-07-01 -> 2020-12-31)
3. Zero-Shot Chronos-2 Foundation Model#
Key property: Zero-shot means fit() does not train.
The TwigaForecaster.fit() method:
Validates input dimensions
Loads the pre-trained Chronos-2 weights from HuggingFace
Returns immediately (no optimization loop)
All CV folds run through the same pre-trained model - there is no per-fold retraining.
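The contract described above can be sketched with a toy class. This is a minimal illustration under stated assumptions, not Twiga's implementation: `ZeroShotForecaster`, its `load_weights` callable, and the moving-average "model" are all hypothetical stand-ins. The point is only that `fit()` validates and loads, and never runs an optimisation step.

```python
class ZeroShotForecaster:
    """Toy stand-in: fit() validates input and loads weights, but never trains."""

    def __init__(self, load_weights, horizon):
        self._load_weights = load_weights  # callable returning pre-trained "weights"
        self.horizon = horizon
        self.weights = None
        self.n_gradient_steps = 0  # stays at zero: there is no optimisation loop

    def fit(self, series):
        if len(series) < self.horizon:
            raise ValueError("need at least one horizon of history")
        if self.weights is None:
            self.weights = self._load_weights()  # download or read from cache
        return self

    def predict(self, context):
        # Placeholder model: repeat the mean of the last `window` values
        window = self.weights["window"]
        recent = context[-window:]
        level = sum(recent) / len(recent)
        return [level] * self.horizon


model = ZeroShotForecaster(lambda: {"window": 4}, horizon=3).fit([1, 2, 3, 4, 5, 6])
forecast = model.predict([2, 4, 6, 8])
```

Because nothing is learned in `fit()`, calling it once per cross-validation fold yields the same model every time, which is why the folds differ only in the data they are evaluated on.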
from twiga.models.foundational.chronos2_model import Chronos2Config
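# Assumes data_config (DataPipelineConfig) and train_config (ExperimentConfig) already exist in the session
# device="cuda" if you have a GPU; "cpu" works but is slower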
chronos_config = Chronos2Config(device="cpu")
forecaster_chronos = TwigaForecaster(
data_params=data_config,
model_params=[chronos_config],
cv_params=train_config,
)
# fit() downloads the model but does NOT train
forecaster_chronos.fit(train_df=train_df)
# clear_output()
log.info("Chronos-2 zero-shot model ready for inference.")
2026-06-14 20:31:10 | INFO | twiga.forecaster.base | ─────────────────── Training chronos2 ────────────────────
2026-06-14 20:31:12 | INFO | twiga.models.foundational.chronos2_model | Loading amazon/chronos-2 on device=cpu
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
2026-06-14 20:31:13 | INFO | twiga.models.foundational.chronos2_model | Chronos-2 ready device=cpu horizon=48
2026-06-14 20:31:13 | INFO | twiga.forecaster.base | Training chronos2 complete duration=0.04 min
2026-06-14 20:31:13 | INFO | twiga.tutorials | Chronos-2 zero-shot model ready for inference.
Evaluate Chronos-2’s quantile forecast#
Chronos-2 outputs 21 quantiles. We evaluate it exactly like a trained QR model.
pred_chronos, metric_chronos = forecaster_chronos.evaluate_quantile_forecast(test_df=test_df)
clear_output()
log.info("Chronos-2 evaluation complete.")
log.info("Predictions shape: %s", pred_chronos.shape)
log.info("Quantile columns: %s", [c for c in pred_chronos.columns if c.startswith("q_")][:5])
2026-06-14 20:31:16 | INFO | twiga.tutorials | Chronos-2 evaluation complete.
2026-06-14 20:31:16 | INFO | twiga.tutorials | Predictions shape: (8784, 5)
2026-06-14 20:31:16 | INFO | twiga.tutorials | Quantile columns: []
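def get_metric_table(metric_df: pd.DataFrame, metric_cols: list[str] | None = None):
    """Average the chosen metrics per model and render them as a Twiga report table."""
    if metric_cols is None:
        metric_cols = ["mae", "corr", "pinball", "crps", "sharpness", "calibration_error"]
    display_names = {
        "mae": "MAE",
        "corr": "Corr",
        "pinball": "PINBALL",
        "crps": "CRPS",
        "sharpness": "SHARPNESS",
        "calibration_error": "Cal-err",
    }
    res = metric_df.groupby("Model")[metric_cols].mean().round(2).reset_index()
    res = res.rename(columns=display_names)
    # Only list the metrics that were actually requested
    metric_name = [display_names.get(col, col) for col in metric_cols]
    # Lower is better for the error-type scores; higher is better for correlation
    minimize_cols = [m for m in ["MAE", "PINBALL", "CRPS", "Cal-err"] if m in metric_name]
    maximize_cols = [m for m in ["Corr"] if m in metric_name]
    return twiga_report(res, metric_name, minimize_cols, maximize_cols)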
get_metric_table(metric_chronos)
| Model Performance | ||||||
| Metric comparison across all evaluated models | ||||||
| Model | MAE | Corr | PINBALL | CRPS | SHARPNESS | Cal-err |
|---|---|---|---|---|---|---|
| CHRONOS2 | 3.770 | 0.890 | 1.310 | 2.680 | 8.510 | 0.120 |
| Twiga Forecast | ||||||
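For readers new to the probabilistic columns, the sketch below shows how pinball loss, calibration error, and sharpness are commonly defined. These are textbook-style definitions written for illustration; Twiga's exact formulas (averaging, interval choice, scaling) are not shown in this notebook and may differ, and the numbers below are toy inputs rather than results.

```python
def pinball_loss(y, q_pred, tau):
    """Pinball (quantile) loss for one observation and one quantile level."""
    diff = y - q_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_pinball(actuals, quantile_preds, levels):
    """Average pinball loss over all observations and all quantile levels."""
    total, count = 0.0, 0
    for y, preds in zip(actuals, quantile_preds):
        for tau, q in zip(levels, preds):
            total += pinball_loss(y, q, tau)
            count += 1
    return total / count


def calibration_error(actuals, quantile_preds, levels):
    """Mean absolute gap between empirical and nominal coverage of each quantile."""
    n = len(actuals)
    gaps = []
    for j, tau in enumerate(levels):
        hits = sum(1 for y, preds in zip(actuals, quantile_preds) if y <= preds[j])
        gaps.append(abs(hits / n - tau))
    return sum(gaps) / len(gaps)


def sharpness(quantile_preds):
    """Mean width between the lowest and highest predicted quantile."""
    return sum(p[-1] - p[0] for p in quantile_preds) / len(quantile_preds)


levels = [0.1, 0.5, 0.9]
actuals = [10.0, 12.0, 9.0, 11.0]
quantile_preds = [[8.0, 10.0, 12.0] for _ in actuals]  # same toy forecast at every step

print(mean_pinball(actuals, quantile_preds, levels))
print(calibration_error(actuals, quantile_preds, levels))
print(sharpness(quantile_preds))
```

Sharpness on its own is not a quality score: a narrow interval is only useful if calibration error stays low at the same time.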
4. Zero-Shot TabICL Foundation Model#
TabICLv2 takes a different architectural route to zero-shot forecasting: rather than autoregressively decoding tokenised values, it engineers temporal features (lags, calendar context, etc.) and runs in-context learning with a tabular foundation model. The pre-training mixes synthetic and real tabular datasets, so it generalises across domains without per-task fitting.
Same Twiga interface - only the model swap differs.
9 quantiles by default: [0.1, 0.2, ..., 0.9]
Inference is ensembled across n_estimators members (4–8 is a good speed/quality trade-off on CPU)
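To make "engineered temporal features" concrete, the sketch below turns a univariate half-hourly series into tabular rows of lag and calendar features, which is the kind of table an in-context tabular learner consumes. The function name `make_feature_rows` and the specific lags (1 step, 1 day, 1 week) are assumptions for illustration; the features TabICL's forecasting wrapper actually builds are not documented here.

```python
from datetime import datetime, timedelta


def make_feature_rows(timestamps, values, lags=(1, 48, 336)):
    """Turn a univariate series into (features, target) rows for a tabular model."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(values)):
        ts = timestamps[t]
        features = {f"lag_{k}": values[t - k] for k in lags}  # past values only
        features["hour"] = ts.hour + ts.minute / 60.0         # time of day
        features["dayofweek"] = ts.weekday()                  # Monday = 0
        rows.append((features, values[t]))
    return rows


start = datetime(2020, 7, 1)
stamps = [start + timedelta(minutes=30 * i) for i in range(400)]
load = [float(i % 48) for i in range(400)]  # synthetic series with a daily pattern

rows = make_feature_rows(stamps, load)
first_features, first_target = rows[0]
```

The context rows play the role of "training examples" shown to the model at inference time, so no weights are updated even though the table contains targets.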
from twiga.models.foundational.tabicl_model import TabICLConfig
# device="cuda" if you have a GPU; "cpu" works but is slower
tabicl_config = TabICLConfig(device="cpu", n_estimators=4, max_context_length=48 * 30)
forecaster_tabicl = TwigaForecaster(
data_params=data_config,
model_params=[tabicl_config],
cv_params=train_config,
)
# fit() instantiates the forecaster; weights are loaded lazily on first predict
try:
    forecaster_tabicl.fit(train_df=train_df)
    tabicl_available = True
    log.info("TabICL zero-shot model ready for inference.")
except ImportError as e:
    tabicl_available = False
    log.warning("TabICL unavailable (numpy<2.2 conflict with gluonts): %s", e)
2026-06-14 20:31:16 | INFO | twiga.forecaster.base | ──────────────────── Training tabicl ─────────────────────
2026-06-14 20:31:16 | INFO | twiga.models.foundational.tabicl_model | Fitting TabICL model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:16 | WARNING | twiga.tutorials | TabICL unavailable (numpy<2.2 conflict with gluonts): tabicl library required for TabICLModel. Install with: pip install 'tabicl[forecast]>=2.1' (kept out of twiga[foundational] due to a numpy<2.2 pin transitively imposed by gluonts).
Evaluate TabICL’s quantile forecast#
TabICL produces 9 quantiles. We evaluate it through the same Twiga quantile-forecast pathway as Chronos-2.
import pandas as pd
if tabicl_available:
    pred_tabicl, metric_tabicl = forecaster_tabicl.evaluate_quantile_forecast(test_df=test_df)
    clear_output()
    log.info("TabICL evaluation complete.")
    log.info("Predictions shape: %s", pred_tabicl.shape)
else:
    pred_tabicl = pd.DataFrame()
    metric_tabicl = pd.DataFrame()
    log.info("TabICL skipped - not available in this environment.")
2026-06-14 20:31:16 | INFO | twiga.tutorials | TabICL skipped - not available in this environment.
if tabicl_available:
    get_metric_table(metric_tabicl)
else:
    log.info("TabICL skipped - no metrics table to display.")
2026-06-14 20:31:16 | INFO | twiga.tutorials | TabICL skipped - no metrics table to display.
5. Zero-Shot Moirai Foundation Model#
Moirai (Salesforce Research) is a universal time-series transformer trained on a massive multi-domain corpus. We use Moirai 2.0, which produces native quantile output. The same wrapper also supports Moirai 1.x and Moirai-MoE via the model_type config field.
Moirai is loaded through the uni2ts package and runs the GluonTS predictor pipeline under the hood; Twiga reshapes the output into the same (batch, n_quantiles, horizon) contract as the other foundation models.
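The reshaping step can be illustrated with plain nested lists. This sketch assumes, purely for illustration, that a predictor returns values ordered as (batch, horizon, n_quantiles); the actual layout produced by the GluonTS pipeline and Twiga's internal conversion are not shown in this notebook, and `to_quantile_major` is a hypothetical helper name.

```python
def to_quantile_major(batch):
    """Reorder nested lists from (batch, horizon, n_quantiles) to (batch, n_quantiles, horizon)."""
    return [[list(column) for column in zip(*series)] for series in batch]


# One series, horizon of 2 steps, 3 quantiles per step
horizon_major = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
quantile_major = to_quantile_major(horizon_major)
# quantile_major[0][q] is now the full trajectory of quantile q over the horizon
```

Keeping one layout across all models is what lets the same evaluation code score every forecaster without model-specific branches.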
from twiga.models.foundational.moirai_model import MoiraiConfig
try:
    moirai_config = MoiraiConfig(
        model_type="moirai2",
        size="small",
        device="cpu",
        frequency="30min",
    )
    forecaster_moirai = TwigaForecaster(
        data_params=data_config,
        model_params=[moirai_config],
        cv_params=train_config,
    )
    forecaster_moirai.fit(train_df=train_df)
    moirai_available = True
    log.info("Moirai zero-shot model ready for inference.")
except ImportError as e:
    moirai_available = False
    pred_moirai = pd.DataFrame()
    metric_moirai = pd.DataFrame()
    log.warning("Moirai unavailable (uni2ts not installed): %s", e)
2026-06-14 20:31:16 | INFO | twiga.forecaster.base | ──────────────────── Training moirai ─────────────────────
2026-06-14 20:31:16 | INFO | twiga.models.foundational.moirai_model | Fitting Moirai model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:16 | WARNING | twiga.tutorials | Moirai unavailable (uni2ts not installed): uni2ts library required for MoiraiModel. Install with: pip install 'uni2ts>=1.2'
Evaluate Moirai’s quantile forecast#
Moirai 2.0 emits quantile outputs natively. Twiga reduces these to its standard 9-quantile grid and evaluates via the same quantile-forecast path as Chronos-2 and TabICL.
if moirai_available:
    pred_moirai, metric_moirai = forecaster_moirai.evaluate_quantile_forecast(test_df=test_df)
    clear_output()
    log.info("Moirai evaluation complete.")
    log.info("Predictions shape: %s", pred_moirai.shape)
else:
    log.info("Moirai skipped - not available in this environment.")
2026-06-14 20:31:16 | INFO | twiga.tutorials | Moirai skipped - not available in this environment.
if moirai_available:
    get_metric_table(metric_moirai)
else:
    log.info("Moirai skipped - no metrics table to display.")
2026-06-14 20:31:16 | INFO | twiga.tutorials | Moirai skipped - no metrics table to display.
6. Zero-Shot TimesFM Foundation Model#
TimesFM 2.5 (Google Research) is a 200M-parameter decoder-only transformer with a 16k-token context and a continuous quantile head. It decodes the forecast autoregressively in patches and emits 9 quantile levels (0.1 … 0.9) natively.
TimesFM 2.5 is not on PyPI - install it from source:
pip install "git+https://github.com/google-research/timesfm.git#egg=timesfm[torch]".
from twiga.models.foundational.timesfm_model import TimesFMConfig
try:
    timesfm_config = TimesFMConfig(device="cpu")
    forecaster_timesfm = TwigaForecaster(
        data_params=data_config,
        model_params=[timesfm_config],
        cv_params=train_config,
    )
    forecaster_timesfm.fit(train_df=train_df)
    timesfm_available = True
    log.info("TimesFM zero-shot model ready for inference.")
except ImportError as e:
    timesfm_available = False
    pred_timesfm = pd.DataFrame()
    metric_timesfm = pd.DataFrame()
    log.warning("TimesFM unavailable (timesfm not installed): %s", e)
2026-06-14 20:31:16 | INFO | twiga.forecaster.base | ──────────────────── Training timesfm ────────────────────
2026-06-14 20:31:16 | INFO | twiga.models.foundational.timesfm_model | Fitting TimesFM model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:16 | WARNING | twiga.tutorials | TimesFM unavailable (timesfm not installed): timesfm library required for TimesFMModel. TimesFM 2.5 is not on PyPI; install with: pip install "git+https://github.com/google-research/timesfm.git#egg=timesfm[torch]"
Evaluate TimesFM’s quantile forecast#
TimesFM 2.5’s continuous quantile head yields 9 quantiles. We evaluate it through the same quantile-forecast path as the other foundation models.
if timesfm_available:
    pred_timesfm, metric_timesfm = forecaster_timesfm.evaluate_quantile_forecast(test_df=test_df)
    clear_output()
    log.info("TimesFM evaluation complete.")
    log.info("Predictions shape: %s", pred_timesfm.shape)
else:
    log.info("TimesFM skipped - not available in this environment.")
2026-06-14 20:31:16 | INFO | twiga.tutorials | TimesFM skipped - not available in this environment.
if timesfm_available:
    get_metric_table(metric_timesfm)
else:
    log.info("TimesFM skipped - no metrics table to display.")
2026-06-14 20:31:16 | INFO | twiga.tutorials | TimesFM skipped - no metrics table to display.
7. Zero-Shot Lag-Llama Foundation Model#
Lag-Llama was the first open-source time-series foundation model: a decoder-only transformer that ingests lag features rather than raw sequences. It is sample-based; we draw 100 samples per series and reduce them to the standard 9-quantile grid.
Lag-Llama is not on PyPI as a forecasting checkpoint - install the Python package, then download the weights into the Twiga model directory. Run these from the project root (twiga-forecast/):
pip install git+https://github.com/time-series-foundation-models/lag-llama.git
pip install "gluonts[torch]<=0.14.4" pytorch-lightning
huggingface-cli download time-series-foundation-models/Lag-Llama \
    lag-llama.ckpt --local-dir twiga/models/foundational/lag-llama
This places the checkpoint at twiga/models/foundational/lag-llama/lag-llama.ckpt. The wrapper’s default ckpt_path="lag-llama/lag-llama.ckpt" is resolved against the project root automatically, so it’s found regardless of where the notebook runs from.
from twiga.models.foundational.lag_llama_model import LagLlamaConfig
try:
    lagllama_config = LagLlamaConfig(
        ckpt_path="lag-llama/lag-llama.ckpt",
        device="cpu",
        num_samples=100,
    )
    forecaster_lagllama = TwigaForecaster(
        data_params=data_config,
        model_params=[lagllama_config],
        cv_params=train_config,
    )
    forecaster_lagllama.fit(train_df=train_df)
    lagllama_available = True
    log.info("Lag-Llama zero-shot model ready for inference.")
except (ImportError, FileNotFoundError) as e:
    lagllama_available = False
    pred_lagllama = pd.DataFrame()
    metric_lagllama = pd.DataFrame()
    log.warning("Lag-Llama unavailable: %s", e)
2026-06-14 20:31:16 | INFO | twiga.forecaster.base | ─────────────────── Training lag_llama ───────────────────
2026-06-14 20:31:16 | INFO | twiga.models.foundational.lag_llama_model | Fitting Lag-Llama model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:16 | WARNING | twiga.tutorials | Lag-Llama unavailable: Lag-Llama checkpoint not found at lag-llama/lag-llama.ckpt. Download with: `huggingface-cli download time-series-foundation-models/Lag-Llama lag-llama.ckpt --local-dir lag-llama` and clone the repo into the same directory.
Evaluate Lag-Llama’s quantile forecast#
Lag-Llama emits probabilistic samples. Twiga reduces these to its standard 9-quantile grid and evaluates via the same quantile-forecast path as the other foundation models.
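Reducing sample paths to a quantile grid amounts to taking empirical quantiles at each horizon step. The sketch below does this with linear interpolation between order statistics; the helper names are illustrative, and the interpolation rule Twiga uses internally is an assumption, not something shown in this notebook. The example uses 101 synthetic paths so the arithmetic is easy to check by hand.

```python
def empirical_quantile(samples, tau):
    """Linear-interpolation quantile of a list of samples (0 <= tau <= 1)."""
    ordered = sorted(samples)
    pos = tau * (len(ordered) - 1)      # fractional index into the sorted samples
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def samples_to_quantiles(sample_paths, levels):
    """Reduce sample trajectories (n_samples x horizon) to quantiles (n_levels x horizon)."""
    horizon = len(sample_paths[0])
    return [
        [empirical_quantile([path[h] for path in sample_paths], tau) for h in range(horizon)]
        for tau in levels
    ]


# Synthetic draws: step 0 spans 0..100, step 1 spans 0..200
paths = [[float(i), float(2 * i)] for i in range(101)]
quantiles = samples_to_quantiles(paths, levels=[0.1, 0.5, 0.9])
```

With only 100 samples, the extreme quantiles (0.1 and 0.9) rest on roughly ten draws each side, so they are noisier than the median; raising `num_samples` trades inference time for smoother tails.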
if lagllama_available:
    pred_lagllama, metric_lagllama = forecaster_lagllama.evaluate_quantile_forecast(test_df=test_df)
    clear_output()
    log.info("Lag-Llama evaluation complete.")
    log.info("Predictions shape: %s", pred_lagllama.shape)
else:
    log.info("Lag-Llama skipped - not available in this environment.")
2026-06-14 20:31:16 | INFO | twiga.tutorials | Lag-Llama skipped - not available in this environment.
if lagllama_available:
    get_metric_table(metric_lagllama)
else:
    log.info("Lag-Llama skipped - no metrics table to display.")
2026-06-14 20:31:16 | INFO | twiga.tutorials | Lag-Llama skipped - no metrics table to display.
8. Benchmark: Five foundation models vs a simple baseline#
Compare all five zero-shot foundation models against a seasonal-naive baseline on the same test set.
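For reference, a seasonal-naive forecast simply repeats the value observed one season earlier. The sketch below is a generic version of that rule, assuming a weekly season of 7 × 48 = 336 half-hourly steps; it is not Twiga's `SEASONALNAIVEConfig` implementation, whose internals are not shown here.

```python
def seasonal_naive(history, horizon, season_length):
    """Forecast each future step with the value observed one season earlier."""
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    forecast = []
    for h in range(horizon):
        # Position within the last observed season; wraps when horizon > season_length
        idx = len(history) - season_length + (h % season_length)
        forecast.append(history[idx])
    return forecast


STEPS_PER_WEEK = 7 * 48  # weekly season at 30-minute resolution
history = [float(i % STEPS_PER_WEEK) for i in range(2 * STEPS_PER_WEEK)]  # two synthetic weeks
next_day = seasonal_naive(history, horizon=48, season_length=STEPS_PER_WEEK)
```

A baseline this cheap is a useful floor: a foundation model that cannot beat "same time last week" on load data is not adding value.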
from twiga.models.baseline.seasonal_naive_model import SEASONALNAIVEConfig
seasonal_config = SEASONALNAIVEConfig(period="7D", freq="30min")
forecaster_seasonal = TwigaForecaster(
data_params=data_config,
model_params=[seasonal_config],
cv_params=train_config,
)
forecaster_seasonal.fit(train_df=train_df)
clear_output()
pred_seasonal, metric_seasonal = forecaster_seasonal.evaluate_point_forecast(test_df=test_df)
clear_output()
get_metric_table(metric_seasonal, metric_cols=["mae", "corr"])
| Model Performance | ||
| Metric comparison across all evaluated models | ||
| Model | MAE | Corr |
|---|---|---|
| SEASONAL_NAIVE | 5.090 | 0.830 |
| Twiga Forecast | ||
metrics_all = pd.concat(
[
df
for df in [metric_chronos, metric_tabicl, metric_moirai, metric_timesfm, metric_lagllama, metric_seasonal]
if not df.empty
],
ignore_index=True,
)
res = (
metrics_all.groupby("Model", sort=False)[["mae", "rmse", "corr", "wmape", "smape", "nbias"]]
.mean()
.round(4)
.reset_index()
)
res = res.rename(
columns={
"mae": "MAE",
"corr": "Corr",
"wmape": "WMAPE",
"smape": "SMAPE",
"nbias": "NBIAS",
"rmse": "RMSE",
}
)
twiga_report(
res,
["MAE", "Corr", "SMAPE", "RMSE"],
["MAE", "SMAPE", "RMSE"],
["Corr"],
)
| Model Performance | ||||||
| Metric comparison across all evaluated models | ||||||
| Model | MAE | RMSE | Corr | WMAPE | SMAPE | NBIAS |
|---|---|---|---|---|---|---|
| CHRONOS2 | 3.772 | 5.066 | 0.891 | 10.2376 | 12.357 | -0.0709 |
| SEASONAL_NAIVE | 5.088 | 6.724 | 0.828 | 13.7421 | 16.690 | 0.5737 |
| Twiga Forecast | ||||||
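One way to read the table is as a skill score: the fraction of the baseline's error that the model removes. The short sketch below applies that formula to the MAE and RMSE values reported above; `skill_score` is an illustrative helper, not a Twiga function.

```python
def skill_score(model_error, baseline_error):
    """Relative error reduction versus a baseline (1.0 = perfect, 0.0 = no better)."""
    return 1.0 - model_error / baseline_error


mae_skill = skill_score(3.772, 5.088)   # Chronos-2 vs seasonal-naive MAE from the table
rmse_skill = skill_score(5.066, 6.724)  # Chronos-2 vs seasonal-naive RMSE from the table

print(f"MAE skill:  {mae_skill:.1%}")
print(f"RMSE skill: {rmse_skill:.1%}")
```

On this test set Chronos-2 removes roughly a quarter of the seasonal-naive error on both metrics, without any training on the dataset.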
9. Key Insights#
When learned (per-dataset) models win#
Large domain-specific datasets (> 10 k samples) with stable dynamics
Extreme precision requirements (financial trading, critical infrastructure)
Need custom quantile levels or strict interval shapes
Latency constraints (inference must be < 100 ms)
Phase 2: Fine-tuning#
Future Twiga releases will expose fine-tuning hooks for each foundation model:
Chronos2Config(device="cuda", fine_tune=True) # planned
TabICLConfig(fine_tune=True) # FinetunedTabICLRegressor already exists upstream
MoiraiConfig(model_type="moirai", fine_tune=True) # MoiraiFinetune supports Moirai 1.x / MoE
LagLlamaConfig(fine_tune=True) # Lag-Llama exposes a PyTorch Lightning training loop upstream
# TimesFM 2.5 fine-tuning is out of scope for the wrapper; use the upstream examples/finetuning recipe
This combines the best of both worlds: transfer learning from a broad pre-training prior plus domain-specific adaptation.
10. Forecast Traces#
Visual comparison of predictions across the test set.
preds_all = pd.concat(
[df for df in [pred_chronos, pred_tabicl, pred_moirai, pred_timesfm, pred_lagllama, pred_seasonal] if not df.empty],
ignore_index=True,
)
p = plot_forecast_grid(
preds_all,
actual_col="Actual",
forecast_col="forecast",
model_col="Model",
n_samples_per_model=7 * 48, # first 7 days of test set
y_label="Net Load (kW)",
title="Chronos-2 vs TabICL vs Moirai vs TimesFM vs Lag-Llama vs Seasonal-Naive - first 7 days of test set",
fig_width=1200,
)
p
11. Context-Length Sensitivity#
How much past does each foundation model actually need to see? We sweep the lookback window across two horizons:
Short-term (1-day ahead, h = 48 steps): contexts of 2 days · 1 week · 2 weeks · 1 month.
Long-term (1-week ahead, h = 336 steps): contexts of 1 week · 2 weeks · 1 month · 2 months · 3 months.
For each (model, context) pair we rebuild the DataPipelineConfig with a new lookback_window_size and forecast_horizon, then re-fit + re-evaluate against the same test_df. Each fit is zero-shot (no training), but CPU inference scales with horizon × context, so the long-term sweep takes meaningfully longer than the short-term one.
⚠️ TabICL’s default max_context_length is 4096; the long-term sweep includes a 3-month context (4320 steps), so we bump that cap in the config for that sweep.
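The step counts follow directly from the 30-minute resolution (48 steps per day). The sketch below converts the long-term contexts to steps and flags any that exceed a given cap; `contexts_over_cap` is an illustrative helper, and months are approximated as 30 days as in the sweep definitions.

```python
STEPS_PER_DAY = 48  # 30-minute resolution


def contexts_over_cap(contexts_in_days, cap):
    """Return {label: steps} for contexts whose length in steps exceeds `cap`."""
    over = {}
    for label, days in contexts_in_days:
        steps = days * STEPS_PER_DAY
        if steps > cap:
            over[label] = steps
    return over


long_term = [("1 week", 7), ("2 weeks", 14), ("1 month", 30), ("2 months", 60), ("3 months", 90)]
print(contexts_over_cap(long_term, cap=4096))  # {'3 months': 4320}
```

Only the 3-month context crosses the 4096-step default, so any cap of at least 4320 is enough for the long-term sweep.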
from twiga.models.foundational.chronos2_model import Chronos2Config
from twiga.models.foundational.lag_llama_model import LagLlamaConfig
from twiga.models.foundational.moirai_model import MoiraiConfig
from twiga.models.foundational.tabicl_model import TabICLConfig
from twiga.models.foundational.timesfm_model import TimesFMConfig
def make_model_configs(*, tabicl_max_context: int = 4096) -> list[tuple[str, object]]:
    """Build a fresh set of (display_name, config) pairs for one sweep iteration."""
    return [
        ("Chronos-2", Chronos2Config(device="cpu")),
        (
            "TabICL",
            TabICLConfig(device="cpu", n_estimators=4, max_context_length=tabicl_max_context),
        ),
        (
            "Moirai-2",
            MoiraiConfig(model_type="moirai2", size="small", device="cpu", frequency="30min"),
        ),
        ("TimesFM", TimesFMConfig(device="cpu")),
        ("Lag-Llama", LagLlamaConfig(device="cpu", num_samples=100)),
    ]


def run_context_sweep(
    *,
    horizon: int,
    contexts: list[tuple[str, int]],
    train_df: pd.DataFrame,
    test_df: pd.DataFrame,
    tabicl_max_context: int = 4096,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Evaluate each foundational model under each context length for a fixed horizon.

    Returns ``(metrics_df, preds_df)`` annotated with ``context_label``,
    ``context_steps`` (in 30-min steps) and ``model_label`` columns ready to plot.
    """
    base_dc_kwargs = {
        "target_feature": "NetLoad(kW)",
        "period": "30min",
        "latitude": 32.371666,
        "longitude": -16.274998,
        "calendar_features": [],
        "known_future_features": [],
        "input_scaler": "standard",
        "target_scaler": "robust",
    }
    metric_frames, pred_frames = [], []
    for ctx_label, ctx_steps in contexts:
        log.info("Context = %s (%d steps), horizon = %d", ctx_label, ctx_steps, horizon)
        dc = DataPipelineConfig(
            **base_dc_kwargs,
            forecast_horizon=horizon,
            lookback_window_size=ctx_steps,
            window_stride=horizon,  # non-overlapping eval windows
        )
        for model_label, model_config in make_model_configs(tabicl_max_context=tabicl_max_context):
            log.info(" %s …", model_label)
            try:
                forecaster = TwigaForecaster(data_params=dc, model_params=[model_config], cv_params=train_config)
                forecaster.fit(train_df=train_df)
                preds, metrics = forecaster.evaluate_quantile_forecast(test_df=test_df)
                clear_output()
                metrics = metrics.assign(context_label=ctx_label, context_steps=ctx_steps, model_label=model_label)
                preds = preds.assign(context_label=ctx_label, context_steps=ctx_steps, model_label=model_label)
                metric_frames.append(metrics)
                pred_frames.append(preds)
            except (ImportError, FileNotFoundError) as e:
                log.warning(" Skipping %s (unavailable): %s", model_label, e)
    return (
        pd.concat(metric_frames, ignore_index=True),
        pd.concat(pred_frames, ignore_index=True),
    )
11.1 Short-term horizon - 1 day ahead (h = 48)#
Sweep contexts: 2 days · 1 week · 2 weeks · 1 month.
short_contexts = [
("2 days", 48 * 2),
("1 week", 48 * 7),
("2 weeks", 48 * 14),
("1 month", 48 * 30),
]
metrics_short, preds_short = run_context_sweep(
horizon=48,
contexts=short_contexts,
train_df=train_df,
test_df=test_df,
)
log.info(
"Short-term sweep complete: %d (model, context) runs",
metrics_short.groupby(["model_label", "context_label"]).ngroups,
)
2026-06-14 20:31:31 | INFO | twiga.tutorials | TabICL …
2026-06-14 20:31:31 | INFO | twiga.forecaster.base | ──────────────────── Training tabicl ─────────────────────
2026-06-14 20:31:31 | INFO | twiga.models.foundational.tabicl_model | Fitting TabICL model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:31 | WARNING | twiga.tutorials | Skipping TabICL (unavailable): tabicl library required for TabICLModel. Install with: pip install 'tabicl[forecast]>=2.1' (kept out of twiga[foundational] due to a numpy<2.2 pin transitively imposed by gluonts).
2026-06-14 20:31:31 | INFO | twiga.tutorials | Moirai-2 …
2026-06-14 20:31:31 | INFO | twiga.forecaster.base | ──────────────────── Training moirai ─────────────────────
2026-06-14 20:31:31 | INFO | twiga.models.foundational.moirai_model | Fitting Moirai model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:31 | WARNING | twiga.tutorials | Skipping Moirai-2 (unavailable): uni2ts library required for MoiraiModel. Install with: pip install 'uni2ts>=1.2'
2026-06-14 20:31:31 | INFO | twiga.tutorials | TimesFM …
2026-06-14 20:31:31 | INFO | twiga.forecaster.base | ──────────────────── Training timesfm ────────────────────
2026-06-14 20:31:31 | INFO | twiga.models.foundational.timesfm_model | Fitting TimesFM model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:31 | WARNING | twiga.tutorials | Skipping TimesFM (unavailable): timesfm library required for TimesFMModel. TimesFM 2.5 is not on PyPI; install with: pip install "git+https://github.com/google-research/timesfm.git#egg=timesfm[torch]"
2026-06-14 20:31:31 | INFO | twiga.tutorials | Lag-Llama …
2026-06-14 20:31:31 | INFO | twiga.forecaster.base | ─────────────────── Training lag_llama ───────────────────
2026-06-14 20:31:31 | INFO | twiga.models.foundational.lag_llama_model | Fitting Lag-Llama model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:31 | INFO | twiga.models.foundational.lag_llama_model | Auto-bumping Lag-Llama context_length 512 → 1440 to fit data lookback
2026-06-14 20:31:31 | WARNING | twiga.tutorials | Skipping Lag-Llama (unavailable): Lag-Llama checkpoint not found at lag-llama/lag-llama.ckpt. Download with: `huggingface-cli download time-series-foundation-models/Lag-Llama lag-llama.ckpt --local-dir lag-llama` and clone the repo into the same directory.
2026-06-14 20:31:31 | INFO | twiga.tutorials | Short-term sweep complete: 4 (model, context) runs
from lets_plot import (
aes,
element_blank,
facet_grid,
facet_wrap,
geom_line,
geom_point,
ggplot,
ggsize,
ggtitle,
labs,
scale_x_continuous,
theme,
)
def plot_context_metrics(metrics_df: pd.DataFrame, horizon_label: str):
    """Faceted line chart of mae / crps / pinball / calibration_error vs context length (in days)."""
    metric_cols = ["mae", "crps", "pinball", "calibration_error"]
    agg = (
        metrics_df.groupby(["model_label", "context_label", "context_steps"], sort=False)[metric_cols]
        .mean()
        .reset_index()
    )
    agg["context_days"] = agg["context_steps"] / 48.0
    long = agg.melt(
        id_vars=["model_label", "context_label", "context_steps", "context_days"],
        value_vars=metric_cols,
        var_name="metric",
        value_name="value",
    )
    return (
        ggplot(long, aes(x="context_days", y="value", color="model_label"))
        + geom_line(size=1)
        + geom_point(size=2.5)
        + facet_wrap("metric", ncol=2, scales="free_y")
        + labs(
            x="Context length (days)",
            y="Metric value",
            color="Model",
            title=f"Context-length sensitivity - {horizon_label}",
        )
        + ggsize(900, 600)
    )
plot_context_metrics(metrics_short, "1-day horizon")
def build_trace_df(preds_df: pd.DataFrame, n_days: int = 2) -> pd.DataFrame:
    """Take the first ``n_days * 48`` rows of each (model, context) trace and reshape for plotting."""
    n_keep = n_days * 48
    sub = (
        preds_df.sort_values(["model_label", "context_label", "timestamp"])
        .groupby(["model_label", "context_label"], sort=False, group_keys=False)
        .head(n_keep)
        .reset_index(drop=True)
    )
    actual = (
        sub[["model_label", "context_label", "context_steps", "timestamp", "Actual"]]
        .rename(columns={"Actual": "value"})
        .assign(series="Actual")
    )
    forecast = (
        sub[["model_label", "context_label", "context_steps", "timestamp", "forecast"]]
        .rename(columns={"forecast": "value"})
        .assign(series="Forecast (median)")
    )
    return pd.concat([actual, forecast], ignore_index=True)


def plot_context_traces(preds_df: pd.DataFrame, horizon_label: str, n_days: int = 2):
    """Grid of actual vs forecast traces: one row per model, one column per context length."""
    long = build_trace_df(preds_df, n_days=n_days)
    # Preserve the original context ordering on the x-facet
    ctx_order = preds_df.drop_duplicates("context_label").sort_values("context_steps")["context_label"].tolist()
    long["context_label"] = pd.Categorical(long["context_label"], categories=ctx_order, ordered=True)
    return (
        ggplot(long, aes(x="timestamp", y="value", color="series"))
        + geom_line(size=0.7)
        + facet_grid(y="model_label", x="context_label")
        + labs(
            x="Time",
            y="Net Load (kW)",
            color="Series",
            title=f"Forecast traces - {horizon_label} (first {n_days} days of test set)",
        )
        + ggsize(1200, 650)
    )
plot_context_traces(preds_short, "1-day horizon", n_days=3)
11.2 Long-term horizon - 1 week ahead (h = 336)#
Sweep contexts: 1 week · 2 weeks · 1 month · 2 months · 3 months. We raise TabICL’s max_context_length to 5000 so the 3-month context (4320 steps) fits.
long_contexts = [
("1 week", 48 * 7),
("2 weeks", 48 * 14),
("1 month", 48 * 30),
("2 months", 48 * 60),
("3 months", 48 * 90),
]
metrics_long, preds_long = run_context_sweep(
horizon=48 * 7,
contexts=long_contexts,
train_df=train_df,
test_df=test_df,
tabicl_max_context=5000,
)
log.info(
"Long-term sweep complete: %d (model, context) runs", metrics_long.groupby(["model_label", "context_label"]).ngroups
)
2026-06-14 20:31:42 | INFO | twiga.tutorials | TabICL …
2026-06-14 20:31:42 | INFO | twiga.forecaster.base | ──────────────────── Training tabicl ─────────────────────
2026-06-14 20:31:42 | INFO | twiga.models.foundational.tabicl_model | Fitting TabICL model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:42 | WARNING | twiga.tutorials | Skipping TabICL (unavailable): tabicl library required for TabICLModel. Install with: pip install 'tabicl[forecast]>=2.1' (kept out of twiga[foundational] due to a numpy<2.2 pin transitively imposed by gluonts).
2026-06-14 20:31:42 | INFO | twiga.tutorials | Moirai-2 …
2026-06-14 20:31:43 | INFO | twiga.forecaster.base | ──────────────────── Training moirai ─────────────────────
2026-06-14 20:31:43 | INFO | twiga.models.foundational.moirai_model | Fitting Moirai model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:43 | WARNING | twiga.tutorials | Skipping Moirai-2 (unavailable): uni2ts library required for MoiraiModel. Install with: pip install 'uni2ts>=1.2'
2026-06-14 20:31:43 | INFO | twiga.tutorials | TimesFM …
2026-06-14 20:31:43 | INFO | twiga.forecaster.base | ──────────────────── Training timesfm ────────────────────
2026-06-14 20:31:43 | INFO | twiga.models.foundational.timesfm_model | Fitting TimesFM model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:43 | WARNING | twiga.tutorials | Skipping TimesFM (unavailable): timesfm library required for TimesFMModel. TimesFM 2.5 is not on PyPI; install with: pip install "git+https://github.com/google-research/timesfm.git#egg=timesfm[torch]"
2026-06-14 20:31:43 | INFO | twiga.tutorials | Lag-Llama …
2026-06-14 20:31:43 | INFO | twiga.forecaster.base | ─────────────────── Training lag_llama ───────────────────
2026-06-14 20:31:43 | INFO | twiga.models.foundational.lag_llama_model | Fitting Lag-Llama model (zero-shot, no training just loading pre-trained weights)
2026-06-14 20:31:43 | INFO | twiga.models.foundational.lag_llama_model | Auto-bumping Lag-Llama context_length 512 → 4320 to fit data lookback
2026-06-14 20:31:43 | WARNING | twiga.tutorials | Skipping Lag-Llama (unavailable): Lag-Llama checkpoint not found at lag-llama/lag-llama.ckpt. Download with: `huggingface-cli download time-series-foundation-models/Lag-Llama lag-llama.ckpt --local-dir lag-llama` and clone the repo into the same directory.
2026-06-14 20:31:43 | INFO | twiga.tutorials | Long-term sweep complete: 5 (model, context) runs
plot_context_metrics(metrics_long, "1-week horizon")
plot_context_traces(preds_long, "1-week horizon", n_days=3)
11.3 Reading the plots#
A few things to look for:
Where does each model’s MAE curve flatten? That’s the “enough context” point for that model.
Does CRPS improve with longer context, or get worse? A worsening trend often means the model is being distracted by stale dynamics.
Calibration vs context - sometimes longer context tightens the predictive distribution past the point where coverage breaks; the calibration-error panel is the canary.
Trace panels: look for systematic phase shifts or amplitude misses that resolve at a particular context length - that’s the model picking up the weekly seasonality.
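The "enough context" point can also be located numerically rather than by eye. The sketch below returns the smallest context after which extending the window improves MAE by less than a relative tolerance. The helper `enough_context`, the 1% tolerance, and the MAE values in the example are all illustrative assumptions; the numbers are made up and are not results from this notebook.

```python
def enough_context(context_steps, mae_values, rel_tol=0.01):
    """Smallest context after which a longer window improves MAE by less than rel_tol."""
    points = sorted(zip(context_steps, mae_values))
    for (ctx, mae), (_, next_mae) in zip(points, points[1:]):
        relative_gain = (mae - next_mae) / mae
        if relative_gain < rel_tol:
            return ctx  # the next, longer context no longer pays off
    return points[-1][0]  # still improving at the longest context tested


# Hypothetical sweep: 2 days, 1 week, 2 weeks, 1 month (in 30-min steps)
steps = [96, 336, 672, 1440]
mae = [4.60, 3.90, 3.88, 3.87]  # made-up values for illustration only
print(enough_context(steps, mae))
```

A negative relative gain (MAE getting worse) also stops the search, which matches the "distracted by stale dynamics" reading above.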
Wrapping up#
What you did
Loaded and split the MLVS-PT dataset
Configured a data pipeline compatible with foundation models
Loaded and evaluated Chronos-2 (encoder-only transformer)
Loaded and evaluated TabICLv2 (tabular in-context learner)
Loaded and evaluated Moirai 2.0 (universal transformer)
Loaded and evaluated TimesFM 2.5 (decoder-only transformer)
Loaded and evaluated Lag-Llama (lag-feature decoder)
Trained a Seasonal-Naive baseline
Built a side-by-side benchmark table
Visualised forecast traces
Key takeaways
Foundation models are a new paradigm: pre-trained knowledge vs. local optimisation
Five models spanning very different designs (patch-based transformers, an in-context tabular learner, a universal transformer, a lag-feature decoder) expose the same Twiga interface
Zero-shot performance is often competitive on diverse domains
Fine-tuning (Phase 2) will let you retain pre-training while adapting to your data
Choose based on: data size, latency budget, precision needs, and domain shift
What’s next?#
10 - Conformal Prediction: Add coverage guarantees to any forecast
Ensemble: Combine Chronos-2 + TabICL + Moirai + TimesFM + Lag-Llama + QR-LightGBM for robustness
Fine-tuning (Phase 2, coming soon): Train foundation models on your data while retaining the pre-training prior
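As a starting point for the ensemble idea, quantile forecasts from several models can be averaged level by level, provided they share the same quantile grid. The sketch below is a minimal, library-free illustration; `average_quantiles` is a hypothetical helper, not a Twiga API, and it assumes each forecast has already been reduced to a common grid (for example, Chronos-2's 21 levels subset to the 9 shared ones).

```python
def average_quantiles(forecasts, weights=None):
    """Weighted average of quantile forecasts, each shaped [n_quantiles][horizon]."""
    n_models = len(forecasts)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # equal weights by default
    n_quantiles = len(forecasts[0])
    horizon = len(forecasts[0][0])
    return [
        [sum(w * f[q][h] for w, f in zip(weights, forecasts)) for h in range(horizon)]
        for q in range(n_quantiles)
    ]


model_a = [[1.0, 2.0], [3.0, 4.0]]  # 2 quantile levels x 2 horizon steps
model_b = [[3.0, 4.0], [5.0, 6.0]]
combined = average_quantiles([model_a, model_b])
```

Because the weights are non-negative, averaging preserves the ordering of quantiles: if every model's lower quantile sits below its upper quantile, so does the ensemble's.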