Zero-Shot Foundation Models#

Author: Anthony Faustine, sambaiga@gmail.com

What you’ll build

Five zero-shot probabilistic forecasts on the same dataset, using five different foundation models:

Chronos-2 - Amazon’s 120M-parameter time-series transformer (21 quantiles).
TabICLv2 - A tabular foundation model that performs forecasting via in-context learning on engineered temporal features (9 quantiles).
Moirai 2.0 - Salesforce’s universal time-series transformer; quantile output via the uni2ts package (9 quantiles).
TimesFM 2.5 - Google’s 200M decoder-only transformer with a 16k context and a continuous quantile head (9 quantiles).
Lag-Llama - A decoder-only transformer using lag features; the first open-source TSFM (sample-based → 9 quantiles).

All five download pre-trained weights and produce quantile predictions without training on your data. We then benchmark them against a seasonal-naive baseline.

Prerequisites

01 - Getting Started (DataPipelineConfig, TwigaForecaster)
08 - Quantile Regression (quantile concepts, prediction intervals)
A GPU is helpful but not required (inference on CPU works, just slower)

Learning objectives

By the end of this notebook you will be able to:

Explain the paradigm shift: pre-trained, task-agnostic models vs. data-specific learning
Load and evaluate five zero-shot quantile forecasters side-by-side
Compare different foundation-model architectures (autoregressive, in-context tabular, universal transformer, lag-feature decoder)
Interpret when foundation models excel vs. when domain-specific fine-tuning wins
Visualise quantile fan charts and coverage plots from foundation models

1. The Foundation Model Paradigm#

Traditional approach (Notebooks 05-10):

Your data → Feature engineering → Train model → Evaluate

Model learns from your dataset only
Hyperparameter tuning required (5–100× slower)
Strong performance on YOUR domain IF you have enough data
Weak on new, unfamiliar patterns

Foundation model approach:

Pre-trained on 2M+ time series → Download weights → Forecast immediately

Model learned from internet-scale diversity (energy, traffic, retail, weather, …)
Zero training time - deploy in seconds
Works reasonably on most domains WITHOUT fine-tuning
May underperform on highly specialised data

Chronos-2 Architecture#

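Chronos-2 is an encoder-only transformer (closer to a T5 encoder than to GPT) that forecasts the whole horizon in one pass instead of decoding one value at a time:

Input: scale the lookback window and split it into patches, each embedded as one token
Encoder: transformer layers attend over the patch sequence to build a representation of the context
Output: 21 quantile values per forecast step, produced in parallel (τ = 0.01, 0.05, 0.1, …, 0.9, 0.95, 0.99)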

The 21 quantile levels are fixed by Chronos-2’s architecture - you cannot request arbitrary quantiles, but you get a rich probabilistic forecast “for free”.
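When a level that is not on the fixed grid is needed, one common workaround is to interpolate between the two neighbouring grid levels. The sketch below shows that idea with NumPy. The grid, the forecast values, and the helper name `approx_quantile` are illustrative placeholders, not Chronos-2 output or part of the Twiga API, and linear interpolation is only an approximation of the true quantile.

```python
import numpy as np

# Illustrative fixed grid and forecast values for ONE time step.
# These are placeholders, not Chronos-2's real levels or output.
levels = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
values = np.array([10.0, 12.0, 13.5, 14.5, 15.0, 15.8, 17.0, 18.5, 21.0])


def approx_quantile(tau: float, levels: np.ndarray, values: np.ndarray) -> float:
    """Linearly interpolate the forecast value at level tau from a fixed grid."""
    if not levels[0] <= tau <= levels[-1]:
        # Extrapolating beyond the outermost levels is unreliable, so refuse it.
        raise ValueError("tau lies outside the grid; extrapolation is not supported")
    return float(np.interp(tau, levels, values))


q25 = approx_quantile(0.25, levels, values)
q75 = approx_quantile(0.75, levels, values)
print(f"approximate 50% interval: [{q25:.2f}, {q75:.2f}]")
```

Interpolation cannot recover tail levels outside the grid, which is why the helper raises instead of extrapolating.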

When to use Chronos-2#

Use Case	Recommendation
Rapid prototyping	✓ Use Chronos-2 first - validate the problem in minutes
Small datasets (<1000 samples)	✓ Chronos-2 may outperform underfitted QR models
New domains (unfamiliar patterns)	✓ Chronos-2’s pre-training covers many scenarios
Production on limited data	✓ No fine-tuning risk, stable predictions
Large, domain-specific datasets	✗ Fine-tuned QR or NN models often win
Extreme latency constraints	✗ Chronos-2 inference takes roughly seconds per window (batching several windows per call helps)
Custom quantile levels	✗ You are locked to the 21 fixed quantiles

2. Setup#

import warnings

from great_tables import GT, md
from IPython.display import clear_output
from lets_plot import LetsPlot
import pandas as pd

from twiga import TwigaForecaster
from twiga.core.config import DataPipelineConfig, ExperimentConfig
from twiga.core.plot import plot_forecast_grid, plot_metrics_bar
from twiga.core.plot.gt import twiga_gt, twiga_report
from twiga.core.utils import configure, get_logger

LetsPlot.setup_html()
warnings.filterwarnings("ignore")

configure()
log = get_logger("tutorials")

Load data#

data = pd.read_parquet("../data/MLVS-PT.parquet")
data = data[["timestamp", "NetLoad(kW)"]].copy()
data["timestamp"] = pd.to_datetime(data["timestamp"])
data = data.drop_duplicates(subset="timestamp").reset_index(drop=True)
# Restrict to 2019-2020 for fast tutorial execution
data = data[(data["timestamp"] >= "2019-01-01") & (data["timestamp"] <= "2020-12-31")].reset_index(drop=True)

log.info("Shape: %s", data.shape)
log.info("Period: %s -> %s", data["timestamp"].min().date(), data["timestamp"].max().date())

2026-06-14 20:31:10 | INFO     | twiga.tutorials | Shape: (33553, 2)

2026-06-14 20:31:10 | INFO     | twiga.tutorials | Period: 2019-02-01 -> 2020-12-31

Train / test split#

train_df = data[data["timestamp"] < "2020-07-01"].reset_index(drop=True)
test_df = data[data["timestamp"] >= "2020-07-01"].reset_index(drop=True)

log.info(
    "train : %d rows  (%s -> %s)",
    len(train_df),
    train_df["timestamp"].min().date(),
    train_df["timestamp"].max().date(),
)
log.info(
    "test  : %d rows  (%s -> %s)",
    len(test_df),
    test_df["timestamp"].min().date(),
    test_df["timestamp"].max().date(),
)

2026-06-14 20:31:10 | INFO     | twiga.tutorials | train : 24766 rows  (2019-02-01 -> 2020-06-30)

2026-06-14 20:31:10 | INFO     | twiga.tutorials | test  : 8787 rows  (2020-07-01 -> 2020-12-31)

Shared pipeline and training config#

Note: Foundation models use the same DataPipelineConfig as learned models. The lookback_window_size is the context window each foundation model sees; forecast_horizon is the number of steps predicted.

data_config = DataPipelineConfig(
    target_feature="NetLoad(kW)",
    period="30min",
    latitude=32.371666,
    longitude=-16.274998,
    calendar_features=[],
    known_future_features=[],  # the zero-shot wrappers in this notebook ignore exogenous features
    forecast_horizon=48,  # 24 hours @ 30-min resolution
    lookback_window_size=48 * 7,  # one week of context
    window_stride=48,  # non-overlapping windows for honest evaluation
    input_scaler="standard",
    target_scaler="robust",
)

train_config = ExperimentConfig(
    project_name="twiga-tutorials",
)
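To make the windowing arithmetic concrete, the sketch below counts how many complete, non-overlapping 24-hour windows fit into the 8787 test rows logged above. The helper `count_eval_windows` is hypothetical and only does the arithmetic. It assumes the lookback context for the first window comes from data before the test set, which may not match how Twiga builds its windows internally.

```python
def count_eval_windows(n_rows: int, horizon: int, stride: int) -> int:
    """Number of complete forecast windows of length `horizon` spaced `stride` apart."""
    if n_rows < horizon:
        return 0
    return (n_rows - horizon) // stride + 1


# 8787 test rows, forecast_horizon=48, window_stride=48 (values from this notebook)
n_windows = count_eval_windows(n_rows=8787, horizon=48, stride=48)
n_forecast_rows = n_windows * 48
print(n_windows, n_forecast_rows)
```

Under these assumptions the test set yields 183 windows, that is 8784 forecast rows, with the last 3 rows left over because they do not fill a complete window.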

3. Zero-Shot Chronos-2 Foundation Model#

Key property: Zero-shot means fit() does not train.

The TwigaForecaster.fit() method:

Validates input dimensions
Loads the pre-trained Chronos-2 weights from HuggingFace
Returns immediately (no optimization loop)

All CV folds run through the same pre-trained model - there is no per-fold retraining.

from twiga.models.foundational.chronos2_model import Chronos2Config

# device="cuda" if you have a GPU; "cpu" works but is slower
chronos_config = Chronos2Config(device="cpu")

forecaster_chronos = TwigaForecaster(
    data_params=data_config,
    model_params=[chronos_config],
    cv_params=train_config,
)

# fit() downloads the model but does NOT train
forecaster_chronos.fit(train_df=train_df)
# clear_output()

log.info("Chronos-2 zero-shot model ready for inference.")

2026-06-14 20:31:10 | INFO     | twiga.forecaster.base | ─────────────────── Training chronos2 ────────────────────

2026-06-14 20:31:12 | INFO     | twiga.models.foundational.chronos2_model | Loading amazon/chronos-2 on device=cpu

`torch_dtype` is deprecated! Use `dtype` instead!

2026-06-14 20:31:13 | INFO     | twiga.models.foundational.chronos2_model | Chronos-2 ready  device=cpu  horizon=48

2026-06-14 20:31:13 | INFO     | twiga.forecaster.base | Training chronos2 complete  duration=0.04 min

2026-06-14 20:31:13 | INFO     | twiga.tutorials | Chronos-2 zero-shot model ready for inference.

Evaluate Chronos-2’s quantile forecast#

Chronos-2 outputs 21 quantiles. We evaluate it exactly like a trained QR model.

pred_chronos, metric_chronos = forecaster_chronos.evaluate_quantile_forecast(test_df=test_df)
clear_output()

log.info("Chronos-2 evaluation complete.")
log.info("Predictions shape: %s", pred_chronos.shape)
log.info("Quantile columns: %s", [c for c in pred_chronos.columns if c.startswith("q_")][:5])

2026-06-14 20:31:16 | INFO     | twiga.tutorials | Chronos-2 evaluation complete.

2026-06-14 20:31:16 | INFO     | twiga.tutorials | Predictions shape: (8784, 5)

2026-06-14 20:31:16 | INFO     | twiga.tutorials | Quantile columns: []

def get_metric_table(metric_df: pd.DataFrame, metric_cols: list | None = None):
    """Average each metric per model and render the result as a Twiga report table."""
    display_names = {
        "mae": "MAE",
        "corr": "Corr",
        "pinball": "PINBALL",
        "crps": "CRPS",
        "sharpness": "SHARPNESS",
        "calibration_error": "Cal-err",
    }
    if metric_cols is None:
        metric_cols = list(display_names)

    res = metric_df.groupby("Model")[metric_cols].mean().round(2).reset_index()
    res = res.rename(columns=display_names)

    # Only reference columns that are present in this table.
    metric_name = [display_names.get(c, c) for c in metric_cols]
    # Lower is better for the error metrics; higher is better for correlation.
    minimize_cols = [c for c in ["MAE", "PINBALL", "CRPS", "Cal-err"] if c in metric_name]
    maximize_cols = [c for c in ["Corr"] if c in metric_name]

    return twiga_report(res, metric_name, minimize_cols, maximize_cols)

get_metric_table(metric_chronos)

Model Performance
Metric comparison across all evaluated models

Model	MAE	Corr	PINBALL	CRPS	SHARPNESS	Cal-err
CHRONOS2	3.770	0.890	1.310	2.680	8.510	0.120

Twiga Forecast
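For readers who want to see what the PINBALL and calibration columns measure, here is a minimal NumPy sketch of the pinball (quantile) loss and of the empirical coverage of a prediction interval. The numbers are made up for illustration, and the function names are hypothetical. Twiga's own metric implementations may aggregate differently, for example by averaging over all quantile levels.

```python
import numpy as np


def pinball_loss(y: np.ndarray, q_pred: np.ndarray, tau: float) -> float:
    """Mean pinball loss of the predicted tau-quantile q_pred against observations y."""
    diff = y - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))


def empirical_coverage(y: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> float:
    """Fraction of observations that fall inside [lower, upper]."""
    return float(np.mean((y >= lower) & (y <= upper)))


# Toy data: four observations with predicted 10% and 90% quantiles.
y = np.array([10.0, 12.0, 9.0, 15.0])
q10 = np.array([8.0, 9.0, 9.5, 11.0])
q90 = np.array([13.0, 14.0, 12.0, 14.0])

loss_q10 = pinball_loss(y, q10, tau=0.1)
loss_q90 = pinball_loss(y, q90, tau=0.9)
coverage_80 = empirical_coverage(y, q10, q90)
print(loss_q10, loss_q90, coverage_80)
```

In this toy case the nominal 80% interval covers only half of the observations, so a calibration measure would flag the interval as too narrow.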

4. Zero-Shot TabICL Foundation Model#

TabICLv2 takes a different architectural route to zero-shot forecasting: rather than autoregressively decoding tokenised values, it engineers temporal features (lags, calendar context, etc.) and runs in-context learning with a tabular foundation model. The pre-training mixes synthetic and real tabular datasets, so it generalises across domains without per-task fitting.

Same Twiga interface - only the model swap differs.

9 quantiles by default: [0.1, 0.2, ..., 0.9]
Inference is ensembled across n_estimators members (4–8 is a good speed/quality trade-off on CPU)
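The "engineered temporal features" idea can be pictured as turning the series into a table with one row per timestamp, where lagged values and calendar fields become columns. The pandas sketch below is only an illustration of that reshaping. The helper `make_lag_table`, the chosen lags, and the calendar columns are assumptions, not the features TabICL or the Twiga wrapper actually build.

```python
import pandas as pd


def make_lag_table(series: pd.Series, lags: list[int]) -> pd.DataFrame:
    """Reshape a time-indexed series into a table of target, lag, and calendar columns."""
    table = pd.DataFrame({"target": series})
    for lag in lags:
        # shift(lag) aligns the value observed `lag` steps earlier with each row
        table[f"lag_{lag}"] = series.shift(lag)
    table["hour"] = series.index.hour
    table["dayofweek"] = series.index.dayofweek
    # The first max(lags) rows have no complete history, so drop them
    return table.dropna()


idx = pd.date_range("2020-01-01", periods=6, freq="30min")
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)
table = make_lag_table(series, lags=[1, 2])
print(table)
```

A tabular model can then treat the rows with known targets as in-context examples and predict the target for rows whose features are known but whose target is not.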

from twiga.models.foundational.tabicl_model import TabICLConfig

# device="cuda" if you have a GPU; "cpu" works but is slower
tabicl_config = TabICLConfig(device="cpu", n_estimators=4, max_context_length=48 * 30)

forecaster_tabicl = TwigaForecaster(
    data_params=data_config,
    model_params=[tabicl_config],
    cv_params=train_config,
)

# fit() instantiates the forecaster; weights are loaded lazily on first predict
try:
    forecaster_tabicl.fit(train_df=train_df)
    tabicl_available = True
    log.info("TabICL zero-shot model ready for inference.")
except ImportError as e:
    tabicl_available = False
    log.warning("TabICL unavailable (numpy<2.2 conflict with gluonts): %s", e)

2026-06-14 20:31:16 | INFO     | twiga.forecaster.base | ──────────────────── Training tabicl ─────────────────────

2026-06-14 20:31:16 | INFO     | twiga.models.foundational.tabicl_model | Fitting TabICL model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:16 | WARNING  | twiga.tutorials | TabICL unavailable (numpy<2.2 conflict with gluonts): tabicl library required for TabICLModel. Install with: pip install 'tabicl[forecast]>=2.1' (kept out of twiga[foundational] due to a numpy<2.2 pin transitively imposed by gluonts).

Evaluate TabICL’s quantile forecast#

TabICL produces 9 quantiles. We evaluate it through the same Twiga quantile-forecast pathway as Chronos-2.

import pandas as pd

if tabicl_available:
    pred_tabicl, metric_tabicl = forecaster_tabicl.evaluate_quantile_forecast(test_df=test_df)
    clear_output()
    log.info("TabICL evaluation complete.")
    log.info("Predictions shape: %s", pred_tabicl.shape)
else:
    pred_tabicl = pd.DataFrame()
    metric_tabicl = pd.DataFrame()
    log.info("TabICL skipped - not available in this environment.")

2026-06-14 20:31:16 | INFO     | twiga.tutorials | TabICL skipped - not available in this environment.

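from IPython.display import display

if tabicl_available:
    # A bare expression inside an if-block is not rendered by Jupyter, so display it explicitly
    display(get_metric_table(metric_tabicl))
else:
    log.info("TabICL skipped  -  no metrics table to display.")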

2026-06-14 20:31:16 | INFO     | twiga.tutorials | TabICL skipped  -  no metrics table to display.

5. Zero-Shot Moirai Foundation Model#

Moirai (Salesforce Research) is a universal time-series transformer trained on a massive multi-domain corpus. We use Moirai 2.0, which produces native quantile output. The same wrapper also supports Moirai 1.x and Moirai-MoE via the model_type config field.

Moirai is loaded through the uni2ts package and runs the GluonTS predictor pipeline under the hood; Twiga reshapes the output into the same (batch, n_quantiles, horizon) contract as the other foundation models.

from twiga.models.foundational.moirai_model import MoiraiConfig

try:
    moirai_config = MoiraiConfig(
        model_type="moirai2",
        size="small",
        device="cpu",
        frequency="30min",
    )
    forecaster_moirai = TwigaForecaster(
        data_params=data_config,
        model_params=[moirai_config],
        cv_params=train_config,
    )
    forecaster_moirai.fit(train_df=train_df)
    moirai_available = True
    log.info("Moirai zero-shot model ready for inference.")
except ImportError as e:
    moirai_available = False
    pred_moirai = pd.DataFrame()
    metric_moirai = pd.DataFrame()
    log.warning("Moirai unavailable (uni2ts not installed): %s", e)

2026-06-14 20:31:16 | INFO     | twiga.forecaster.base | ──────────────────── Training moirai ─────────────────────

2026-06-14 20:31:16 | INFO     | twiga.models.foundational.moirai_model | Fitting Moirai model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:16 | WARNING  | twiga.tutorials | Moirai unavailable (uni2ts not installed): uni2ts library required for MoiraiModel. Install with: pip install 'uni2ts>=1.2'

Evaluate Moirai’s quantile forecast#

Moirai 2.0 emits quantile outputs natively. Twiga reduces these to its standard 9-quantile grid and evaluates via the same quantile-forecast path as Chronos-2 and TabICL.

if moirai_available:
    pred_moirai, metric_moirai = forecaster_moirai.evaluate_quantile_forecast(test_df=test_df)
    clear_output()
    log.info("Moirai evaluation complete.")
    log.info("Predictions shape: %s", pred_moirai.shape)
else:
    log.info("Moirai skipped - not available in this environment.")

2026-06-14 20:31:16 | INFO     | twiga.tutorials | Moirai skipped - not available in this environment.

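from IPython.display import display

if moirai_available:
    # A bare expression inside an if-block is not rendered by Jupyter, so display it explicitly
    display(get_metric_table(metric_moirai))
else:
    log.info("Moirai skipped - no metrics table to display.")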

2026-06-14 20:31:16 | INFO     | twiga.tutorials | Moirai skipped - no metrics table to display.

6. Zero-Shot TimesFM Foundation Model#

TimesFM 2.5 (Google Research) is a 200M-parameter decoder-only transformer with a 16k-token context and a continuous quantile head. It decodes the forecast autoregressively, one patch of time steps at a time, and emits 9 quantile levels (0.1 … 0.9) natively.

TimesFM 2.5 is not on PyPI - install it from source: pip install "git+https://github.com/google-research/timesfm.git#egg=timesfm[torch]".

from twiga.models.foundational.timesfm_model import TimesFMConfig

try:
    timesfm_config = TimesFMConfig(device="cpu")
    forecaster_timesfm = TwigaForecaster(
        data_params=data_config,
        model_params=[timesfm_config],
        cv_params=train_config,
    )
    forecaster_timesfm.fit(train_df=train_df)
    timesfm_available = True
    log.info("TimesFM zero-shot model ready for inference.")
except ImportError as e:
    timesfm_available = False
    pred_timesfm = pd.DataFrame()
    metric_timesfm = pd.DataFrame()
    log.warning("TimesFM unavailable (timesfm not installed): %s", e)

2026-06-14 20:31:16 | INFO     | twiga.forecaster.base | ──────────────────── Training timesfm ────────────────────

2026-06-14 20:31:16 | INFO     | twiga.models.foundational.timesfm_model | Fitting TimesFM model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:16 | WARNING  | twiga.tutorials | TimesFM unavailable (timesfm not installed): timesfm library required for TimesFMModel. TimesFM 2.5 is not on PyPI; install with: pip install "git+https://github.com/google-research/timesfm.git#egg=timesfm[torch]"

Evaluate TimesFM’s quantile forecast#

TimesFM 2.5’s continuous quantile head yields 9 quantiles. We evaluate it through the same quantile-forecast path as the other foundation models.

if timesfm_available:
    pred_timesfm, metric_timesfm = forecaster_timesfm.evaluate_quantile_forecast(test_df=test_df)
    clear_output()
    log.info("TimesFM evaluation complete.")
    log.info("Predictions shape: %s", pred_timesfm.shape)
else:
    log.info("TimesFM skipped - not available in this environment.")

2026-06-14 20:31:16 | INFO     | twiga.tutorials | TimesFM skipped - not available in this environment.

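from IPython.display import display

if timesfm_available:
    # A bare expression inside an if-block is not rendered by Jupyter, so display it explicitly
    display(get_metric_table(metric_timesfm))
else:
    log.info("TimesFM skipped - no metrics table to display.")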

2026-06-14 20:31:16 | INFO     | twiga.tutorials | TimesFM skipped - no metrics table to display.

7. Zero-Shot Lag-Llama Foundation Model#

Lag-Llama was the first open-source time-series foundation model: a decoder-only transformer that ingests lag features rather than raw sequences. It is sample-based; we draw 100 samples per series and reduce them to the standard 9-quantile grid.

Lag-Llama is not published on PyPI, and its checkpoint is a separate download: install the Python package from GitHub, then download the weights into the Twiga model directory. Run these commands from the project root (twiga-forecast/):
pip install git+https://github.com/time-series-foundation-models/lag-llama.git
pip install "gluonts[torch]<=0.14.4" pytorch-lightning

huggingface-cli download time-series-foundation-models/Lag-Llama \
    lag-llama.ckpt --local-dir twiga/models/foundational/lag-llama
This places the checkpoint at twiga/models/foundational/lag-llama/lag-llama.ckpt. The wrapper’s default ckpt_path="lag-llama/lag-llama.ckpt" is resolved against the project root automatically, so it’s found regardless of where the notebook runs from.

from twiga.models.foundational.lag_llama_model import LagLlamaConfig

try:
    lagllama_config = LagLlamaConfig(
        ckpt_path="lag-llama/lag-llama.ckpt",
        device="cpu",
        num_samples=100,
    )
    forecaster_lagllama = TwigaForecaster(
        data_params=data_config,
        model_params=[lagllama_config],
        cv_params=train_config,
    )
    forecaster_lagllama.fit(train_df=train_df)
    lagllama_available = True
    log.info("Lag-Llama zero-shot model ready for inference.")
except (ImportError, FileNotFoundError) as e:
    lagllama_available = False
    pred_lagllama = pd.DataFrame()
    metric_lagllama = pd.DataFrame()
    log.warning("Lag-Llama unavailable: %s", e)

2026-06-14 20:31:16 | INFO     | twiga.forecaster.base | ─────────────────── Training lag_llama ───────────────────

2026-06-14 20:31:16 | INFO     | twiga.models.foundational.lag_llama_model | Fitting Lag-Llama model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:16 | WARNING  | twiga.tutorials | Lag-Llama unavailable: Lag-Llama checkpoint not found at lag-llama/lag-llama.ckpt. Download with: `huggingface-cli download time-series-foundation-models/Lag-Llama lag-llama.ckpt --local-dir lag-llama` and clone the repo into the same directory.

Evaluate Lag-Llama’s quantile forecast#

Lag-Llama emits probabilistic samples. Twiga reduces these to its standard 9-quantile grid and evaluates via the same quantile-forecast path as the other foundation models.
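The sample-to-quantile reduction can be sketched with NumPy as follows. This is not the wrapper's code: the input layout `(batch, n_samples, horizon)` and the helper name `samples_to_quantiles` are assumptions, and the random draws stand in for real Lag-Llama samples. The output follows the `(batch, n_quantiles, horizon)` contract described earlier in this notebook.

```python
import numpy as np


def samples_to_quantiles(samples: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Reduce (batch, n_samples, horizon) draws to (batch, n_quantiles, horizon)."""
    # np.quantile puts the quantile axis first: (n_quantiles, batch, horizon)
    q = np.quantile(samples, levels, axis=1)
    # Move it to position 1 to match the (batch, n_quantiles, horizon) layout
    return np.moveaxis(q, 0, 1)


rng = np.random.default_rng(0)
samples = rng.normal(size=(2, 100, 48))  # stand-in for 100 sample paths per series
levels = np.linspace(0.1, 0.9, 9)        # the 9-quantile grid 0.1, 0.2, ..., 0.9
quantiles = samples_to_quantiles(samples, levels)
print(quantiles.shape)
```

Because the quantiles are estimated from a finite number of draws, the outer levels are noisier than the median; drawing more samples reduces that noise at the cost of inference time.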

if lagllama_available:
    pred_lagllama, metric_lagllama = forecaster_lagllama.evaluate_quantile_forecast(test_df=test_df)
    clear_output()
    log.info("Lag-Llama evaluation complete.")
    log.info("Predictions shape: %s", pred_lagllama.shape)
else:
    log.info("Lag-Llama skipped - not available in this environment.")

2026-06-14 20:31:16 | INFO     | twiga.tutorials | Lag-Llama skipped - not available in this environment.

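from IPython.display import display

if lagllama_available:
    # A bare expression inside an if-block is not rendered by Jupyter, so display it explicitly
    display(get_metric_table(metric_lagllama))
else:
    log.info("Lag-Llama skipped - no metrics table to display.")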

2026-06-14 20:31:16 | INFO     | twiga.tutorials | Lag-Llama skipped - no metrics table to display.

8. Benchmark: Five foundation models vs a simple baseline#

Compare all five zero-shot foundation models against a seasonal-naive baseline on the same test set.

from twiga.models.baseline.seasonal_naive_model import SEASONALNAIVEConfig

seasonal_config = SEASONALNAIVEConfig(period="7D", freq="30min")

forecaster_seasonal = TwigaForecaster(
    data_params=data_config,
    model_params=[seasonal_config],
    cv_params=train_config,
)
forecaster_seasonal.fit(train_df=train_df)
clear_output()

pred_seasonal, metric_seasonal = forecaster_seasonal.evaluate_point_forecast(test_df=test_df)
clear_output()

get_metric_table(metric_seasonal, metric_cols=["mae", "corr"])

Model Performance
Metric comparison across all evaluated models

Model	MAE	Corr
SEASONAL_NAIVE	5.090	0.830

Twiga Forecast
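For intuition about what this baseline does, a seasonal-naive forecast simply repeats the values observed one season earlier. With a 7-day season at 30-minute resolution the season length is 7 × 48 = 336 steps. The function below is a minimal NumPy sketch of that idea, not the implementation behind SEASONALNAIVEConfig, which may handle edge cases differently.

```python
import numpy as np


def seasonal_naive(history: np.ndarray, horizon: int, season_length: int) -> np.ndarray:
    """Forecast by repeating the last full season of `history` over `horizon` steps."""
    if len(history) < season_length:
        raise ValueError("history must contain at least one full season")
    last_season = history[-season_length:]
    reps = -(-horizon // season_length)  # ceiling division
    return np.tile(last_season, reps)[:horizon]


season = 7 * 48  # one week at 30-minute resolution
history = np.arange(2 * season, dtype=float)  # toy series: 0, 1, 2, ...
forecast = seasonal_naive(history, horizon=48, season_length=season)
print(forecast[:3])
```

Each forecast step equals the observation exactly one week before it, which is why the baseline captures weekly and daily cycles but cannot react to recent level shifts.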

metrics_all = pd.concat(
    [
        df
        for df in [metric_chronos, metric_tabicl, metric_moirai, metric_timesfm, metric_lagllama, metric_seasonal]
        if not df.empty
    ],
    ignore_index=True,
)

res = (
    metrics_all.groupby("Model", sort=False)[["mae", "rmse", "corr", "wmape", "smape", "nbias"]]
    .mean()
    .round(4)
    .reset_index()
)
res = res.rename(
    columns={
        "mae": "MAE",
        "corr": "Corr",
        "wmape": "WMAPE",
        "smape": "SMAPE",
        "nbias": "NBIAS",
        "rmse": "RMSE",
    }
)

twiga_report(
    res,
    ["MAE", "Corr", "SMAPE", "RMSE"],
    ["MAE", "SMAPE", "RMSE"],
    ["Corr"],
)

Model Performance
Metric comparison across all evaluated models

Model	MAE	RMSE	Corr	WMAPE	SMAPE	NBIAS
CHRONOS2	3.772	5.066	0.891	10.2376	12.357	-0.0709
SEASONAL_NAIVE	5.088	6.724	0.828	13.7421	16.690	0.5737

Twiga Forecast
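One way to summarise the table above is a skill score: the relative reduction in error with respect to the baseline. The snippet below works it out for MAE using the two values from the table; `skill_score` is a hypothetical helper, not a Twiga function.

```python
def skill_score(metric_model: float, metric_baseline: float) -> float:
    """Relative error reduction versus a baseline (positive means the model is better)."""
    return 1.0 - metric_model / metric_baseline


mae_chronos = 3.772   # CHRONOS2 MAE from the table above
mae_seasonal = 5.088  # SEASONAL_NAIVE MAE from the table above
skill = skill_score(mae_chronos, mae_seasonal)
print(f"MAE skill vs seasonal-naive: {skill:.1%}")  # prints 25.9%
```

On this test period Chronos-2 therefore cuts the MAE by roughly a quarter relative to the seasonal-naive baseline, without any training.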

9. Key Insights#

Shared strengths (all five foundation models)#

No training time: zero-shot - fit() only loads pre-trained weights
Diverse pre-training: benefit from millions of cross-domain time series
Probabilistic by default: every model emits quantile output without extra work
Stable predictions: pre-trained weights don’t drift per fold or per dataset

Shared limitations#

Generic: no model has seen your specific operating conditions
Inference cost: all five are slow per window on CPU compared to a tuned QR-LightGBM; a GPU helps dramatically
Fixed (or near-fixed) quantile grids: Chronos-2 is fixed at 21 levels; TabICL and Moirai accept a quantile list; TimesFM emits a fixed 9-quantile head; Lag-Llama is sample-based, and this notebook reduces its samples to the same 9-quantile grid. None of them can resolve quantiles more finely than the grid it is evaluated on.
Covariate support: the Twiga wrappers in this notebook ignore exogenous features - they pass only the target series through

When learned (per-dataset) models win#

Large domain-specific datasets (> 10 k samples) with stable dynamics
Extreme precision requirements (financial trading, critical infrastructure)
Need custom quantile levels or strict interval shapes
Latency constraints (inference must be < 100 ms)

Phase 2: Fine-tuning#

Future Twiga releases will expose fine-tuning hooks for each foundation model:

Chronos2Config(device="cuda", fine_tune=True)   # planned
TabICLConfig(fine_tune=True)                    # FinetunedTabICLRegressor already exists upstream
MoiraiConfig(model_type="moirai", fine_tune=True)  # MoiraiFinetune supports Moirai 1.x / MoE
LagLlamaConfig(fine_tune=True)  # Lag-Llama exposes a PyTorch Lightning training loop upstream
# TimesFM 2.5 fine-tuning is out of scope for the wrapper; use the upstream examples/finetuning recipe

This combines the best of both worlds: transfer learning from a broad pre-training prior plus domain-specific adaptation.

10. Forecast Traces#

Visual comparison of predictions across the test set.

preds_all = pd.concat(
    [df for df in [pred_chronos, pred_tabicl, pred_moirai, pred_timesfm, pred_lagllama, pred_seasonal] if not df.empty],
    ignore_index=True,
)

p = plot_forecast_grid(
    preds_all,
    actual_col="Actual",
    forecast_col="forecast",
    model_col="Model",
    n_samples_per_model=7 * 48,  # first 7 days of test set
    y_label="Net Load (kW)",
    title="Chronos-2 vs TabICL vs Moirai vs TimesFM vs Lag-Llama vs Seasonal-Naive  -  first 7 days of test set",
    fig_width=1200,
)
p

11. Context-Length Sensitivity#

How much past does each foundation model actually need to see? We sweep the lookback window across two horizons:

Short-term (1-day ahead, h = 48 steps): contexts of 2 days · 1 week · 2 weeks · 1 month.
Long-term (1-week ahead, h = 336 steps): contexts of 1 week · 2 weeks · 1 month · 2 months · 3 months.

For each (model, context) pair we rebuild the DataPipelineConfig with a new lookback_window_size and forecast_horizon, then re-fit + re-evaluate against the same test_df. Each fit is zero-shot (no training), but CPU inference scales with horizon × context, so the long-term sweep takes meaningfully longer than the short-term one.

⚠️ TabICL’s default max_context_length is 4096; the long-term sweep includes a 3-month context (4320 steps), so we bump that cap in the config for that sweep.
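The context lengths in steps follow directly from the 30-minute resolution (48 steps per day). The short sketch below converts the long-term contexts to steps and checks them against the default cap of 4096. It treats a month as 30 days, which matches the 3-month figure of 4320 steps quoted above; the variable names are illustrative.

```python
STEPS_PER_DAY = 48          # 30-minute resolution
DEFAULT_MAX_CONTEXT = 4096  # TabICL default cap quoted above

# Long-term sweep contexts, with one month approximated as 30 days
long_context_days = {"1 week": 7, "2 weeks": 14, "1 month": 30, "2 months": 60, "3 months": 90}

context_steps = {label: days * STEPS_PER_DAY for label, days in long_context_days.items()}
over_cap = {label: steps for label, steps in context_steps.items() if steps > DEFAULT_MAX_CONTEXT}
required_cap = max(context_steps.values())

print(context_steps)
print("exceeds default cap:", over_cap)
print("cap needed for the sweep:", required_cap)
```

Only the 3-month context exceeds the default, so raising the cap to at least 4320 for the long-term sweep is enough.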

from twiga.models.foundational.chronos2_model import Chronos2Config
from twiga.models.foundational.lag_llama_model import LagLlamaConfig
from twiga.models.foundational.moirai_model import MoiraiConfig
from twiga.models.foundational.tabicl_model import TabICLConfig
from twiga.models.foundational.timesfm_model import TimesFMConfig


def make_model_configs(*, tabicl_max_context: int = 4096) -> list[tuple[str, object]]:
    """Build a fresh set of (display_name, config) pairs for one sweep iteration."""
    return [
        ("Chronos-2", Chronos2Config(device="cpu")),
        (
            "TabICL",
            TabICLConfig(device="cpu", n_estimators=4, max_context_length=tabicl_max_context),
        ),
        (
            "Moirai-2",
            MoiraiConfig(model_type="moirai2", size="small", device="cpu", frequency="30min"),
        ),
        ("TimesFM", TimesFMConfig(device="cpu")),
        ("Lag-Llama", LagLlamaConfig(device="cpu", num_samples=100)),
    ]


def run_context_sweep(
    *,
    horizon: int,
    contexts: list[tuple[str, int]],
    train_df: pd.DataFrame,
    test_df: pd.DataFrame,
    tabicl_max_context: int = 4096,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Evaluate each foundation model under each context length for a fixed horizon.

    Returns ``(metrics_df, preds_df)`` annotated with ``context_label``,
    ``context_steps`` (in 30-min steps) and ``model_label`` columns ready to plot.
    """
    base_dc_kwargs = {
        "target_feature": "NetLoad(kW)",
        "period": "30min",
        "latitude": 32.371666,
        "longitude": -16.274998,
        "calendar_features": [],
        "known_future_features": [],
        "input_scaler": "standard",
        "target_scaler": "robust",
    }

    metric_frames, pred_frames = [], []
    for ctx_label, ctx_steps in contexts:
        log.info("Context = %s (%d steps), horizon = %d", ctx_label, ctx_steps, horizon)
        dc = DataPipelineConfig(
            **base_dc_kwargs,
            forecast_horizon=horizon,
            lookback_window_size=ctx_steps,
            window_stride=horizon,  # non-overlapping eval windows
        )
        for model_label, model_config in make_model_configs(tabicl_max_context=tabicl_max_context):
            log.info("  %s …", model_label)
            try:
                forecaster = TwigaForecaster(data_params=dc, model_params=[model_config], cv_params=train_config)
                forecaster.fit(train_df=train_df)
                preds, metrics = forecaster.evaluate_quantile_forecast(test_df=test_df)
                clear_output()
                metrics = metrics.assign(context_label=ctx_label, context_steps=ctx_steps, model_label=model_label)
                preds = preds.assign(context_label=ctx_label, context_steps=ctx_steps, model_label=model_label)
                metric_frames.append(metrics)
                pred_frames.append(preds)
            except (ImportError, FileNotFoundError) as e:
                log.warning("  Skipping %s (unavailable): %s", model_label, e)

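    if not metric_frames:
        # pd.concat raises an opaque error on an empty list, so fail with a clear message
        raise RuntimeError("No foundation model could be loaded; nothing to evaluate.")

    return (
        pd.concat(metric_frames, ignore_index=True),
        pd.concat(pred_frames, ignore_index=True),
    )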

11.1 Short-term horizon - 1 day ahead (h = 48)#

Sweep contexts: 2 days · 1 week · 2 weeks · 1 month.

short_contexts = [
    ("2 days", 48 * 2),
    ("1 week", 48 * 7),
    ("2 weeks", 48 * 14),
    ("1 month", 48 * 30),
]

metrics_short, preds_short = run_context_sweep(
    horizon=48,
    contexts=short_contexts,
    train_df=train_df,
    test_df=test_df,
)
log.info(
    "Short-term sweep complete: %d (model, context) runs",
    metrics_short.groupby(["model_label", "context_label"]).ngroups,
)

2026-06-14 20:31:31 | INFO     | twiga.tutorials |   TabICL …

2026-06-14 20:31:31 | INFO     | twiga.forecaster.base | ──────────────────── Training tabicl ─────────────────────

2026-06-14 20:31:31 | INFO     | twiga.models.foundational.tabicl_model | Fitting TabICL model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:31 | WARNING  | twiga.tutorials |   Skipping TabICL (unavailable): tabicl library required for TabICLModel. Install with: pip install 'tabicl[forecast]>=2.1' (kept out of twiga[foundational] due to a numpy<2.2 pin transitively imposed by gluonts).

2026-06-14 20:31:31 | INFO     | twiga.tutorials |   Moirai-2 …

2026-06-14 20:31:31 | INFO     | twiga.forecaster.base | ──────────────────── Training moirai ─────────────────────

2026-06-14 20:31:31 | INFO     | twiga.models.foundational.moirai_model | Fitting Moirai model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:31 | WARNING  | twiga.tutorials |   Skipping Moirai-2 (unavailable): uni2ts library required for MoiraiModel. Install with: pip install 'uni2ts>=1.2'

2026-06-14 20:31:31 | INFO     | twiga.tutorials |   TimesFM …

2026-06-14 20:31:31 | INFO     | twiga.forecaster.base | ──────────────────── Training timesfm ────────────────────

2026-06-14 20:31:31 | INFO     | twiga.models.foundational.timesfm_model | Fitting TimesFM model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:31 | WARNING  | twiga.tutorials |   Skipping TimesFM (unavailable): timesfm library required for TimesFMModel. TimesFM 2.5 is not on PyPI; install with: pip install "git+https://github.com/google-research/timesfm.git#egg=timesfm[torch]"

2026-06-14 20:31:31 | INFO     | twiga.tutorials |   Lag-Llama …

2026-06-14 20:31:31 | INFO     | twiga.forecaster.base | ─────────────────── Training lag_llama ───────────────────

2026-06-14 20:31:31 | INFO     | twiga.models.foundational.lag_llama_model | Fitting Lag-Llama model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:31 | INFO     | twiga.models.foundational.lag_llama_model | Auto-bumping Lag-Llama context_length 512 → 1440 to fit data lookback

2026-06-14 20:31:31 | WARNING  | twiga.tutorials |   Skipping Lag-Llama (unavailable): Lag-Llama checkpoint not found at lag-llama/lag-llama.ckpt. Download with: `huggingface-cli download time-series-foundation-models/Lag-Llama lag-llama.ckpt --local-dir lag-llama` and clone the repo into the same directory.

2026-06-14 20:31:31 | INFO     | twiga.tutorials | Short-term sweep complete: 4 (model, context) runs

from lets_plot import (
    aes,
    element_blank,
    facet_grid,
    facet_wrap,
    geom_line,
    geom_point,
    ggplot,
    ggsize,
    ggtitle,
    labs,
    scale_x_continuous,
    theme,
)


def plot_context_metrics(metrics_df: pd.DataFrame, horizon_label: str):
    """Faceted line chart of mae / crps / pinball / calibration_error vs context length (in days)."""
    metric_cols = ["mae", "crps", "pinball", "calibration_error"]
    agg = (
        metrics_df.groupby(["model_label", "context_label", "context_steps"], sort=False)[metric_cols]
        .mean()
        .reset_index()
    )
    agg["context_days"] = agg["context_steps"] / 48.0
    long = agg.melt(
        id_vars=["model_label", "context_label", "context_steps", "context_days"],
        value_vars=metric_cols,
        var_name="metric",
        value_name="value",
    )
    return (
        ggplot(long, aes(x="context_days", y="value", color="model_label"))
        + geom_line(size=1)
        + geom_point(size=2.5)
        + facet_wrap("metric", ncol=2, scales="free_y")
        + labs(
            x="Context length (days)",
            y="Metric value",
            color="Model",
            title=f"Context-length sensitivity  -  {horizon_label}",
        )
        + ggsize(900, 600)
    )


plot_context_metrics(metrics_short, "1-day horizon")
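
For readers who want to see what two of the plotted metrics measure, here is a minimal NumPy sketch of the pinball loss and of an interval-coverage gap. The function names are hypothetical, and the exact definitions twiga uses for `pinball` and `calibration_error` may differ (for instance in how quantile levels are averaged), so treat this as an illustration of the concepts only.

```python
import numpy as np


def pinball_loss(y_true, y_pred_q, q):
    """Mean pinball (quantile) loss for one quantile level q in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred_q = np.asarray(y_pred_q, dtype=float)
    diff = y_true - y_pred_q
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


def coverage_gap(y_true, lower, upper, nominal):
    """Empirical coverage of [lower, upper] minus the nominal level.

    Negative values mean the interval is too narrow (under-coverage).
    This is an assumed, simplified stand-in for a calibration error.
    """
    y_true = np.asarray(y_true, dtype=float)
    covered = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(np.mean(covered) - nominal)
```

A lower pinball loss is better at every quantile level, while the coverage gap should sit close to zero: an 80% interval that captures only 75% of the actuals has a gap of about -0.05.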

def build_trace_df(preds_df: pd.DataFrame, n_days: int = 2) -> pd.DataFrame:
    """Take the first ``n_days * 48`` rows of each (model, context) trace and reshape for plotting."""
    n_keep = n_days * 48
    sub = (
        preds_df.sort_values(["model_label", "context_label", "timestamp"])
        .groupby(["model_label", "context_label"], sort=False, group_keys=False)
        .head(n_keep)
        .reset_index(drop=True)
    )
    actual = (
        sub[["model_label", "context_label", "context_steps", "timestamp", "Actual"]]
        .rename(columns={"Actual": "value"})
        .assign(series="Actual")
    )
    forecast = (
        sub[["model_label", "context_label", "context_steps", "timestamp", "forecast"]]
        .rename(columns={"forecast": "value"})
        .assign(series="Forecast (median)")
    )
    return pd.concat([actual, forecast], ignore_index=True)


def plot_context_traces(preds_df: pd.DataFrame, horizon_label: str, n_days: int = 2):
    """Grid of actual vs. median-forecast traces: one row per model, one column per context length."""
    long = build_trace_df(preds_df, n_days=n_days)
    # Order the context facets by length (shortest first) instead of alphabetically
    ctx_order = preds_df.drop_duplicates("context_label").sort_values("context_steps")["context_label"].tolist()
    long["context_label"] = pd.Categorical(long["context_label"], categories=ctx_order, ordered=True)
    return (
        ggplot(long, aes(x="timestamp", y="value", color="series"))
        + geom_line(size=0.7)
        + facet_grid(y="model_label", x="context_label")
        + labs(
            x="Time",
            y="Net Load (kW)",
            color="Series",
            title=f"Forecast traces  -  {horizon_label} (first {n_days} days of test set)",
        )
        + ggsize(1200, 650)
    )


plot_context_traces(preds_short, "1-day horizon", n_days=3)

11.2 Long-term horizon - 1 week ahead (h = 336)

Sweep contexts: 1 week · 2 weeks · 1 month · 2 months · 3 months. The 3-month context is 48 × 90 = 4320 steps, so we raise TabICL’s max_context_length to 5000 (via tabicl_max_context=5000) to make it fit.

long_contexts = [
    ("1 week", 48 * 7),
    ("2 weeks", 48 * 14),
    ("1 month", 48 * 30),
    ("2 months", 48 * 60),
    ("3 months", 48 * 90),
]

metrics_long, preds_long = run_context_sweep(
    horizon=48 * 7,
    contexts=long_contexts,
    train_df=train_df,
    test_df=test_df,
    tabicl_max_context=5000,
)
log.info(
    "Long-term sweep complete: %d (model, context) runs", metrics_long.groupby(["model_label", "context_label"]).ngroups
)

2026-06-14 20:31:42 | INFO     | twiga.tutorials |   TabICL …

2026-06-14 20:31:42 | INFO     | twiga.forecaster.base | ──────────────────── Training tabicl ─────────────────────

2026-06-14 20:31:42 | INFO     | twiga.models.foundational.tabicl_model | Fitting TabICL model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:42 | WARNING  | twiga.tutorials |   Skipping TabICL (unavailable): tabicl library required for TabICLModel. Install with: pip install 'tabicl[forecast]>=2.1' (kept out of twiga[foundational] due to a numpy<2.2 pin transitively imposed by gluonts).

2026-06-14 20:31:42 | INFO     | twiga.tutorials |   Moirai-2 …

2026-06-14 20:31:43 | INFO     | twiga.forecaster.base | ──────────────────── Training moirai ─────────────────────

2026-06-14 20:31:43 | INFO     | twiga.models.foundational.moirai_model | Fitting Moirai model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:43 | WARNING  | twiga.tutorials |   Skipping Moirai-2 (unavailable): uni2ts library required for MoiraiModel. Install with: pip install 'uni2ts>=1.2'

2026-06-14 20:31:43 | INFO     | twiga.tutorials |   TimesFM …

2026-06-14 20:31:43 | INFO     | twiga.forecaster.base | ──────────────────── Training timesfm ────────────────────

2026-06-14 20:31:43 | INFO     | twiga.models.foundational.timesfm_model | Fitting TimesFM model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:43 | WARNING  | twiga.tutorials |   Skipping TimesFM (unavailable): timesfm library required for TimesFMModel. TimesFM 2.5 is not on PyPI; install with: pip install "git+https://github.com/google-research/timesfm.git#egg=timesfm[torch]"

2026-06-14 20:31:43 | INFO     | twiga.tutorials |   Lag-Llama …

2026-06-14 20:31:43 | INFO     | twiga.forecaster.base | ─────────────────── Training lag_llama ───────────────────

2026-06-14 20:31:43 | INFO     | twiga.models.foundational.lag_llama_model | Fitting Lag-Llama model (zero-shot, no training just loading pre-trained weights)

2026-06-14 20:31:43 | INFO     | twiga.models.foundational.lag_llama_model | Auto-bumping Lag-Llama context_length 512 → 4320 to fit data lookback

2026-06-14 20:31:43 | WARNING  | twiga.tutorials |   Skipping Lag-Llama (unavailable): Lag-Llama checkpoint not found at lag-llama/lag-llama.ckpt. Download with: `huggingface-cli download time-series-foundation-models/Lag-Llama lag-llama.ckpt --local-dir lag-llama` and clone the repo into the same directory.

2026-06-14 20:31:43 | INFO     | twiga.tutorials | Long-term sweep complete: 5 (model, context) runs
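
The context arithmetic behind this sweep can be checked with a small sketch. It assumes 48 steps per day (half-hourly data), as the `48 * days` expressions above imply; the helper names `context_steps` and `contexts_exceeding` are hypothetical and not part of twiga.

```python
STEPS_PER_DAY = 48  # assumption: half-hourly data, as implied by 48 * days above

# Context lengths expressed in days, mirroring the long-term sweep
LONG_CONTEXT_DAYS = [
    ("1 week", 7),
    ("2 weeks", 14),
    ("1 month", 30),
    ("2 months", 60),
    ("3 months", 90),
]


def context_steps(days, steps_per_day=STEPS_PER_DAY):
    """Convert a context length in days to a number of time steps."""
    return days * steps_per_day


def contexts_exceeding(contexts_days, max_context):
    """Return the labels of contexts that do not fit into max_context steps."""
    return [label for label, days in contexts_days if context_steps(days) > max_context]


# The 3-month context needs 4320 steps, which fits under a 5000-step limit
print(context_steps(90))
print(contexts_exceeding(LONG_CONTEXT_DAYS, max_context=5000))
```

Running the same check with a smaller limit shows which contexts would not fit and would therefore need a higher cap.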

plot_context_metrics(metrics_long, "1-week horizon")

plot_context_traces(preds_long, "1-week horizon", n_days=3)

11.3 Reading the plots

A few things to look for:

Where does each model’s MAE curve flatten? That is the “enough context” point for that model: adding more history beyond it no longer buys accuracy.
Does CRPS improve with longer context, or get worse? A worsening trend often means the model is being distracted by stale dynamics.
Calibration vs. context: longer context can sometimes tighten the predictive distribution so far that the intervals stop covering the actuals. The calibration-error panel is the first place this shows up.
Trace panels: look for systematic phase shifts or amplitude misses that disappear at a particular context length. That is usually the point where the window spans enough history for the model to pick up the weekly seasonality.
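
The “where does the curve flatten” question can also be answered numerically. The sketch below assumes a metrics frame with the `model_label`, `context_steps`, and `mae` columns used by `plot_context_metrics`; the helper name `find_flattening_context` and the 2% tolerance are hypothetical choices, not part of twiga.

```python
import pandas as pd


def find_flattening_context(metrics_df, metric="mae", rel_tol=0.02):
    """Per model, return the shortest context after which the metric stops improving.

    "Stops improving" means that moving to the next longer context reduces the
    metric by less than rel_tol (relative), or makes it worse. If every step
    still improves by at least rel_tol, the longest context is returned.
    """
    agg = (
        metrics_df.groupby(["model_label", "context_steps"], as_index=False)[metric]
        .mean()
        .sort_values(["model_label", "context_steps"])
    )
    result = {}
    for model, grp in agg.groupby("model_label", sort=False):
        steps = grp["context_steps"].to_list()
        vals = grp[metric].to_list()
        chosen = steps[-1]
        for i in range(len(vals) - 1):
            # Relative gain from extending the context by one sweep step
            gain = (vals[i] - vals[i + 1]) / abs(vals[i]) if vals[i] != 0 else 0.0
            if gain < rel_tol:
                chosen = steps[i]
                break
        result[model] = chosen
    return result
```

A call such as `find_flattening_context(metrics_long)` would return one context length (in steps) per model. The tolerance is a judgment call: a stricter value pushes the reported point toward longer contexts.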

What’s next?

10 - Conformal Prediction: Add coverage guarantees to any forecast
Ensemble: Combine Chronos-2 + TabICL + Moirai + TimesFM + Lag-Llama + QR-LightGBM for robustness
Fine-tuning (Phase 2, coming soon): Train foundation models on your data while retaining the pre-training prior
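
As a starting point for the ensemble idea, one simple option is to average the quantile forecasts of several models level by level. The sketch below is a generic NumPy illustration, not a twiga API: the function name `average_quantile_forecasts` is hypothetical, and it assumes every input array has shape (n_steps, n_quantiles) on the same quantile levels. Since the models in this notebook emit different grids (21 quantiles for Chronos-2, 9 for the others), the forecasts would first need to be reduced to their shared levels.

```python
import numpy as np


def average_quantile_forecasts(forecasts, weights=None):
    """Combine quantile forecasts by (weighted) averaging at each quantile level.

    forecasts: sequence of arrays, each of shape (n_steps, n_quantiles),
               all defined on the same quantile levels in ascending order.
    weights:   optional per-model weights; normalised to sum to 1.
    """
    stacked = np.stack([np.asarray(f, dtype=float) for f in forecasts], axis=0)
    n_models = stacked.shape[0]
    if weights is None:
        weights = np.full(n_models, 1.0 / n_models)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Weighted sum over the model axis -> shape (n_steps, n_quantiles)
    combined = np.tensordot(weights, stacked, axes=1)
    # Sorting along the quantile axis guards against quantile crossing
    return np.sort(combined, axis=1)
```

Equal weights are a reasonable default; weights based on validation-set pinball loss are a common refinement, but they have to be estimated on data that is not used for the final evaluation.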