Quantile Regression#
What you’ll build
Prediction-interval forecasters using QR-LightGBM, QR-XGBoost, and FPQR-LightGBM, evaluated on PICP, NMPI, and Winkler score - giving you calibrated uncertainty estimates instead of point predictions.
Prerequisites
01 - Getting Started (DataPipelineConfig, ForecasterConfig, TwigaForecaster.fit)
05 - ML Point Forecasting (gradient-boosted trees)
07 - Neural Networks (helpful but optional)
Python: basic statistics, quantile concepts
Learning objectives
By the end of this notebook you will be able to:
Explain the pinball loss and why it trains models to predict specific quantiles
Configure QRLIGHTGBMConfig, QRXGBOOSTConfig, and FPQRLIGHTGBMConfig with custom quantile grids
Evaluate prediction intervals using PICP, NMPI, and Winkler score
Compare fixed-grid QR vs. FPQR and choose the right approach for your use case
Visualise quantile fans and interval coverage plots to diagnose calibration issues
1. Motivation: Why a Single Number Is Not Enough#
A point forecast gives one number per time step: the model’s best guess for net load at 14:30 tomorrow.
For many operational decisions that single number is not enough.
Energy dispatch example
A grid operator scheduling spinning reserves needs to know not just the expected net load but the range of plausible outcomes. If net load could realistically be 15 % higher than the median, reserves must cover that upside. Ignoring uncertainty leads to either:
under-reservation: system reliability at risk, or
over-reservation: unnecessarily expensive.
Prediction intervals address this by providing a lower and upper bound that should contain the true value with a specified probability - e.g. a 90 % prediction interval should cover the actual observation 90 % of the time over many forecasts.
Quantile Regression (QR) is one of the most practical approaches:
Train separate models (or a single multi-output model) for different quantile levels, e.g. q=0.05, 0.5, 0.95.
The q=0.05 output is the lower bound and q=0.95 the upper bound for a 90 % interval.
No distributional assumption is required - the model learns the conditional quantiles directly from data.
Conformal post-processing can be applied to provide coverage guarantees.
Key concept - quantile regression and pinball loss
A standard regression model minimises the mean squared error (MSE), which targets the conditional mean of the output. A quantile regression model instead minimises the pinball loss (also called quantile loss) for a chosen quantile level τ ∈ (0, 1):
L(y, q, τ) = τ · max(y − q, 0) + (1 − τ) · max(q − y, 0)

Concretely:
When τ = 0.5, the pinball loss is equivalent to mean absolute error - the model targets the median.
When τ = 0.05, over-predictions are penalised 19× more than under-predictions - the model learns the 5th percentile of the conditional distribution.
When τ = 0.95, under-predictions are penalised 19× more - the model learns the 95th percentile.
The pair (q₀.₀₅, q₀.₉₅) forms a 90 % prediction interval. No distributional assumption (Gaussian, etc.) is required - the model learns the quantile structure directly from data.
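The asymmetry is easy to verify numerically. Below is a minimal NumPy sketch of the pinball loss (an illustrative implementation, not Twiga's internal one) showing the 19× penalty ratio at τ = 0.05:

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Pinball loss: tau * max(y - q, 0) + (1 - tau) * max(q - y, 0)."""
    y, q = np.asarray(y, dtype=float), np.asarray(q, dtype=float)
    return tau * np.maximum(y - q, 0.0) + (1.0 - tau) * np.maximum(q - y, 0.0)

# tau = 0.5: symmetric — equivalent to half the absolute error (median regression)
print(pinball_loss(10.0, 8.0, 0.5))   # 1.0

# tau = 0.05: over-predicting by 2 costs 19x more than under-predicting by 2
print(pinball_loss(8.0, 10.0, 0.05))  # 0.95 * 2 = 1.9  (over-prediction)
print(pinball_loss(10.0, 8.0, 0.05))  # 0.05 * 2 = 0.1  (under-prediction)
```

Minimising this loss over many samples pushes the prediction q toward the τ-th conditional quantile of y.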
Key concept - prediction interval evaluation metrics
Three complementary metrics assess whether a prediction interval is both reliable and sharp:
| Metric | Full name | Formula (sketch) | Ideal value |
|---|---|---|---|
| PICP | Prediction Interval Coverage Probability | fraction of actuals inside [lower, upper] | ≥ 1 − α (e.g. ≥ 0.90) |
| NMPI | Normalised Mean Prediction Interval | mean(upper − lower) / range(y) | as small as possible |
| Winkler score | Winkler (1972) | width + penalty for misses | lower is better |
PICP measures reliability: is the interval wide enough to contain the true value at the claimed frequency? A 90 % interval that only captures 70 % of actuals is miscalibrated.
NMPI measures sharpness: are the bounds tight? An interval covering ±∞ has perfect PICP but is useless in practice.
Winkler score synthesises both: it equals the interval width when the actual is inside, and adds a large penalty when the actual falls outside. Lower is better, and it rewards narrow-but-accurate intervals over wide-but-safe ones.
Raw QR bounds carry no formal coverage guarantee - empirical PICP may fall below 1 − α. For a guarantee, apply conformal calibration (NB10).
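All three metrics can be sketched in a few lines of NumPy. These are illustrative implementations to make the definitions concrete, not Twiga's own metric functions:

```python
import numpy as np

def picp(y, lower, upper):
    """Prediction Interval Coverage Probability: fraction of actuals inside the bounds."""
    return float(np.mean((y >= lower) & (y <= upper)))

def nmpi(y, lower, upper):
    """Normalised Mean Prediction Interval: mean width divided by the range of actuals."""
    return float(np.mean(upper - lower) / (np.max(y) - np.min(y)))

def winkler(y, lower, upper, alpha=0.1):
    """Winkler (1972) score: interval width, plus 2/alpha times the miss distance."""
    width = upper - lower
    below = (lower - y) * (y < lower)   # miss distance when actual is below the band
    above = (y - upper) * (y > upper)   # miss distance when actual is above the band
    return float(np.mean(width + (2.0 / alpha) * (below + above)))

y  = np.array([1.0, 2.0, 3.0, 10.0])
lo = np.array([0.0, 1.0, 2.0, 3.0])
hi = np.array([2.0, 3.0, 4.0, 5.0])
print(picp(y, lo, hi))     # 0.75 — the last actual falls above its interval
print(nmpi(y, lo, hi))     # 2 / 9 ≈ 0.222
print(winkler(y, lo, hi))  # 27.0 — the miss at y=10 adds a 2/alpha penalty
```

Note how the single miss dominates the Winkler score: a narrow interval only pays off if it actually covers the observations.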
Key concept - FPQR vs fixed-grid QR
Standard quantile regression uses a fixed grid of τ values specified at training time (e.g. 0.05, 0.10, …, 0.95). The model learns a separate output for each τ - an independent pinball loss is minimised for each level.
Flexible Proportional Quantile Regression (FPQR) takes a different approach: a small QuantileProposal network learns to propose its own quantile grid adaptively, per sample and per horizon step. This has two practical consequences:
Coverage shape adapts to the data - if the conditional distribution is skewed or multimodal, FPQR can allocate more quantile mass where uncertainty is highest.
Point forecast quality improves - instead of reading off the median from a fixed grid, FPQR computes a weighted expectation ∑ τ̂ᵢ · qᵢ, which is a smoother and often more accurate point estimate.
The trade-off: fixed QR lets you audit specific quantile levels (e.g. “what is the 5th percentile?”) directly. FPQR returns an averaged grid that is harder to interpret at a specific level but better captures complex uncertainty shapes.
When to prefer FPQR: signals with skewed residuals (solar irradiance, electricity prices), large datasets, or when interval shape matters more than auditability of specific quantile levels.
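The weighted-expectation point forecast mentioned above can be sketched as follows. This is one plausible reading of ∑ τ̂ᵢ · qᵢ, assuming the proposal network's scores are normalised to weights that sum to 1 (the softmax here is an assumption; Twiga's actual FPQR head may normalise differently):

```python
import numpy as np

# Hypothetical outputs for one sample / horizon step: proposed quantile values
quantile_values = np.array([80.0, 95.0, 100.0, 104.0, 120.0])

# Raw proposal scores -> normalised weights via softmax (an assumption here;
# the key property is only that the weights are non-negative and sum to 1)
raw_scores = np.array([0.2, 1.0, 2.0, 1.0, 0.2])
weights = np.exp(raw_scores) / np.exp(raw_scores).sum()

# Weighted expectation over the proposed quantiles — a smoother point
# estimate than reading off the median of a fixed grid
point_forecast = float(weights @ quantile_values)
print(round(point_forecast, 2))
```

Because the result is a convex combination, it always lies between the lowest and highest proposed quantile, but it shifts toward regions where the network concentrates proposal mass.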
2. Setup#
import warnings
from great_tables import GT
from lets_plot import LetsPlot
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler
from twiga.core.plot.gt import twiga_gt, twiga_report
LetsPlot.setup_html()
from twiga.core.config import DataPipelineConfig, ForecasterConfig
from twiga.core.plot import (
plot_forecast,
plot_forecast_grid,
plot_metrics_bar,
plot_reliability_diagram,
)
from twiga.core.utils import configure, get_logger
warnings.filterwarnings("ignore")
configure()
log = get_logger("tutorials")
Load data#
data = pd.read_parquet("../data/MLVS-PT.parquet")
data = data[["timestamp", "NetLoad(kW)", "Ghi", "Temperature"]]
data["timestamp"] = pd.to_datetime(data["timestamp"])
data = data.drop_duplicates(subset="timestamp").reset_index(drop=True)
# Restrict to 2019-2020 to keep tutorial execution fast
data = data[(data["timestamp"] >= "2019-01-01") & (data["timestamp"] <= "2020-12-31")].reset_index(drop=True)
log.info("Shape: %s", data.shape)
twiga_gt(GT(data.head().round(2)))
Train / val / test splits#
from great_tables import GT, md
from twiga.core.plot.gt import twiga_gt
splits_df = pd.DataFrame(
{
"Split": ["Train", "Validation", "Test"],
"Period": ["before 2020-01-01", "2020-01-01 – 2020-06-30", "2020-07-01 onwards"],
"Purpose": ["Model learning", "Early-stopping / overfitting guard", "Final honest evaluation"],
}
)
twiga_gt(
GT(splits_df)
.tab_header(title=md("**Data Splits**"), subtitle="Chronological — no shuffling, no overlap")
.cols_label(**{c: md(f"**{c}**") for c in splits_df.columns})
.tab_source_note("Twiga Forecast"),
n_rows=len(splits_df),
)
train_df = data[data["timestamp"] < "2020-01-01"].reset_index(drop=True)
val_df = data[(data["timestamp"] >= "2020-01-01") & (data["timestamp"] < "2020-07-01")].reset_index(drop=True)
test_df = data[data["timestamp"] >= "2020-07-01"].reset_index(drop=True)
log.info(
f"train : {train_df.shape[0]:,} rows ({train_df['timestamp'].min().date()} → {train_df['timestamp'].max().date()})"
)
log.info(f"val : {val_df.shape[0]:,} rows ({val_df['timestamp'].min().date()} → {val_df['timestamp'].max().date()})")
log.info(
f"test : {test_df.shape[0]:,} rows ({test_df['timestamp'].min().date()} → {test_df['timestamp'].max().date()})"
)
3. ML Quantile Model: QR XGBoost#
QRXGBOOSTConfig trains a multi-quantile XGBoost model. Each quantile level is a separate output; the median (q=0.5) becomes the point forecast and the outer quantiles (q=0.05, q=0.95) form a 90 % prediction band.
The interval emerges directly from the model - no distributional assumption required. For coverage-guaranteed intervals via conformal post-processing, see NB10 - Conformal Prediction.
from IPython.display import clear_output
from twiga import TwigaForecaster
from twiga.models.ml import QRXGBOOSTConfig
qr_xg_config = QRXGBOOSTConfig(device="cpu")
forecaster_qr_xg = TwigaForecaster(
data_params=data_config,
model_params=[qr_xg_config],
train_params=train_config,
)
forecaster_qr_xg.fit(train_df=train_df, val_df=val_df)
clear_output()
log.info("QR XGBoost training complete.")
Reading QR XGBoost quantile metrics
pinball: the mean pinball loss averaged across all quantile levels and time steps. Lower is better; compare across models on the same dataset, but do not compare across different alpha settings.
sharpness: mean interval width (upper − lower). Smaller values indicate tighter, more informative intervals - desirable as long as PICP is maintained.
mae / rmse: point forecast quality using the median quantile as the prediction. These match what you would see from a standard point-forecast model.
pred_qr_xg, metric_qr_xg = forecaster_qr_xg.evaluate_quantile_forecast(test_df=test_df)
clear_output()
metric_qr_xg.columns
def get_metric_table(metric_df):
res = (
metric_df.groupby("Model")[["mae", "corr", "pinball", "crps", "sharpness", "calibration_error"]]
.mean()
.round(2)
.reset_index()
)
res = res.rename(
columns={
"mae": "MAE",
"corr": "Corr",
"pinball": "PINBALL",
"crps": "CRPS",
"sharpness": "SHARPNESS",
"calibration_error": "Cal-err",
}
)
metric_name = ["MAE", "Corr", "PINBALL", "CRPS", "SHARPNESS", "Cal-err"]
    minimize_cols = ["MAE", "PINBALL", "CRPS", "SHARPNESS", "Cal-err"]
maximize_cols = ["Corr"]
return twiga_report(res, metric_name, minimize_cols, maximize_cols)
get_metric_table(metric_qr_xg)
4. NN Quantile Model: MLPF QR#
MLPFConfig(distribution="qr") adds a Group Additive Model head on top of the MLP encoder, giving the model an inductive bias toward additive feature contributions - useful when solar irradiance and calendar effects have near-separable impacts on net load.
As with the ML model, the quantile outputs form the interval directly. No calibration step is needed to obtain bounds.
from twiga.models.nn import MLPFConfig
qr_mlpf_config = MLPFConfig(distribution="qr", max_epochs=10, rich_progress_bar=False)
forecaster_qr_mlpf = TwigaForecaster(
data_params=data_config,
model_params=[qr_mlpf_config],
train_params=train_config,
)
forecaster_qr_mlpf.fit(train_df=train_df, val_df=val_df)
clear_output()
log.info("MLPF QR training complete.")
pred_qr_mlpf, metric_qr_mlpf = forecaster_qr_mlpf.evaluate_quantile_forecast(test_df=test_df)
clear_output()
get_metric_table(metric_qr_mlpf)
4b. NN Flexible-Quantile Model: MLPGAM FPQR#
MLPFConfig(distribution="fpqr") pairs the GAM backbone with a Flexible Proportional Quantile Regression head. Unlike fixed-grid QR (where τ values are pre-specified), FPQR learns its own quantile grid adaptively during training via a QuantileProposal network.
Key differences from fixed-grid QR:
| Property | Fixed QR | Flexible QR (FPQR) |
|---|---|---|
| Quantile levels | Fixed (e.g. 0.05 → 0.95) | Learned per sample |
| Grid size | Set by the quantiles list | Set by n_quantiles |
| Point forecast | Median of fixed grid | Weighted expectation of proposed quantiles |
| Use case | Calibrated specific quantiles | Flexible uncertainty shape |
The quantile_levels returned are averaged over the batch for display purposes,
since each sample proposes a slightly different grid.
fpqr_mlpf_config = MLPFConfig(
distribution="fpqr",
n_quantiles=9,
conf_level=0.05,
max_epochs=10,
rich_progress_bar=False,
)
forecaster_fpqr_mlpf = TwigaForecaster(
data_params=data_config,
model_params=[fpqr_mlpf_config],
train_params=train_config,
)
forecaster_fpqr_mlpf.fit(train_df=train_df, val_df=val_df)
clear_output()
log.info("MLPF FPQR training complete.")
pred_fpqr_mlpf, metrics_fpqr_mlpf = forecaster_fpqr_mlpf.evaluate_quantile_forecast(test_df=test_df)
# clear_output()
get_metric_table(metrics_fpqr_mlpf)
# QR workflow summary
workflow_df = pd.DataFrame(
{
"Step": ["Choose a QR model", "Assemble forecaster", "Train", "Get interval predictions", "Compute metrics"],
"API call": [
        "QRXGBOOSTConfig(), MLPFConfig(distribution='qr'), or MLPFConfig(distribution='fpqr')",
        "TwigaForecaster(data_params, model_params, train_params)",
        ".fit(train_df, val_df)",
        ".evaluate_quantile_forecast(test_df)",
        "get_metric_table(metric_df)",
],
}
)
twiga_gt(
GT(workflow_df)
.tab_header(title=md("**Quantile Regression Workflow**"), subtitle="End-to-end steps in Twiga")
.cols_label(**{c: md(f"**{c}**") for c in workflow_df.columns})
.tab_source_note("Twiga Forecast"),
n_rows=len(workflow_df),
)
from great_tables import GT, md
import pandas as pd
from twiga.core.plot.gt import twiga_gt
# Fixed QR vs FPQR comparison
comparison_df = pd.DataFrame(
{
"Property": ["Quantile levels", "Point forecast", "Key param", "Typical advantage"],
"Fixed QR": [
"Fixed at init (conf_level bounds + quantiles list)",
"Median quantile of fixed grid",
"quantiles=[0.1, ..., 0.9]",
"Easier to audit specific levels; stable training",
],
"FPQR": [
"Learned per sample via QuantileProposal network",
"Weighted expectation ∑ τ̂ᵢ · qᵢ",
"n_quantiles=9",
"Better uncertainty shape on complex / skewed data",
],
}
)
twiga_gt(
GT(comparison_df)
.tab_header(
title=md("**Fixed-Grid QR vs Flexible QR (FPQR)**"),
subtitle="Choose based on auditability vs. adaptability needs",
)
.cols_label(**{c: md(f"**{c}**") for c in comparison_df.columns})
.tab_source_note("Twiga Forecast"),
n_rows=len(comparison_df),
)
# ruff: noqa: E501, E701, E702
from IPython.display import HTML
_TEAL = "#107591"
_TEAL_MID = "#069fac"
_TEAL_LIGHT = "#e8f5f8"
_TEAL_BEST = "#d0ecf1"
_TEXT_DARK = "#2d3748"
_TEXT_MUTED = "#718096"
_WHITE = "#ffffff"
steps = [
{
"num": "06",
"title": "Backtesting & Evaluation",
"desc": "Rolling-window backtesting · fold-level metrics",
"tags": ["backtesting", "evaluation"],
"active": False,
},
{
"num": "07",
"title": "Neural Networks",
"desc": "MLPF · N-HiTS · Lightning training loop",
"tags": ["neural network", "pytorch"],
"active": False,
},
{
"num": "08",
"title": "Quantile Regression",
"desc": "QR-LightGBM · QR-XGBoost · FPQR — calibrated prediction intervals",
"tags": ["probabilistic", "quantile", "pinball loss"],
"active": True,
},
{
"num": "09",
"title": "Parametric Distributions",
"desc": "Normal · Laplace · Gamma heads — NLL training",
"tags": ["parametric", "NLL", "distributions"],
"active": False,
},
{
"num": "10",
"title": "Conformal Prediction",
"desc": "Coverage-guaranteed intervals with CQR and CRC",
"tags": ["conformal", "CQR", "CRC"],
"active": False,
},
]
track_name = "Probabilistic Track"
footer = 'Next: <span style="color:#107591;font-weight:600;">Parametric Distributions</span> (09) for NLL-trained models, or <span style="color:#107591;font-weight:600;">Conformal Prediction</span> (10) for coverage guarantees.'
def _b(t, bg, fg):
return f'<span style="display:inline-block;background:{bg};color:{fg};font-size:10px;font-weight:600;padding:2px 7px;border-radius:10px;margin:2px 2px 0 0;">{t}</span>'
ch = ""
for i, s in enumerate(steps):
a = s["active"]
cb = _TEAL if a else _WHITE
cbo = _TEAL if a else "#d1ecf1"
nb = _TEAL_MID if a else _TEAL_LIGHT
nf = _WHITE if a else _TEAL
tf = _WHITE if a else _TEXT_DARK
df = "#cce8ef" if a else _TEXT_MUTED
bb = "#0d5f75" if a else _TEAL_BEST
bf = "#b8e4ed" if a else _TEAL
yh = (
f'<span style="float:right;background:{_TEAL_MID};color:{_WHITE};font-size:10px;font-weight:700;padding:2px 10px;border-radius:12px;">★ you are here</span>'
if a
else ""
)
bdg = "".join(_b(t, bb, bf) for t in s["tags"])
ch += f'<div style="background:{cb};border:2px solid {cbo};border-radius:12px;padding:16px 20px;display:flex;align-items:flex-start;gap:16px;box-shadow:{"0 4px 14px rgba(16,117,145,.25)" if a else "0 1px 4px rgba(0,0,0,.06)"};"><div style="min-width:44px;height:44px;background:{nb};color:{nf};border-radius:50%;display:flex;align-items:center;justify-content:center;font-size:15px;font-weight:800;flex-shrink:0;">{s["num"]}</div><div style="flex:1;"><div style="font-size:15px;font-weight:700;color:{tf};margin-bottom:4px;">{s["title"]}{yh}</div><div style="font-size:12.5px;color:{df};margin-bottom:8px;line-height:1.5;">{s["desc"]}</div><div>{bdg}</div></div></div>'
if i < len(steps) - 1:
ch += f'<div style="display:flex;justify-content:center;height:32px;"><svg width="24" height="32" viewBox="0 0 24 32" fill="none"><line x1="12" y1="0" x2="12" y2="24" stroke="{_TEAL_MID}" stroke-width="2" stroke-dasharray="4 3"/><polygon points="6,20 18,20 12,30" fill="{_TEAL_MID}"/></svg></div>'
HTML(
f'<div style="font-family:Inter,\'Segoe UI\',sans-serif;max-width:640px;margin:8px 0;"><div style="background:linear-gradient(135deg,{_TEAL} 0%,{_TEAL_MID} 100%);border-radius:12px 12px 0 0;padding:14px 20px;display:flex;align-items:center;gap:10px;"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="{_WHITE}" stroke-width="2"><path d="M12 2L2 7l10 5 10-5-10-5z"/><path d="M2 17l10 5 10-5"/><path d="M2 12l10 5 10-5"/></svg><span style="color:{_WHITE};font-size:14px;font-weight:700;">Twiga Learning Path — {track_name}</span></div><div style="border:2px solid {_TEAL_LIGHT};border-top:none;border-radius:0 0 12px 12px;padding:20px 20px 16px;background:#f9fdfe;display:flex;flex-direction:column;">{ch}<div style="margin-top:16px;font-size:11.5px;color:{_TEXT_MUTED};text-align:center;border-top:1px solid {_TEAL_LIGHT};padding-top:12px;">{footer}</div></div></div>'
)
Fan Chart#
plot_quantile_fan renders graduated ribbon bands from widest to narrowest,
with the median line on top - the standard chart for communicating quantile
forecast uncertainty in energy and meteorological forecasting.
Interval Comparison#
plot_forecast_intervals overlays multiple models’ prediction intervals as
semi-transparent ribbons - useful for visually comparing sharpness and coverage
across methods on the same axis.