Conformal Prediction & Uncertainty Quantification#

Source Files
  • twiga/distributions/conformal/core.py

  • twiga/distributions/conformal/base.py

  • twiga/distributions/conformal/cqr.py

  • twiga/distributions/conformal/residual_conformal

  • twiga/core/config/base.py

Conformal prediction provides distribution-free prediction intervals with finite-sample coverage guarantees. Unlike Bayesian methods, conformal prediction makes no assumptions about the data distribution - it only requires exchangeability of calibration data.

Because calibration needs only a set of predictions and ground truth from a held-out split, conformal prediction wraps any trained Twiga model without retraining - including plain point-forecast ML models like CatBoost or XGBoost. This makes it the recommended path for adding uncertainty estimates to an existing point forecaster.

How It Works#

        graph LR
    A[Train Model] --> B[Calibrate on Held-Out Data]
    B --> C[Compute Non-Conformity Scores]
    C --> D["Calculate Threshold q̂ at (1-α) quantile"]
    D --> E[Generate Intervals on New Data]
    E --> F["[prediction - q̂, prediction + q̂]"]
    
  1. Train the model on training data

  2. Calibrate using held-out data to compute non-conformity scores

  3. Calculate threshold \(\hat{q}\) as the \((1-\alpha)\)-quantile of the scores

  4. Generate intervals on new data using the calibrated threshold

The resulting intervals have guaranteed marginal coverage of at least \(1-\alpha\).

Class Hierarchy#

        classDiagram
    class BaseConformal {
        <<abstract>>
        +alpha: float
        +calib_method: str
        +lambda_: float
        +q_hat: float | ndarray
        +calibrate(*calib_args, axis)
        +calculate_conformal_quantile(scores, axis, calib_timestamps)
        +get_scores()*
        +generate_intervals()*
    }

    class SplitConformal {
        +score_type: "res" | "sign-res"
        +get_scores(predicts, targets)
        +generate_intervals(mu_pred)
    }

    class ConformalQuantileRegressor {
        +score_type: "scaled" | "unscaled"
        +get_scores(lower_q, upper_q, targets)
        +generate_intervals(lower_q, upper_q)
    }

    class ConformalResidualFitting {
        +score_type: "res" | "sign-res"
        +q_hat_low: float | ndarray
        +get_scores(predicts, sigma, targets)
        +calibrate(predicts, sigma, targets, calib_timestamps)
        +generate_intervals(loc, sigma)
    }

    class Conformal {
        <<factory>>
        +__new__(method, score_type, alpha, calib_method, lambda_)
    }

    BaseConformal <|-- SplitConformal
    BaseConformal <|-- ConformalQuantileRegressor
    BaseConformal <|-- ConformalResidualFitting
    Conformal ..> SplitConformal : creates
    Conformal ..> ConformalQuantileRegressor : creates
    Conformal ..> ConformalResidualFitting : creates
    

Factory: Conformal#

The Conformal class in twiga/distributions/conformal/core.py is a factory that selects the appropriate method:

from twiga.distributions.conformal.core import Conformal

# Creates a SplitConformal instance
cp = Conformal(method="residual", score_type="res", alpha=0.1)

# Creates a ConformalQuantileRegressor instance
cqr = Conformal(method="quantile", score_type="scaled", alpha=0.1)

# Creates a ConformalResidualFitting instance with uniform calibration
crf = Conformal(method="residual-fitting", score_type="sign-res", alpha=0.1)

# Creates a ConformalResidualFitting instance with temporal weighting
crf_t = Conformal(
    method="residual-fitting",
    score_type="sign-res",
    alpha=0.1,
    calib_method="temporal",
    lambda_=1.0,
)

calib_method and lambda_ are forwarded only when method="residual-fitting".

Configuration#

Conformal prediction is configured via ConformalConfig:

from twiga.core.config import ConformalConfig

# Standard uniform calibration:
config = ConformalConfig(
    method="residual-fitting",
    score_type="sign-res",
    alpha=0.1,
)

# Temporal calibration weighting (addresses seasonal distribution shift):
config = ConformalConfig(
    method="residual-fitting",
    score_type="sign-res",
    alpha=0.1,
    calib_method="temporal",
    lambda_=1.0,
)

# Via coverage probability:
config = ConformalConfig.from_coverage(
    coverage=0.9,
    method="residual",
    score_type="res",
)

Parameter

Type

Constraints

Description

method

Literal["residual", "quantile", "residual-fitting"]

Required

Conformal prediction method

score_type

Literal["scaled", "unscaled", "res", "sign-res"]

Required

Non-conformity score type

alpha

float

0 < alpha < 1

Significance level (1 minus alpha = target coverage)

calib_method

Literal["uniform", "temporal"]

Default "uniform"

Quantile estimation method for calibration scores

lambda_

float

>= 0, default 1.0

Exponential decay rate for temporal weighting

ConformalConfig.from_coverage(coverage, **kwargs) is a convenience factory that converts a coverage probability to alpha = 1 - coverage and forwards any remaining keyword arguments to the constructor.

Valid Score Types per Method#

Method

Class

Valid Score Types

Description

"residual"

SplitConformal

"res", "sign-res"

Absolute or signed residuals

"quantile"

ConformalQuantileRegressor

"scaled", "unscaled"

CQR with or without scaling

"residual-fitting"

ConformalResidualFitting

"res", "sign-res"

Scale-adapted residuals

Methods#

Split Conformal (method="residual")#

The simplest method. Computes non-conformity scores as residuals between predictions and targets.

Score types:

  • "res": \(s_i = |y_i - \hat{y}_i|\) (absolute residuals)

  • "sign-res": \(s_i = y_i - \hat{y}_i\) (signed residuals)

Intervals: \([\hat{y} - \hat{q}, \; \hat{y} + \hat{q}]\)

from twiga.distributions.conformal.base import SplitConformal

cp = SplitConformal(score_type="res", alpha=0.1)
cp.calibrate(predictions, targets)
lower, upper = cp.generate_intervals(new_predictions)

Tip

Split conformal works with any point prediction model and is the easiest to set up. Use this as a starting point.

Conformal Quantile Regression (method="quantile")#

Requires a quantile regression model that produces lower and upper quantile predictions.

Score types:

  • "unscaled": \(s_i = \max(q_{lo,i} - y_i, \; y_i - q_{hi,i})\)

  • "scaled": \(s_i = \max\left(\frac{q_{lo,i} - y_i}{q_{hi,i} - q_{lo,i}}, \; \frac{y_i - q_{hi,i}}{q_{hi,i} - q_{lo,i}}\right)\)

Intervals (unscaled): \([q_{lo} - \hat{q}, \; q_{hi} + \hat{q}]\)

Intervals (scaled): \([q_{lo} - \hat{q} \cdot (q_{hi} - q_{lo}), \; q_{hi} + \hat{q} \cdot (q_{hi} - q_{lo})]\)

from twiga.distributions.conformal.cqr import ConformalQuantileRegressor

cqr = ConformalQuantileRegressor(score_type="scaled", alpha=0.1)
cqr.calibrate(lower_quantile, upper_quantile, targets)
lower, upper = cqr.generate_intervals(lower_quantile_new, upper_quantile_new)

Note

Scaled CQR adapts interval width based on the model’s quantile spread, producing narrower intervals where the model is more confident. Use with quantile regression models.

Conformal Residual Fitting (method="residual-fitting")#

Requires a model that produces both a point prediction (\(\mu\)) and a scale estimate (\(\sigma\)).

Score types:

  • "res": \(s_i = \frac{|\hat{y}_i - y_i|}{\sigma_i}\) (absolute)

  • "sign-res": \(s_i = \frac{\hat{y}_i - y_i}{\sigma_i}\) (signed, can be negative)

Intervals (res): \([\mu - \sigma \cdot \hat{q}, \; \mu + \sigma \cdot \hat{q}]\)

Intervals (sign-res): \([\mu - \sigma \cdot \hat{q}^+, \; \mu - \sigma \cdot \hat{q}^-]\) where \(\hat{q}^+ = Q_{1-\alpha/2}\) and \(\hat{q}^- = Q_{\alpha/2}\) of the signed scores, giving asymmetric bounds with \(1-\alpha\) marginal coverage.

from twiga.distributions.conformal.crc import ConformalResidualFitting

# Uniform calibration (default):
crf = ConformalResidualFitting(score_type="sign-res", alpha=0.1)
crf.calibrate(predictions, sigma, targets)
lower, upper = crf.generate_intervals(new_predictions, new_sigma)

# Temporal calibration (pass calibration timestamps):
import pandas as pd
crf_t = ConformalResidualFitting(
    score_type="sign-res",
    alpha=0.1,
    calib_method="temporal",
    lambda_=1.0,
)
calib_timestamps = pd.to_datetime(calib_df["timestamp"])
crf_t.calibrate(predictions, sigma, targets, calib_timestamps=calib_timestamps)
lower, upper = crf_t.generate_intervals(new_predictions, new_sigma)

Note

Use residual-fitting with probabilistic models like GAUSSCATBOOSTModel or neural models with sigma outputs like MLPGAFModel.

Temporal Calibration Weighting#

Standard conformal prediction assumes exchangeability: calibration and test samples come from the same distribution. In rolling-origin forecasting this assumption can be violated when calibration data is drawn from a different season than the test window.

Temporal weighting (Tibshirani et al. 2019) addresses this by assigning higher weight to calibration samples that are more recent and therefore closer in distribution to the test period.

Weight formula:

\[w_i \propto \exp(-\lambda \cdot d_i), \quad d_i = \frac{t_{\max} - t_i}{t_{\max} - t_{\min}}\]

where \(d_i \in [0, 1]\) is the normalised age of sample \(i\) (\(d_i = 0\) for the newest, \(d_i = 1\) for the oldest). Weights are normalised to sum to 1.

The weighted quantile selects the smallest score \(s_{(k)}\) such that:

\[\sum_{j=1}^{k} w_{(j)} \geq 1 - \alpha\]

Decay rate lambda_:

Value

Effect

0.0

Uniform weighting (same as calib_method="uniform")

1.0

Moderate recency bias (recommended starting point)

2.0

Strong recency bias; upweights the most recent ~30 % of calibration data

When to use temporal weighting

Use calib_method="temporal" when the calibration window is drawn from a different season than the test window, for example in expanding-window cross-validation where calibration is always at the training tail while the test window advances into a new season. Temporal weighting does not change the CV protocol and is compatible with any calib_size.

from twiga.core.config import ConformalConfig
from twiga.forecaster.core import TwigaForecaster

conformal_config = ConformalConfig(
    method="residual-fitting",
    score_type="sign-res",
    alpha=0.1,
    calib_method="temporal",
    lambda_=1.0,
)

forecaster = TwigaForecaster(
    data_params=data_config,
    model_params=[model_cfg],
    cv_params=exp_config,
    conformal_params=conformal_config,
)

forecaster.fit(train_df=train_df)
forecaster.calibrate(calibrate_df=calib_df)   # timestamps extracted automatically
lower, point, upper = forecaster.predict_interval(test_df=test_df)

TwigaForecaster.calibrate() extracts per-sample timestamps from the calibration data automatically and passes them to ConformalResidualFitting.calibrate() when calib_method="temporal".

Helper Functions#

The weighting utilities are available as standalone functions for custom workflows:

import pandas as pd
from twiga.distributions.conformal.base import (
    calculate_temporal_weights,
    calculate_weighted_conformal_value,
)

timestamps = pd.date_range("2021-09-01", periods=90, freq="D")
weights = calculate_temporal_weights(timestamps, lambda_=1.0)
# weights: ndarray of shape (90,) summing to 1, newest sample gets highest weight

threshold = calculate_weighted_conformal_value(scores, alpha=0.1, weights=weights)
# threshold: scalar (1-D scores) or ndarray of shape (H,) for horizon-wise scores

Integration with TwigaForecaster#

The TwigaForecaster manages conformal prediction end-to-end:

from twiga.core.config import (
    ConformalConfig, DataPipelineConfig, ExperimentConfig,
)
from twiga.forecaster.core import TwigaForecaster
from twiga.models.ml.xgboost_model import XGBOOSTConfig

# 1. Configure with conformal prediction
conformal_config = ConformalConfig(
    method="residual",
    score_type="res",
    alpha=0.1,
)

forecaster = TwigaForecaster(
    data_params=data_config,
    model_params=[XGBOOSTConfig()],
    cv_params=train_config,
    conformal_params=conformal_config,
)

# 2. Train
forecaster.fit(train_df=train_df, val_df=val_df)

# 3. Calibrate on held-out data
forecaster.calibrate(calibrate_df=calibration_df)

# 4. Generate prediction intervals
interval_dict, times = forecaster.predict_interval(test_df=test_df)

for model_name, (lower, point, upper) in interval_dict.items():
    print(f"{model_name}: coverage target = {1 - conformal_config.alpha:.0%}")

# 5. Evaluate interval quality
predictions_df, metrics_df = forecaster.evaluate_interval_forecast(test_df=test_df)
# metrics_df includes: picp (coverage), winkle-score, ace, nmpi, cwe

Calibration Flow#

        sequenceDiagram
    participant User
    participant TF as TwigaForecaster
    participant Conf as Conformal Factory
    participant Model

    User->>TF: calibrate(calibrate_df)
    TF->>Model: predict(calibrate_df)
    Model-->>TF: predictions
    TF->>TF: get_ground_truth()
    loop For each model
        TF->>Conf: Conformal(method, score_type, alpha)
        Conf-->>TF: conformal_instance
        TF->>Conf: calibrate(predictions, targets)
        Conf->>Conf: get_scores() → calculate_conformal_quantile()
        Conf-->>TF: calibrated (q_hat set)
        TF->>TF: store in self.conformal[model_name]
    end
    

API Reference#

twiga.distributions.conformal.base.calculate_conformal_value(scores, alpha, axis=0)#

Calculate the 1-alpha quantile of scores for conformal prediction using NumPy.

This function computes the threshold value (quantile) used to construct prediction sets based on the given non-conformity scores and significance level alpha. If the scores are empty or the quantile value exceeds 1, it returns an array of np.inf.

Parameters:
  • scores (ndarray) – Non-conformity scores with shape (N, …).

  • alpha (float) – Significance level, must be between 0 and 1.

  • axis (int) – Axis along which to compute the quantile. Defaults to 0.

Return type:

ndarray

Returns:

np.ndarray – The threshold value used to construct prediction sets, with shape (scores.shape[1:]).

Raises:

ValueError – If alpha is not between 0 and 1.

twiga.distributions.conformal.base.calculate_temporal_weights(timestamps, lambda_)#

Compute exponentially decaying temporal weights for calibration samples.

Weights are proportional to exp(-lambda_ * d_i) where d_i is the normalised age of sample i relative to the most recent calibration point (d_i = 0 for the newest sample, d_i = 1 for the oldest). The resulting weights are normalised to sum to 1.

A lambda_ of 0 recovers uniform weights. Larger values concentrate weight on more recent samples, reducing the influence of calibration samples that may have been drawn from a different seasonal regime.

Parameters:
  • timestamps (Series | ndarray) – Sequence of calibration timestamps (any datetime-like or numeric type). Must be 1-D with length equal to the number of calibration samples.

  • lambda – Exponential decay rate. Must be >= 0.

Return type:

ndarray

Returns:

1-D array of normalised weights summing to 1, shape (N,).

Raises:

ValueError – If lambda_ is negative or timestamps is empty.

twiga.distributions.conformal.base.calculate_weighted_conformal_value(scores, alpha, weights)#

Compute the weighted (1-alpha) quantile of non-conformity scores.

Implements the weighted conformal prediction quantile of Tibshirani et al. (2019). For each horizon column, scores are sorted ascending and the smallest score whose cumulative normalised weight reaches 1 - alpha is returned as the threshold.

Parameters:
  • scores (ndarray) – Non-conformity scores with shape (N,) or (N, H).

  • alpha (float) – Significance level in (0, 1).

  • weights (ndarray) – Normalised sample weights with shape (N,). Must sum to 1.

Return type:

ndarray

Returns:

Threshold(s) with shape () (scalar) for 1-D input or (H,) for 2-D input. Returns np.inf for any horizon where the cumulative weight cannot reach 1 - alpha.

Raises:

ValueError – If alpha is outside (0, 1) or shapes are incompatible.

class twiga.distributions.conformal.base.BaseConformal(alpha=0.1, calib_method='uniform', lambda_=1.0)#

Bases: ABC

Abstract base class for Conformal Prediction in Regression.

Provides core functionality for constructing prediction intervals with finite-sample coverage guarantees.

Variables:
  • alpha (float) – Significance level for prediction intervals (0 < alpha < 1).

  • calib_method (str) – Quantile estimation method — "uniform" (default) or "temporal" (exponentially weighted by sample recency).

  • lambda (float) – Exponential decay rate for temporal weighting. Ignored when calib_method="uniform".

  • q_hat (float | np.ndarray) – Calibrated threshold(s) for intervals.

Parameters:
  • alpha (float) – Significance level (0 < alpha < 1). Defaults to 0.1.

  • calib_method (str) – "uniform" or "temporal". Defaults to "uniform".

  • lambda – Decay rate for temporal weighting. Defaults to 1.0.

Raises:

ValueError – If alpha is not in (0, 1) or calib_method is invalid.

__init__(alpha=0.1, calib_method='uniform', lambda_=1.0)#

Initializes base conformal predictor with validation.

calculate_conformal_quantile(scores, axis=0, calib_timestamps=None, alpha=None)#

Computes the (1-alpha)-adjusted quantile of non-conformity scores.

Dispatches to the uniform empirical quantile (Lei et al. 2017) or the temporally weighted quantile (Tibshirani et al. 2019) depending on self.calib_method.

Parameters:
  • scores (ndarray) – Non-conformity scores array with shape (N,) or (N, H).

  • axis (int) – Quantile computation axis (uniform method only). Defaults to 0.

  • calib_timestamps (Series | ndarray | None) – Calibration sample timestamps required when calib_method="temporal". Ignored for "uniform".

  • alpha (float | None) – Significance level override. When None (default) self.alpha is used. Pass self.alpha / 2 for split-quantile calibration (e.g. sign-res score type) to compute each tail at half the nominal level, achieving valid 1-alpha joint coverage.

Return type:

float | ndarray

Returns:

Quantile values for interval construction.

Raises:

ValueError – For empty scores or missing timestamps in temporal mode.

calibrate(*calib_args, axis=0)#

Calibrates conformal thresholds using provided arguments.

Parameters:
  • *calib_args – Implementation-specific calibration data

  • axis (int) – Axis for quantile computation. Defaults to 0.

Raises:

ValueError – If calibration data validation fails

Return type:

None

abstractmethod generate_intervals(*pred_args)#

Generates prediction intervals from inputs.

Must be implemented by concrete subclasses.

Return type:

tuple[ndarray, ndarray]

Returns:

Tuple of (lower bounds, upper bounds)

abstractmethod get_scores(*args)#

Computes non-conformity scores from inputs.

Must be implemented by concrete subclasses.

Return type:

ndarray

Returns:

Array of non-conformity scores

class twiga.distributions.conformal.base.SplitConformal(score_type='res', alpha=0.1, calib_method='uniform', lambda_=1.0)#

Bases: BaseConformal

Implements residual-based conformal prediction for regression tasks.

Variables:

score_type (str) – Type of non-conformity score (‘res’ or ‘sign-res’).

generate_intervals(mu_pred)#

Generates prediction intervals.

Parameters:

mu_pred (ndarray) – Model predictions.

Return type:

tuple[ndarray, ndarray]

Returns:

tuple[np.ndarray, np.ndarray] – Lower and upper bounds.

get_scores(predicts, targets)#

Computes non-conformity scores based on residuals.

Parameters:
  • predicts (ndarray) – Model predictions.

  • targets (ndarray) – Ground truth targets.

Return type:

ndarray

Returns:

np.ndarray – Non-conformity scores.

class twiga.distributions.conformal.cqr.ConformalQuantileRegressor(score_type='scaled', alpha=0.1)#

Bases: BaseConformal

Conformal quantile regression for uncertainty estimation.

Implements conformal prediction intervals for quantile regression models using either scaled or unscaled non-conformity scores.

Variables:
  • score_type (str) – Type of non-conformity score (‘scaled’ or ‘unscaled’)

  • alpha (float) – Significance level for prediction intervals

  • q_hat (float | np.ndarray) – Calibrated threshold(s)

  • _EPS (float) – Numerical stability constant

Parameters:
  • score_type (str) – Score calculation method. Defaults to ‘scaled’.

  • alpha (float) – Significance level (0 < alpha < 1). Defaults to 0.1.

Raises:

ValueError – For invalid score_type or alpha values.

generate_intervals(lower_quantile, upper_quantile)#

Generates calibrated prediction intervals.

Parameters:
  • lower_quantile (ndarray) – Predicted lower quantiles

  • upper_quantile (ndarray) – Predicted upper quantiles

Return type:

tuple[ndarray, ndarray]

Returns:

Tuple of (lower intervals, upper intervals)

get_scores(lower_quantile, upper_quantile, targets)#

Computes non-conformity scores for quantile predictions.

Parameters:
  • lower_quantile (ndarray) – Predicted lower quantiles

  • upper_quantile (ndarray) – Predicted upper quantiles

  • targets (ndarray) – Ground truth values

Return type:

ndarray

Returns:

Array of non-conformity scores

class twiga.distributions.conformal.crc.ConformalResidualFitting(score_type='res', alpha=0.1, calib_method='uniform', lambda_=1.0)#

Bases: BaseConformal

Conformal residual fitting for uncertainty estimation.

Implements conformal prediction intervals using residual-based scoring with optional scale adaptation.

For score_type="res" (default), symmetric intervals are built from the (1-alpha)-quantile of absolute scaled residuals.

For score_type="sign-res", asymmetric intervals are built using a split-quantile approach: the (1-alpha/2)-quantile of signed scores gives the lower bound, and the (alpha/2)-quantile gives the upper bound, achieving valid marginal coverage at level 1-alpha.

Temporal weighting (calib_method="temporal") applies exponentially decaying weights to calibration samples by recency, addressing distribution shift between the calibration and test periods (Tibshirani et al. 2019).

Variables:
  • score_type (str) – Score calculation method (‘res’ or ‘sign-res’).

  • alpha (float) – Significance level for intervals.

  • calib_method (str) – "uniform" or "temporal".

  • lambda (float) – Exponential decay rate for temporal weighting.

  • q_hat (float | np.ndarray) – Upper-tail calibrated threshold (lower bound).

  • q_hat_low (float | np.ndarray) – Lower-tail calibrated threshold (upper bound, sign-res only).

Parameters:
  • score_type (str) – Residual scoring method. Defaults to 'res'.

  • alpha (float) – Significance level (0 < alpha < 1). Defaults to 0.1.

  • calib_method (str) – "uniform" (default) or "temporal".

  • lambda – Exponential decay rate for temporal weighting. Defaults to 1.0.

Raises:

ValueError – For invalid score_type or alpha values.

__init__(score_type='res', alpha=0.1, calib_method='uniform', lambda_=1.0)#

Initializes residual fitter with validation.

calibrate(predicts, sigma, targets, axis=0, calib_timestamps=None)#

Calibrates conformal thresholds from calibration data.

For sign-res, computes split quantiles at alpha/2 and 1-alpha/2 to construct asymmetric intervals with valid 1-alpha marginal coverage. For res, delegates to the base (1-alpha)-quantile of absolute scores.

Parameters:
  • predicts (ndarray) – Model predictions on calibration set.

  • sigma (ndarray) – Scale factors on calibration set.

  • targets (ndarray) – Ground truth values on calibration set.

  • axis (int) – Axis for quantile computation. Defaults to 0.

  • calib_timestamps (Series | ndarray | None) – Calibration sample timestamps required when calib_method="temporal". Ignored for "uniform".

Return type:

None

generate_intervals(loc, sigma)#

Generates calibrated prediction intervals.

For sign-res, produces asymmetric bounds using the two split quantiles set by calibrate(). For res, produces symmetric bounds.

Parameters:
Return type:

tuple[ndarray, ndarray]

Returns:

Tuple of (lower bound, upper bound) arrays.

get_scores(predicts, sigma, targets)#

Computes scaled residual scores.

Parameters:
  • predicts (ndarray) – Model predictions

  • sigma (ndarray) – Scale factors

  • targets (ndarray) – Ground truth values

Return type:

ndarray

Returns:

Array of non-conformity scores

class twiga.distributions.conformal.core.Conformal(method, score_type='res', alpha=0.1, calib_method='uniform', lambda_=1.0)#

Bases: object

Factory class to create method-specific conformal predictors.