Evaluation Metrics#

Source Files
  • twiga/core/metrics/point.py

  • twiga/core/metrics/interval.py

  • twiga/core/metrics/quantile.py

  • twiga/core/metrics/parametric.py

  • twiga/core/metrics/stats.py

Twiga provides a comprehensive set of metrics for evaluating point forecasts, prediction intervals, quantile forecasts, and full probabilistic forecasts. All metrics operate on NumPy arrays and are accessible via evaluate_forecast().

Point Forecast Metrics#

Defined in twiga/core/metrics/point.py. Used via evaluate_point_forecast().

| Metric | Key | Formula | Description |
|---|---|---|---|
| Mean Error | res | \(\frac{1}{T}\sum(y_t - \hat{y}_t)\) | Average signed error (bias indicator) |
| MAE | mae | \(\frac{1}{T}\sum\lvert y_t - \hat{y}_t\rvert\) | Mean Absolute Error |
| MedAE | medae | \(\text{median}(\lvert y_t - \hat{y}_t\rvert)\) | Median Absolute Error — robust to outliers |
| MSE | mse | \(\frac{1}{T}\sum(y_t - \hat{y}_t)^2\) | Mean Squared Error |
| RMSE | rmse | \(\sqrt{MSE}\) | Root Mean Squared Error |
| RMSLE | rmsle | \(\sqrt{\frac{1}{T}\sum(\log(1+\hat{y}_t)-\log(1+y_t))^2}\) | Root Mean Squared Log Error — scale-invariant, requires \(y,\hat{y}\geq 0\) |
| NRMSE | nrmse | \(\frac{RMSE}{\text{scale}}\) | Normalized RMSE (scale: max, range, or mean) |
| MAPE | mape | \(\frac{100}{T}\sum\frac{\lvert y_t - \hat{y}_t\rvert}{\lvert y_t\rvert}\) | Mean Absolute Percentage Error |
| sMAPE | smape | \(\frac{200}{T}\sum\frac{\lvert y_t - \hat{y}_t\rvert}{\lvert y_t\rvert + \lvert\hat{y}_t\rvert}\) | Symmetric MAPE |
| WMAPE | wmape | \(100 \times \frac{\sum\lvert y_t - \hat{y}_t\rvert}{\sum\lvert y_t\rvert}\) | Weighted MAPE |
| OPE | ope | \(100 \times \frac{\lvert\sum y_t - \sum \hat{y}_t\rvert}{\sum y_t}\) | Overall Percentage Error |
| MASE | mase | \(\frac{MAE}{MAE_{\text{naive}}}\) | Mean Absolute Scaled Error (naive = lag-1) |
| MSSE | msse | \(\frac{MSE}{MSE_{\text{naive}}}\) | Mean Squared Scaled Error |
| RMSSE | rmsse | \(\frac{RMSE}{RMSE_{\text{naive}}}\) | Root Mean Squared Scaled Error |
| R² | r2 | \(1 - \frac{\sum(y_t-\hat{y}_t)^2}{\sum(y_t-\bar{y})^2}\) | Coefficient of Determination — proportion of variance explained |
| Correlation | corr | \(\rho(y, \hat{y})\) | Pearson correlation coefficient |
| Normalized Bias | nbias | \(\sum\frac{y_t - \hat{y}_t}{y_t + \hat{y}_t}\) | Normalized bias error |

Several of these metrics are recent additions aligned with standard libraries (NeuralForecast, Darts, scikit-learn).

Skill Score#

The Skill Score measures relative improvement over a naive baseline:

\[SS = 1 - \frac{\text{metric}(y, \hat{y})}{\text{metric}(y, y_{\text{naive}})}\]

It is a standalone function (not part of get_pointwise_metrics) because it requires a baseline forecast:

from twiga.core.metrics import skill_score

ss = skill_score(y, y_hat, y_naive=seasonal_naive_forecast, metric="mae")
# ss > 0: model beats the baseline; ss = 0: no improvement; ss < 0: worse

Using Point Metrics#

from twiga.core.metrics.point import evaluate_point_forecast

metrics_df = evaluate_point_forecast(result, metric_names=["mae", "rmse", "r2", "smape", "corr"])

Available metric keys: "res", "mae", "mase", "mse", "msse", "rmse", "rmsse", "mape", "wmape", "smape", "ope", "corr", "nbias", "r2", "rmsle", "medae".

If metric_names is None, all available metrics are computed.
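For intuition, a few of these keys can be reproduced directly in NumPy. This is a minimal sketch of the formulas in the table above, not the library's implementation — the functions in twiga/core/metrics/point.py additionally handle axes, multi-target arrays, and edge cases:

```python
import numpy as np

def point_metrics(y, y_hat):
    """Illustrative NumPy versions of a few point-metric keys."""
    err = y - y_hat
    return {
        "res": err.mean(),                   # mean (signed) error
        "mae": np.abs(err).mean(),           # mean absolute error
        "rmse": np.sqrt((err ** 2).mean()),  # root mean squared error
        "smape": 200.0 * np.mean(np.abs(err) / (np.abs(y) + np.abs(y_hat))),
        "r2": 1.0 - (err ** 2).sum() / ((y - y.mean()) ** 2).sum(),
    }

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])
m = point_metrics(y, y_hat)
```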

Interval Forecast Metrics#

Defined in twiga/core/metrics/interval.py. Used via evaluate_interval_forecast().

| Metric | Key | Formula | Description |
|---|---|---|---|
| Winkler Score | winkler_score | \(\frac{1}{n}\sum_t\left[(U_t - L_t) + \frac{2}{\alpha}(L_t - y_t)\mathbf{1}_{y_t < L_t} + \frac{2}{\alpha}(y_t - U_t)\mathbf{1}_{y_t > U_t}\right]\) | Penalises both interval width and missed coverage |
| Coverage (PICP) | picp | \(\frac{1}{n}\sum_t \mathbf{1}[L_t \leq y_t \leq U_t]\) | Proportion of true values within prediction intervals |
| ACE | ace | \(PICP - (1 - \alpha)\) | Average Coverage Error: signed deviation from the nominal level |
| NMPI | nmpi | \(\frac{\text{median}(U_t - L_t)}{\text{scale}}\) | Normalised Median Prediction Interval width |
| CWE | cwe | \(\lambda \cdot \frac{(1+\beta^2)\,\gamma_{\text{picp}}\,\gamma_{\text{nmpi}}}{\beta^2\,\gamma_{\text{nmpi}} + \gamma_{\text{picp}}}\) | Coverage–Width–Error: harmonic mean of coverage and width scores, weighted by forecast accuracy |
| γ_picp | — | \(\min\left(\frac{\text{PICP}}{1-\alpha},\, 1\right)\) | Coverage adequacy: linear ramp from 0 to 1 as PICP reaches the target; saturates at 1 for over-coverage |
| γ_nmpi | — | \(\min\left(\left(\frac{\kappa/R}{\text{NMPI}}\right)^k,\, 1\right)\) | Width efficiency: full credit when NMPI ≤ reference; inverse penalty for excess width |
| MSIS | msis | \(\frac{\frac{1}{n}\sum_t\left[(U_t - L_t) + \frac{2}{\alpha}(L_t - y_t)\mathbf{1}_{y_t < L_t} + \frac{2}{\alpha}(y_t - U_t)\mathbf{1}_{y_t > U_t}\right]}{\frac{1}{T-s}\sum_{i=s+1}^{T}\lvert y_i - y_{i-s}\rvert}\) | Mean Scaled Interval Score — M4/M5 competition standard |
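The Winkler score and PICP formulas reduce to a few lines of NumPy. The sketch below assumes 1-D arrays and is illustrative only — the library versions also support the axis and multi-target cases:

```python
import numpy as np

def winkler_and_picp(y, lower, upper, alpha):
    """Sketch of the Winkler score and PICP formulas from the table above."""
    width = upper - lower
    # 2/alpha penalty for each observation that falls outside the interval
    winkler = width \
        + (2.0 / alpha) * np.where(y < lower, lower - y, 0.0) \
        + (2.0 / alpha) * np.where(y > upper, y - upper, 0.0)
    picp = np.mean((y >= lower) & (y <= upper))
    return winkler.mean(), picp

y = np.array([10.0, 12.0, 20.0])
lower = np.array([9.0, 11.0, 13.0])
upper = np.array([11.0, 13.0, 15.0])
w, picp = winkler_and_picp(y, lower, upper, alpha=0.1)
```

The third observation misses the interval by 5, so its Winkler term is the width (2) plus 2/0.1 × 5 = 100 — a single bad miss dominates the average.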

MSIS#

The Mean Scaled Interval Score is the standard metric for prediction interval comparison in the M4/M5 competitions. It penalizes width and missed coverage, normalized by the naive seasonal forecast MAE:

from twiga.core.metrics import msis

score = msis(true, lower, upper, y_train=training_series, alpha=0.1, seasonality=24)
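The scaling step can be sketched in a few lines: the mean Winkler-style interval score divided by the in-sample seasonal-naive MAE. This is an illustrative reduction of the formula, not the library's msis implementation:

```python
import numpy as np

def msis_sketch(y, lower, upper, y_train, alpha, m=1):
    """Minimal MSIS: mean interval score over the forecast period,
    scaled by the lag-m naive MAE of the training series."""
    width = upper - lower
    penalty = (2.0 / alpha) * (np.maximum(lower - y, 0.0) + np.maximum(y - upper, 0.0))
    interval_score = np.mean(width + penalty)
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return interval_score / naive_mae

score = msis_sketch(
    y=np.array([10.0]), lower=np.array([9.0]), upper=np.array([11.0]),
    y_train=np.array([1.0, 2.0, 3.0, 4.0]), alpha=0.5,
)
```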

CWE — Coverage–Width–Error#

CWE is the primary interval quality metric in Twiga. It uses the F-β harmonic mean of two sub-scores, weighted by point-forecast accuracy λ:

\[\text{CWE} = \lambda \cdot \frac{(1+\beta^2)\,\gamma_{\text{picp}}\,\gamma_{\text{nmpi}}}{\beta^2\,\gamma_{\text{nmpi}} + \gamma_{\text{picp}}}\]

where \(\lambda = 1 - \text{error}\) (e.g. \(1 - \text{NRMSE}\)).

Sub-scores:

\[\gamma_{\text{picp}} = \min\!\left(\frac{\text{PICP}}{1-\alpha},\; 1\right)\]

Linear ramp from 0 at PICP = 0 to 1 at PICP = 1 − α. Over-coverage (PICP > 1 − α) does not increase this score — it is captured instead by γ_nmpi penalising the wider intervals that caused it.

\[\gamma_{\text{nmpi}} = \min\!\left(\left(\frac{\kappa/R}{\text{NMPI}}\right)^k,\; 1\right)\]

Asymmetric width score. Full credit (1) for any NMPI ≤ κ/R (sharp intervals are not penalised here). For NMPI > κ/R the score falls as the inverse ratio raised to power k — doubling the width halves the score (k = 1 default). Increase k for stricter penalisation of excess width.

Harmonic mean handles the tension naturally: a poor score on either dimension pulls CWE toward zero regardless of the other. An over-wide interval (high NMPI) is penalised even when PICP is perfect.

Reference κ/R: computed from the target distribution as spread ÷ scale. Spread defaults to IQR (robust); scale defaults to range (max − min). Both are configurable:

from twiga.core.metrics.interval import get_interval_metrics

# Default: IQR spread, range scale — best for skewed/heavy-tailed energy data
df = get_interval_metrics(true, lower, upper, alpha=0.1)

# MAD spread (maximum robustness for bimodal data, e.g. solar with nights)
df = get_interval_metrics(true, lower, upper, alpha=0.1, spread="mad")

# std spread (only for approximately Gaussian targets)
df = get_interval_metrics(true, lower, upper, alpha=0.1, spread="std")

# Custom scale denominator
df = get_interval_metrics(true, lower, upper, alpha=0.1, nmpi_scale="mean")

# Pass your own κ directly (e.g. domain knowledge on expected width)
df = get_interval_metrics(true, lower, upper, alpha=0.1, true_nmpi=15.0)

| spread | κ formula | Best for |
|---|---|---|
| "iqr" (default) | q₇₅ − q₂₅ | General; robust to outliers and heavy tails |
| "mad" | median(\|y − median(y)\|) | Bimodal or heavily skewed targets |
| "std" | σ | Approximately Gaussian targets only |

| nmpi_scale | R formula | Best for |
|---|---|---|
| "range" (default) | max − min | General; translation-invariant |
| "max" | max | Non-negative data where min ≈ 0 (equivalent to range) |
| "mean" | mean | When NMPI should read as a fraction of mean load |
| "median" | median | Robust alternative to mean for skewed data |
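Putting the pieces together, the CWE composition can be sketched with the documented defaults (κ from the IQR, R from the range). The `error` argument stands in for the point-forecast error term λ = 1 − error; this is a sketch of the formulas above, not the library's internals:

```python
import numpy as np

def cwe_sketch(y, lower, upper, alpha, beta=1.0, k=1.0, error=0.0):
    """Illustrative CWE: F-beta harmonic mean of coverage and width scores."""
    picp = np.mean((y >= lower) & (y <= upper))
    q25, q75 = np.quantile(y, [0.25, 0.75])
    kappa = q75 - q25                        # spread: IQR (default)
    R = y.max() - y.min()                    # scale: range (default)
    nmpi = np.median(upper - lower) / R
    g_picp = min(picp / (1.0 - alpha), 1.0)              # coverage adequacy
    g_nmpi = min(((kappa / R) / nmpi) ** k, 1.0)         # width efficiency
    lam = 1.0 - error
    return lam * (1 + beta**2) * g_picp * g_nmpi / (beta**2 * g_nmpi + g_picp)

y = np.arange(10.0)
val = cwe_sketch(y, y - 1.0, y + 1.0, alpha=0.1)   # sharp, fully covering
val_wide = cwe_sketch(y, y - 6.0, y + 6.0, alpha=0.1)  # over-wide intervals
```

With perfect coverage and sharp intervals both sub-scores saturate at 1 and CWE equals λ; widening the intervals leaves coverage perfect but pulls γ_nmpi, and hence CWE, down.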

Using Interval Metrics#

from twiga.core.metrics.interval import evaluate_interval_forecast

metrics_df = evaluate_interval_forecast(result, alpha=0.1)
# Returns: timestamp | target | res | mae | ... | winkler_score | picp | ace | nmpi | cwe

# Interval-metric tuning with MAD spread
metrics_df = evaluate_interval_forecast(result, alpha=0.1, spread="mad", nmpi_scale="range")

Note

evaluate_interval_forecast() computes both point metrics and interval metrics for each forecast day. The alpha parameter should match the significance level used during conformal calibration. CWE is always in [0, 1] — higher is better.

Quantile Forecast Metrics#

Defined in twiga/core/metrics/quantile.py. Used via evaluate_quantile_forecast().

| Metric | Key | Formula | Description |
|---|---|---|---|
| Pinball Loss | pinball | \(\frac{1}{KT}\sum_{\tau,t}\rho_\tau(y_t, \hat{q}_{\tau,t})\), where \(\rho_\tau(y,q)=\max(\tau(y-q),\,(\tau-1)(y-q))\) | Mean pinball (check) loss across all quantile levels |
| WQL | wql | \(\frac{2\sum_\tau\sum_t\rho_\tau(y_t, \hat{q}_{\tau,t})}{\lvert\mathcal{T}\rvert \cdot \sum_t\lvert y_t\rvert}\) | Weighted Quantile Loss — pinball loss normalised by total observation volume |
| CRPS | crps | \(\int_{-\infty}^{\infty}(F(z) - \mathbf{1}[z \geq y])^2\,dz\) (piecewise-linear \(F\)) | CRPS approximated from quantile CDF |
| Calibration Error | calibration_error | \(\frac{1}{K}\sum_\tau\lvert\hat{F}(\tau) - \tau\rvert\) | Mean absolute deviation of empirical coverage from nominal level |
| Sharpness | sharpness | \(\frac{1}{\lvert\mathcal{P}\rvert}\sum_{(\tau_l,\tau_u)\in\mathcal{P}}\frac{1}{T}\sum_t(\hat{q}_{\tau_u,t} - \hat{q}_{\tau_l,t})\) | Mean width of symmetric quantile-pair intervals \(\mathcal{P}\) — lower is sharper |
| PIT | — | histogram of \(\hat{F}(y_t)\) | Probability Integral Transform — uniform shape indicates calibration |

Weighted Quantile Loss (WQL)#

WQL is the GluonTS/M5 standard aggregate quantile metric, normalized by total observation volume for cross-series comparability:

\[WQL = \frac{2 \sum_{\tau} \sum_{t} \rho_\tau(y_t, \hat{q}_{\tau,t})}{|\mathcal{T}| \cdot \sum_t |y_t|}\]
from twiga.core.metrics import wql

score = wql(true, quantile_preds, quantile_levels=[0.1, 0.5, 0.9], quantile_axis=0)
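The factor of 2 has a concrete interpretation: at τ = 0.5 the pinball loss is exactly half the absolute error, so doubling it makes the median-quantile term of WQL comparable to the MAE. A quick NumPy check (illustrative only):

```python
import numpy as np

def pinball(y, q, tau):
    """Pinball (check) loss at level tau — the building block of WQL."""
    d = y - q
    return np.mean(np.maximum(tau * d, (tau - 1.0) * d))

y = np.array([1.0, 3.0, 5.0])
q = np.array([2.0, 3.0, 3.0])
mae = np.mean(np.abs(y - q))
# 2 * pinball at the median equals the MAE
check = 2.0 * pinball(y, q, 0.5)
```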

Using Quantile Metrics#

from twiga.core.metrics.quantile import evaluate_quantile_forecast

metrics_df = evaluate_quantile_forecast(result, alpha=0.1)
# Columns: point metrics + picp | ace | nmpi | cwe | pinball | wql | crps | calibration_error | sharpness
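The calibration_error key compares the empirical coverage of each quantile level against its nominal level. A sketch of that computation, assuming the quantile predictions are stacked as (K, T) — the library infers the quantile axis from the result object:

```python
import numpy as np

def calibration_error(y, quantile_preds, levels):
    """Mean |empirical coverage - tau| over the K quantile levels."""
    empirical = np.array([np.mean(y <= q) for q in quantile_preds])
    return np.mean(np.abs(empirical - np.asarray(levels)))

y = np.arange(1.0, 11.0)
# a constant median forecast at 5.5 covers exactly half of y -> error 0
quantile_preds = np.full((1, 10), 5.5)
ce = calibration_error(y, quantile_preds, [0.5])
```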

Probabilistic Metrics#

Defined in twiga/core/metrics/parametric.py. Used via evaluate_parametric_forecast().

Metric

Key

Formula

Description

CRPS

crps

\(\mathbb{E}_F|X - y| - \tfrac{1}{2}\mathbb{E}_F|X - X'|\)

Sample-based CRPS — proper scoring rule generalising MAE to distributions

CRPS (PWM)

crps_pwm

(PWM estimator of the same quantity)

CRPS via Probability Weighted Moments — more stable for small sample counts

DSS

dss

\(\frac{(y_t - \mu_t)^2}{\sigma_t^2} + 2\log\sigma_t\)

Dawid-Sebastiani Score — proper rule jointly rewarding calibration and sharpness

Energy Score

energy_score

\(\mathbb{E}_F|\mathbf{X}-\mathbf{y}|_2 - \tfrac{1}{2}\mathbb{E}_F|\mathbf{X}-\mathbf{X}'|_2\)

Multivariate generalisation of CRPS for joint forecast distributions

Brier Score

brier_score

\(\frac{1}{n}\sum_t(\hat{p}_t - o_t)^2\)

Probabilistic binary event forecast accuracy

NLL

nll

\(-\frac{1}{n}\sum_t\log f(y_t\mid\theta_t)\)

Negative log-likelihood under the predictive distribution
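The sample-based CRPS estimator in the table is short enough to write out directly. A sketch for a single scalar observation (the library versions vectorise over the horizon and use the more stable PWM estimator when requested):

```python
import numpy as np

def crps_samples(y, samples):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| from an ensemble."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    # pairwise absolute differences between ensemble members
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

c_perfect = crps_samples(0.0, np.zeros(5))   # degenerate ensemble at y
c_spread = crps_samples(1.0, [0.0, 2.0])     # symmetric two-member ensemble
```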

Dawid-Sebastiani Score (DSS)#

DSS is a proper scoring rule for location-scale parametric forecasts, jointly rewarding sharpness and calibration:

\[DSS(y_t, \mu_t, \sigma_t) = \frac{(y_t - \mu_t)^2}{\sigma_t^2} + 2\log\sigma_t\]
from twiga.core.metrics.parametric import dawid_sebastiani_score

dss = dawid_sebastiani_score(true, loc=predicted_means, scale=predicted_stds)

Energy Score#

The Energy Score generalises CRPS to multivariate (joint horizon) forecast distributions:

\[ES(F, \mathbf{y}) = \mathbb{E}_F\|\mathbf{X} - \mathbf{y}\|_2 - \tfrac{1}{2}\mathbb{E}_F\|\mathbf{X} - \mathbf{X}'\|_2\]
from twiga.core.metrics.parametric import energy_score

# samples shape: (n_ensemble_members, forecast_horizon)
es = energy_score(true=true_vector, samples=ensemble_samples)

Brier Score#

Measures accuracy of probabilistic binary event forecasts (e.g., exceedance probability):

\[BS = \frac{1}{n}\sum_t(\hat{p}_t - o_t)^2 \quad \hat{p}_t = \frac{1}{M}\sum_i \mathbb{1}(x_{i,t} \leq \text{threshold})\]
from twiga.core.metrics.parametric import brier_score

bs = brier_score(true, samples=ensemble_samples, threshold=peak_demand)
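The event probability is estimated from the ensemble as the fraction of members on the event side of the threshold. A sketch following the formula above (using the \(y_t \leq\) threshold convention; the library's brier_score handles axes and validation):

```python
import numpy as np

def brier_from_samples(y, samples, threshold):
    """Brier score with event probabilities estimated from an ensemble."""
    p_hat = np.mean(samples <= threshold, axis=0)    # shape (n_obs,)
    o = (y <= threshold).astype(float)               # observed event indicator
    return np.mean((p_hat - o) ** 2)

y = np.array([1.0, 5.0])
# an ensemble whose members all equal the truth scores a perfect 0
samples = np.tile(y, (4, 1))
bs = brier_from_samples(y, samples, threshold=3.0)
```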

Using Probabilistic Metrics#

from twiga.core.metrics.parametric import evaluate_parametric_forecast

# Sample-based forecasts (CRPS + Energy Score)
metrics_df = evaluate_parametric_forecast(samples_result)

# Parametric forecasts (NLL + DSS)
metrics_df = evaluate_parametric_forecast(parametric_result)

Or via the low-level helper for direct array inputs:

from twiga.core.metrics.parametric import get_probabilistic_metrics

# Parametric mode (loc + scale)
metrics_df = get_probabilistic_metrics(true, loc=mu, scale=sigma)

# Sample-based mode
metrics_df = get_probabilistic_metrics(true, samples=ensemble_samples)

Analytical Parametric Scoring#

Defined in twiga/core/metrics/parametric.py. Closed-form implementations for common parametric distributions — no Monte Carlo sampling needed.

| Metric | Key | Distribution | Formula | Description |
|---|---|---|---|---|
| Normal NLL | normal_nll | \(\mathcal{N}(\mu,\sigma^2)\) | \(\frac{1}{n}\sum_t\left[\frac{1}{2}\log(2\pi\sigma_t^2) + \frac{(y_t-\mu_t)^2}{2\sigma_t^2}\right]\) | Negative log-likelihood under a Normal predictive distribution |
| Normal CRPS | normal_crps | \(\mathcal{N}(\mu,\sigma^2)\) | \(\sigma\left[z(2\Phi(z)-1) + 2\phi(z) - \tfrac{1}{\sqrt{\pi}}\right],\; z=\tfrac{y-\mu}{\sigma}\) | Exact analytical CRPS — no quantile approximation needed |
| Laplace NLL | laplace_nll | \(\text{Laplace}(\mu,b)\) | \(\frac{1}{n}\sum_t\left[\log(2b_t) + \frac{\lvert y_t-\mu_t\rvert}{b_t}\right]\) | Negative log-likelihood under a Laplace predictive distribution |

from twiga.core.metrics.parametric import normal_nll, normal_crps, laplace_nll

nll = normal_nll(true, loc=mu, scale=sigma)
crps = normal_crps(true, loc=mu, scale=sigma)   # exact closed form

The analytical normal_crps uses:

\[CRPS(\mathcal{N}(\mu,\sigma^2), y) = \sigma\left[z(2\Phi(z)-1) + 2\phi(z) - \frac{1}{\sqrt{\pi}}\right], \quad z = \frac{y-\mu}{\sigma}\]
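For a scalar case, the closed form needs only the standard Normal pdf and cdf, which the standard library's erf provides. A sketch of what normal_crps evaluates (the real function is vectorised over arrays):

```python
import math

def normal_crps_closed_form(y, mu, sigma):
    """Analytical CRPS of a Normal(mu, sigma^2) forecast at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(z)
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

value = normal_crps_closed_form(0.0, 0.0, 1.0)
```

At \(y = \mu\) this reduces to \(\sigma(2\phi(0) - 1/\sqrt{\pi}) \approx 0.2337\sigma\), and CRPS scales linearly in \(\sigma\) — sharper forecasts score strictly better when centred correctly.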

Statistical Significance Testing#

Defined in twiga/core/metrics/stats.py.

Both tests share the same interface: they accept raw forecast arrays (p_real, p_pred_1, p_pred_2) of shape (n_days, n_steps_per_day) and return a one-sided p-value. A small p-value (< 0.05) means p_pred_2 is significantly more accurate than p_pred_1.

| Test | Key | Statistic | Formula | What it tests |
|---|---|---|---|---|
| DM | diebold_mariano_test | \(\mathcal{N}(0,1)\) | \(DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{\text{HAC}} / T}}\) | Unconditional equal predictive accuracy |
| GW | giacomini_white_test | \(\chi^2(q)\) | \(T \cdot R^2\) from OLS on instruments \(h_t = [1, d_{t-1}]\) | Conditional predictive accuracy (exploits information set) |

Diebold-Mariano (DM) Test#

Tests unconditional equal accuracy. The loss differential \(d_t = g(e_{1,t}) - g(e_{2,t})\) is tested against zero using a HAC Bartlett variance estimator to handle serial correlation in multi-step forecasts:

\[DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{\text{HAC}} / T}} \xrightarrow{d} \mathcal{N}(0,1)\]
from twiga.core.metrics import diebold_mariano_test

# Multivariate (single test, averaging across steps)
p = diebold_mariano_test(p_real, p_pred_1, p_pred_2, norm=1, version="multivariate", h=24)

# Univariate (one test per step, e.g. per hour in a 24h horizon)
p_per_hour = diebold_mariano_test(p_real, p_pred_1, p_pred_2, norm=1, version="univariate", h=24)
# p_per_hour.shape == (24,)
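The mechanics of the test fit in a short sketch: compute the loss differential, estimate the HAC variance of its mean with a Bartlett kernel over h − 1 lags, and read a one-sided p-value off the Normal tail. This simplified version (absolute-error loss, i.e. norm=1) is illustrative, not the library implementation:

```python
import math
import numpy as np

def dm_sketch(y, pred1, pred2, h=1):
    """Simplified one-sided Diebold-Mariano test on |error| loss."""
    d = np.abs(y - pred1) - np.abs(y - pred2)   # loss differential d_t
    T = d.size
    d_bar = d.mean()
    # Bartlett-kernel HAC variance of d (h-1 autocovariance lags)
    var = np.mean((d - d_bar) ** 2)
    for lag in range(1, h):
        w = 1.0 - lag / h
        var += 2.0 * w * np.mean((d[lag:] - d_bar) * (d[:-lag] - d_bar))
    dm = d_bar / math.sqrt(var / T)
    # one-sided p-value: small when pred2 is significantly more accurate
    return 1.0 - 0.5 * (1.0 + math.erf(dm / math.sqrt(2.0)))

rng = np.random.default_rng(0)
y = rng.normal(size=500)
good = y + rng.normal(scale=0.1, size=500)
bad = y + rng.normal(scale=1.0, size=500)
p_better = dm_sketch(y, bad, good)   # good beats bad -> p near 0
p_worse = dm_sketch(y, good, bad)    # reversed ordering -> p near 1
```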

Giacomini-White (GW) Test#

Tests conditional predictive accuracy — it checks whether one forecast systematically exploits information the other ignores. Uses instruments \(h_t = [1, d_{t-1}]\) in an OLS regression; the test statistic is \(T \cdot R^2 \sim \chi^2(q)\):

from twiga.core.metrics import giacomini_white_test

# Multivariate (single joint test)
p = giacomini_white_test(p_real, p_pred_1, p_pred_2, norm=1, version="multivariate")

# Univariate (per-step tests)
p_per_hour = giacomini_white_test(p_real, p_pred_1, p_pred_2, norm=1, version="univariate")

When to use which test:

  • Use DM for a straightforward comparison of two forecasts without conditioning on past information.

  • Use GW when you suspect one model exploits time-varying information (e.g., regime changes, calendar effects) that the other does not — GW is more powerful in these situations.

Visualising the test matrix#

Both functions return arrays when version="univariate", making it straightforward to build a model comparison heatmap:

import pandas as pd
from twiga.core.metrics import diebold_mariano_test

models = {"Model A": pred_a, "Model B": pred_b, "Naive": pred_naive}
p_matrix = pd.DataFrame(index=list(models), columns=list(models), dtype=float)

for m1, p1 in models.items():
    for m2, p2 in models.items():
        if m1 == m2:
            p_matrix.loc[m1, m2] = 1.0
        else:
            p_matrix.loc[m1, m2] = diebold_mariano_test(p_real, p1, p2, version="multivariate")

Unified Dispatch#

Use evaluate_forecast() as the single entry point — it dispatches automatically based on result.kind:

from twiga.core.metrics import evaluate_forecast

metrics_df = evaluate_forecast(result)          # auto-dispatches
metrics_df = evaluate_forecast(result, alpha=0.1)  # for interval/quantile kinds

Supported kind values:

| kind | Evaluate function | Key metrics |
|---|---|---|
| "point" | evaluate_point_forecast | MAE, RMSE, R², MedAE, RMSLE, … |
| "interval" | evaluate_interval_forecast | point metrics + PICP, ACE, NMPI, MSIS, CWE |
| "quantile" | evaluate_quantile_forecast | point + interval + pinball, WQL, CRPS, sharpness |
| "samples" | evaluate_parametric_forecast | point metrics + CRPS, Energy Score |
| "parametric" | evaluate_parametric_forecast | point metrics + NLL, DSS |
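The dispatch-table pattern described above can be sketched in a few lines. The names here are illustrative, not the library's actual internals — evaluate_forecast() documents only that an internal table keyed on kind selects the evaluate function and raises ValueError otherwise:

```python
# Hypothetical sketch of a dispatch table keyed on forecast kind.
_DISPATCH = {
    "point": lambda result, **kw: f"point metrics for {result}",
    "interval": lambda result, **kw: f"interval metrics for {result}",
}

def evaluate(kind, result, **kwargs):
    try:
        return _DISPATCH[kind](result, **kwargs)
    except KeyError:
        raise ValueError(f"unsupported kind: {kind!r}") from None
```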

API Reference#

twiga.core.metrics.evaluate_forecast(result, **kwargs)#

Dispatch forecast evaluation to the kind-appropriate function.

This is the single entry point for evaluation so callers do not need to know or import individual evaluate functions. The correct function is selected from an internal dispatch table keyed on kind.

Parameters:
  • result (ForecastResult) – A ForecastResult whose ground_truth field is set. The kind attribute determines which evaluate function is called.

  • **kwargs (object) – Extra keyword arguments forwarded to the underlying evaluate function (e.g. alpha for interval or quantile forecasts).

Return type:

DataFrame

Returns:

DataFrame of per-day, per-target metrics indexed by daily timestamp.

Raises:

ValueError – If result.ground_truth is None or result.kind is not one of the supported values.

Examples

>>> result = ForecastResult(kind=ForecastKind.POINT, ground_truth=y, ...)
>>> metrics_df = evaluate_forecast(result)
twiga.core.metrics.point.evaluate_point_forecast(result, metric_names=None, axis=1)#

Evaluate point forecasts by computing daily pointwise metrics.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth set, kind=ForecastKind.POINT.

  • metric_names (list[str] | None) – Metric names to compute. When None all supported point metrics are computed.

  • axis (int | None) – Axis along which to compute aggregate metrics. If None, metrics that require an axis will use their default behavior.

Return type:

DataFrame

Returns:

DataFrame of per-day, per-target metrics indexed by daily timestamp.

twiga.core.metrics.point.r_squared(y, y_hat, axis=0)#

Calculate the Coefficient of Determination (R²) between actual and predicted values.

R² measures the proportion of variance in the actual values explained by the predictions: \[R^2 = 1 - \frac{\sum_{t=1}^{T} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2}\] where \(\bar{y}\) is the mean of the actual values. A perfect forecast yields \(R^2 = 1\); a constant mean prediction yields \(R^2 = 0\); worse-than-mean predictions can yield negative values.

Parameters:
  • y (TypeAliasType) – Actual values, shape (n_samples,) or (n_samples, n_features).

  • y_hat (TypeAliasType) – Predicted values, same shape as y.

  • axis (int | None) – Axis along which to compute R². If None, flatten arrays. Defaults to 0.

Return type:

float | ndarray

Returns:

R² coefficient, scalar if 1D input, array if multi-dimensional. Returns NaN if the variance of y is zero.

Raises:

ValueError – If y and y_hat have incompatible shapes.

Examples

>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> y_hat = np.array([1.1, 2.0, 2.9, 4.1, 5.0])
>>> r_squared(y, y_hat)
0.997
twiga.core.metrics.point.rmsle(y, y_hat, axis=0)#

Calculate the Root Mean Squared Log Error (RMSLE).

The RMSLE penalises under-predictions more than over-predictions and is scale-invariant on a log scale: \[RMSLE = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \bigl(\log(1 + \hat{y}_t) - \log(1 + y_t)\bigr)^2}\]

Parameters:
  • y (TypeAliasType) – Actual values, must be non-negative, shape (n_samples,) or (n_samples, n_features).

  • y_hat (TypeAliasType) – Predicted values, must be non-negative, same shape as y.

  • axis (int | None) – Axis along which to compute the mean. If None, flatten arrays. Defaults to 0.

Return type:

float | ndarray

Returns:

RMSLE, scalar if 1D input, array if multi-dimensional.

Raises:

ValueError – If any value in y or y_hat is negative.

Examples

>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0])
>>> y_hat = np.array([1.1, 1.9, 3.2])
>>> rmsle(y, y_hat)
0.0443...
twiga.core.metrics.point.medae(y, y_hat, axis=0)#

Calculate the Median Absolute Error (MedAE) between actual and predicted values.

MedAE is robust to outliers, unlike MAE: \[MedAE = \text{median}\bigl(\lvert y_t - \hat{y}_t \rvert\bigr)\]

Parameters:
  • y (TypeAliasType) – Actual values, shape (n_samples,) or (n_samples, n_features).

  • y_hat (TypeAliasType) – Predicted values, same shape as y.

  • axis (int | None) – Axis along which to compute the median. If None, flatten arrays. Defaults to 0.

Return type:

float | ndarray

Returns:

Median absolute error, scalar if 1D input, array if multi-dimensional.

Raises:

ValueError – If y and y_hat have incompatible shapes.

Examples

>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 100.0])
>>> y_hat = np.array([1.1, 2.0, 2.9, 90.0])
>>> medae(y, y_hat)
0.1
twiga.core.metrics.point.skill_score(y, y_hat, y_naive, metric='mae', axis=0)#

Calculate the Skill Score of a forecast relative to a naive baseline.

The Skill Score quantifies the relative improvement of a model over a naive baseline forecast: \[SS = 1 - \frac{\text{metric}(y, \hat{y})}{\text{metric}(y, y_{\text{naive}})}\] A positive score indicates improvement over the baseline; 0 means no improvement; negative values indicate the model is worse than the baseline.

Parameters:
  • y (TypeAliasType) – Actual values, shape (n_samples,).

  • y_hat (TypeAliasType) – Model predictions, same shape as y.

  • y_naive (TypeAliasType) – Naive baseline predictions (e.g. seasonal naïve, persistence), same shape as y.

  • metric (str) – Error metric to use for comparison. One of 'mae', 'mse', 'rmse'. Defaults to 'mae'.

  • axis (int | None) – Axis along which to compute the metric. If None, flatten arrays. Defaults to 0.

Return type:

float | ndarray

Returns:

Skill score(s). Scalar if 1D input or axis=None, otherwise array.

Raises:
  • ValueError – If metric is not one of the supported values.

  • ValueError – If the naive baseline metric is zero.

Examples

>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 4.0])
>>> y_hat = np.array([1.1, 2.1, 3.0, 4.0])
>>> y_naive = np.array([1.0, 1.0, 1.0, 1.0])
>>> skill_score(y, y_hat, y_naive, metric="mae")
0.9666...
twiga.core.metrics.interval.evaluate_interval_forecast(result, alpha=0.01, true_nmpi=None, spread='iqr', nmpi_scale='range', axis=1, metric_names=None)#

Evaluate interval forecasts by computing daily point and interval metrics.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth, lower, and upper set, kind=ForecastKind.INTERVAL.

  • alpha (float) – Significance level used for Winkler score and coverage computations. Must be in (0, 1). Defaults to 0.01.

  • true_nmpi (float | None) – Override for κ — absolute spread of the target used as the CWE reference numerator. When None, derived from spread.

  • spread (Literal['iqr', 'mad', 'std']) – Spread measure for the CWE reference κ. "iqr" (default), "mad", or "std". See get_interval_metrics().

  • nmpi_scale (Literal['range', 'max', 'mean', 'median']) – Denominator R for NMPI and κ/R. "range" (default), "max", "mean", or "median".

  • axis (int | None) – Axis along which to compute aggregate metrics.

  • metric_names (list[str] | None) – List of interval metric names to compute.

Return type:

DataFrame

Returns:

DataFrame of per-day, per-target point and interval metrics indexed by daily timestamp.

twiga.core.metrics.interval.msis(true, lower, upper, y_train, alpha, seasonality=1, axis=None)#

Calculate the Mean Scaled Interval Score (MSIS).

\[\text{MSIS} = \frac{\text{Mean}(W_t)}{\text{NaiveMAE}_{\text{seasonal}}}\]

Parameters:
  • true (TypeAliasType) – True values for forecast period.

  • lower (TypeAliasType) – Lower bounds.

  • upper (TypeAliasType) – Upper bounds.

  • y_train (TypeAliasType) – Historical training data.

  • alpha (float) – Significance level.

  • seasonality (int) – Seasonal period.

  • axis (int | None) – Axis for Winkler score averaging.

Return type:

float | ndarray

Returns:

MSIS (scalar or array).

twiga.core.metrics.quantile.evaluate_quantile_forecast(result, alpha=0.01, true_nmpi=None, metric_names=None, quantile_axis=1, axis=1)#

Evaluate quantile forecasts by computing daily point, interval, and quantile metrics.

This function loops over days and targets (using _evaluate_loop) and aggregates point metrics (from get_pointwise_metrics()), interval metrics (from get_interval_metrics()), and quantile metrics (from get_quantile_metrics()) into a single DataFrame.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth, quantiles, and quantile_levels set, and kind=ForecastKind.QUANTILE.

  • alpha (float) – Significance level for interval metrics; the outermost quantile pair (first and last) is used as lower/upper bounds. Defaults to 0.01.

  • true_nmpi (float | None) – Optional reference value for NMPI normalisation. When None the standard deviation of the true values is used.

  • metric_names (list[str] | None) – Point metric names to compute. If None, all supported point metrics are computed.

  • quantile_axis (int) – Axis of the quantile predictions that holds the quantiles. Defaults to 1.

  • axis (int) – Axis or axes passed to get_quantile_metrics() (aggregation over the forecast horizon). Defaults to 1.

Return type:

DataFrame

Returns:

DataFrame of per‑day, per‑target metrics indexed by daily timestamp. Columns include point metrics, interval metrics, and quantile metrics.

twiga.core.metrics.quantile.wql(true, quantile_preds, quantile_levels, quantile_axis=None, axis=1, double_factor=True, normalize_by_volume=True)#

Weighted Quantile Loss (WQL), also known as the scaled pinball loss.

WQL is the GluonTS / M5 standard aggregate quantile metric:

\[\text{WQL} = \frac{2 \mathbf{1}_{\text{double}} \sum_{\tau,t} \rho_\tau(y_t, \hat{q}_{\tau,t})} {|\mathcal{T}| \cdot \sum_t |y_t| \mathbf{1}_{\text{normalize}}}\]

where \(\rho_\tau\) is the pinball loss. The factor \(2\) is often included to make the loss at \(\tau=0.5\) equal to the MAE.

Parameters:
  • true (ndarray) – Ground truth values, any shape.

  • quantile_preds (ndarray) – Predicted quantiles. Must have one axis holding the quantiles; the remaining dimensions must match true.shape after axis movement.

  • quantile_levels (ndarray | list[float] | tuple[float, ...]) – Quantile levels \(\tau \in (0,1)\), strictly increasing.

  • quantile_axis (int | None) – Axis of quantile_preds that holds the quantiles. If None, it is inferred (first axis if its length matches, otherwise second).

  • axis (int) – Axis or axes along which to aggregate the WQL. The total volume (sum of \(|y|\)) is computed over this axis, and the pinball loss is summed over the same axis. If None, aggregate over all dimensions and return a scalar.

  • double_factor (bool) – Whether to multiply by \(2\) (default True).

  • normalize_by_volume (bool) – Whether to divide by the total absolute volume \(\sum_t |y_t|\) (default True).

Return type:

float | ndarray

Returns:

If axis is None, a scalar WQL. Otherwise, an array with the same shape as true after removing the dimensions specified by axis.

Raises:

ValueError – If inputs are empty, levels out of range, shape mismatch, or total observation volume is zero when normalize_by_volume is True.

References

Alexandrov et al. (2020). GluonTS: Probabilistic and neural time series modeling in Python. Journal of Machine Learning Research, 21(116), 1–6.

twiga.core.metrics.parametric.evaluate_parametric_forecast(result, metric_names=None)#

Evaluate parametric forecasts by computing daily point and NLL metrics.

The NLL is computed under a Normal distribution assumption when result.log_likelihood is absent. Pass pre-computed log-likelihood values via that attribute to override this behaviour.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth set and kind=ForecastKind.SAMPLES or ForecastKind.PARAMETRIC.

  • metric_names (list[str] | None) – Probabilistic metric names to compute. When None, all metrics available for the forecast kind are computed.
Return type:

DataFrame

Returns:

DataFrame of per-day, per-target point metrics and NLL indexed by daily timestamp.

Raises:

ValueError – If any scale value is non-positive.

twiga.core.metrics.parametric.get_probabilistic_metrics(true, loc=None, scale=None, samples=None, log_likelihood=None, metric_names=None, axis=1)#

Compute probabilistic and pointwise metrics for forecasts.

Supports two modes:

  • Parametric (Normal distribution): provide loc and scale.

  • Sample‑based (ensemble): provide samples.

If loc is provided, pointwise metrics (MAE, RMSE, etc.) are also computed via get_pointwise_metrics().

Parameters:
  • true (TypeAliasType) – Ground‑truth values, any shape.

  • loc (TypeAliasType | None) – Point predictions (e.g., mean of the predictive distribution). Required for parametric mode; if given, pointwise metrics are added.

  • scale (TypeAliasType | None) – Predicted standard deviations (parametric Normal). Must be >0. Required for parametric mode.

  • samples (TypeAliasType | None) – Ensemble/posterior samples, shape (n_samples, *true.shape). Required for sample‑based mode.

  • log_likelihood (TypeAliasType | None) – Pre‑computed log‑likelihood values (optional). If provided, overrides NLL computation in parametric mode.

  • metric_names (list[str] | None) – List of probabilistic metric names to compute. If None, all available metrics for the chosen mode are computed. Supported names — parametric: 'nll', 'crps', 'dss'; sample‑based: 'crps', 'energy_score', 'crps_pwm', 'brier_score'.

  • axis (int | None) – Axis along which to aggregate the metrics (e.g., over the forecast horizon). If None, aggregate over all dimensions (scalar result). Defaults to 1.

Return type:

DataFrame

Returns:

Single‑row DataFrame containing all pointwise metrics (if loc was provided) plus the selected probabilistic metrics.

Raises:

ValueError – If neither scale nor samples is provided, or if required inputs are missing for a mode, or if scale contains non‑positive values.

twiga.core.metrics.parametric.dawid_sebastiani_score(true, loc, scale, axis=None)#

Compute the Dawid-Sebastiani Score (DSS) for parametric Normal forecasts.

The Dawid-Sebastiani Score under a Normal distribution is:

\[\text{DSS}(y_t, \mu_t, \sigma_t) = \frac{(y_t - \mu_t)^2}{\sigma_t^2} + 2 \log \sigma_t\]

Lower values indicate better forecasts.

Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • loc (TypeAliasType) – Predicted means.

  • scale (TypeAliasType) – Predicted standard deviations (\(\sigma > 0\)).

  • axis (int | None) – Axis along which to compute the mean DSS. If None (default), the mean is taken over the entire array.

Return type:

ndarray

Returns:

Mean DSS as a NumPy array (scalar when axis=None).

Raises:

ValueError – If any scale value is non-positive.

References

Dawid & Sebastiani (1999). Coherent dispersion criteria for optimal experimental design. The Annals of Statistics.

twiga.core.metrics.parametric.energy_score(true, samples, axis=None)#

Compute the Energy Score (ES) for ensemble forecasts.

The Energy Score is the multivariate generalization of CRPS:

\[\text{ES}(F, \mathbf{y}) = \mathbb{E}_F\|\mathbf{X} - \mathbf{y}\|_2 - \tfrac{1}{2} \mathbb{E}_F\|\mathbf{X} - \mathbf{X}'\|_2\]
Parameters:
  • true (TypeAliasType) – Ground-truth values, shape (n_obs,) or higher.

  • samples (TypeAliasType) – Ensemble samples, shape (n_samples, n_obs) or compatible.

  • axis (int | None) – Axis along which to average (usually the observation axis). If None (default), computes globally.

Return type:

ndarray

Returns:

Energy Score as a NumPy array (scalar when axis=None). Lower is better.

References

Gneiting & Raftery (2007). Strictly proper scoring rules. JASA.
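
The two expectations can be estimated directly from the ensemble. The sketch below is an illustrative \(O(m^2)\) estimator for samples of shape (n_samples, d), not the twiga implementation:

```python
import numpy as np

def energy_score_sketch(true, samples):
    """Energy Score from an ensemble (illustrative O(m^2) estimator)."""
    true = np.atleast_1d(true)                       # shape (d,)
    # E_F ||X - y||: mean distance from each member to the observation
    term1 = np.linalg.norm(samples - true, axis=-1).mean()
    # E_F ||X - X'||: mean pairwise distance between members
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = np.linalg.norm(diffs, axis=-1).mean()
    return term1 - 0.5 * term2
```

A degenerate ensemble whose members all equal the observation scores exactly 0, the minimum of this strictly proper rule.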

twiga.core.metrics.parametric.brier_score(true, samples, threshold, axis=None)#

Compute the Brier Score for a probabilistic binary event forecast.

The Brier Score is:

\[\text{BS} = \frac{1}{n} \sum_{t=1}^{n} \left( \hat{p}_t - o_t \right)^2\]

where \(\hat{p}_t\) is the predicted probability that \(y_t \leq\) threshold (the fraction of ensemble members at or below the threshold) and \(o_t = \mathbb{1}\{y_t \leq \text{threshold}\}\) is the observed binary outcome.

Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • samples (TypeAliasType) – Ensemble samples, shape (n_samples, n_obs).

  • threshold (float) – Event threshold.

  • axis (int | None) – Axis along which to compute the mean Brier Score. If None (default), the mean is taken over all observations.

Return type:

ndarray

Returns:

Brier Score as a NumPy array (scalar when axis=None). Lower is better.

References

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review.
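
The sample-based estimator is a one-liner per term; the sketch below illustrates the formula above and is not the twiga implementation:

```python
import numpy as np

def brier_score_sketch(true, samples, threshold):
    """Brier Score for the event {y <= threshold} (illustrative)."""
    # Forecast probability: fraction of ensemble members at/below threshold
    p_hat = (samples <= threshold).mean(axis=0)
    # Observed outcome: 1 if the event occurred, else 0
    o = (true <= threshold).astype(float)
    return np.mean((p_hat - o) ** 2)
```

For example, an ensemble split evenly around the threshold gives \(\hat{p} = 0.5\), so the score is 0.25 regardless of the outcome.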

twiga.core.metrics.parametric.normal_nll(true, loc, scale, axis=None)#

Negative log-likelihood under a Normal predictive distribution.

The negative log-likelihood (NLL) is:

\[\text{NLL} = \frac{1}{n} \sum_{t=1}^{n} \left[ \frac{1}{2} \log(2\pi \sigma_t^2) + \frac{(y_t - \mu_t)^2}{2\sigma_t^2} \right]\]
Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • loc (TypeAliasType) – Predicted means (\(\mu\)).

  • scale (TypeAliasType) – Predicted standard deviations (\(\sigma > 0\)).

  • axis (int | None) – Axis along which to compute the mean. If None (default), the mean is taken over the entire array.

Return type:

ndarray

Returns:

Negative log-likelihood as a NumPy array (scalar when axis=None).

Raises:

ValueError – If any scale value is non-positive.

Examples

>>> import numpy as np
>>> normal_nll(np.array([1.0, 2.0]), np.array([1.1, 1.9]), np.array([0.5, 0.5]))
array(0.2457...)
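
The value can be checked term by term against the formula above. The sketch below is an illustrative re-derivation, not the twiga implementation:

```python
import numpy as np

def normal_nll_sketch(true, loc, scale):
    """Mean Gaussian NLL, term by term from the formula above (illustrative)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * scale ** 2)
                   + (true - loc) ** 2 / (2.0 * scale ** 2))
```
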
twiga.core.metrics.parametric.normal_crps(true, loc, scale, axis=None)#

Analytical Continuous Ranked Probability Score (CRPS) for a Normal predictive distribution.

The closed-form CRPS under \(\mathcal{N}(\mu, \sigma^2)\) is:

\[\text{CRPS}(\mathcal{N}(\mu, \sigma^2), y) = \sigma \left[ z \left( 2\Phi(z) - 1 \right) + 2\phi(z) - \frac{1}{\sqrt{\pi}} \right]\]

where \(z = \frac{y - \mu}{\sigma}\), and \(\Phi\), \(\phi\) are the standard Normal CDF and PDF respectively.

Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • loc (TypeAliasType) – Predicted means (\(\mu\)).

  • scale (TypeAliasType) – Predicted standard deviations (\(\sigma > 0\)).

  • axis (int | None) – Axis along which to compute the mean CRPS. If None (default), the mean is taken over the entire array.

Return type:

ndarray

Returns:

Mean CRPS as a NumPy array (scalar when axis=None). Lower is better.

Raises:

ValueError – If any scale value is non-positive.

References

Gneiting, T., Raftery, A. E., Westveld, A. H., & Goldman, T. (2005). Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133(5), 1098–1118.

Examples

>>> import numpy as np
>>> normal_crps(np.array([1.0, 2.0]), np.array([1.0, 2.0]), np.array([1.0, 1.0]))
array(0.2336...)   # σ(√2 − 1)/√π when y = μ
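
The closed form is straightforward to evaluate with SciPy's standard Normal CDF and PDF; the sketch below illustrates the formula above and is not the twiga implementation:

```python
import numpy as np
from scipy import stats

def normal_crps_sketch(true, loc, scale):
    """Mean closed-form Gaussian CRPS (illustrative)."""
    z = (true - loc) / scale
    crps = scale * (z * (2.0 * stats.norm.cdf(z) - 1.0)
                    + 2.0 * stats.norm.pdf(z)
                    - 1.0 / np.sqrt(np.pi))
    return np.mean(crps)
```

When \(y = \mu\) the expression reduces to \(\sigma(\sqrt{2} - 1)/\sqrt{\pi} \approx 0.2337\,\sigma\), the irreducible CRPS of a correctly centred Normal forecast.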
twiga.core.metrics.parametric.laplace_nll(true, loc, scale, axis=None)#

Negative log-likelihood under a Laplace predictive distribution.

The negative log-likelihood (NLL) is:

\[\text{NLL} = \frac{1}{n} \sum_{t=1}^{n} \left[ \log(2b_t) + \frac{|y_t - \mu_t|}{b_t} \right]\]

where \(b_t\) is the scale parameter of the Laplace distribution.

Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • loc (TypeAliasType) – Predicted means (\(\mu\)).

  • scale (TypeAliasType) – Predicted Laplace scale (\(b > 0\)).

  • axis (int | None) – Axis along which to compute the mean NLL. If None (default), the mean is taken over the entire array.

Return type:

ndarray

Returns:

Mean NLL as a NumPy array (scalar when axis=None).

Raises:

ValueError – If any scale value is non-positive.

Examples

>>> import numpy as np
>>> laplace_nll(np.array([1.0, 2.0]), np.array([1.1, 1.9]), np.array([0.5, 0.5]))
array(0.2)
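
As with the Gaussian case, the value follows directly from the formula above. The sketch below is an illustrative re-derivation, not the twiga implementation:

```python
import numpy as np

def laplace_nll_sketch(true, loc, scale):
    """Mean Laplace NLL, term by term from the formula above (illustrative)."""
    return np.mean(np.log(2.0 * scale) + np.abs(true - loc) / scale)
```

Note that with \(b = 0.5\) the \(\log(2b)\) term vanishes, leaving only the scaled absolute error.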
twiga.core.metrics.stats.diebold_mariano_test(p_real, p_pred_1, p_pred_2, norm=1, version='multivariate', h=1)#

One-sided Diebold-Mariano (DM) test for equal predictive accuracy.

Tests the null hypothesis H0 that forecast p_pred_1 has equal or lower loss than forecast p_pred_2 against the one-sided alternative H1 that p_pred_2 is strictly more accurate. Rejecting H0 (small p-value) means p_pred_2 is significantly more accurate.

The loss differential is \(d_t = g(e_{1,t}) - g(e_{2,t})\) where \(g(e) = |e|^{norm}\) (norm=1 → MAE-based, norm=2 → MSE-based). The test statistic

\[DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{\text{HAC}} / T}} \xrightarrow{d} \mathcal{N}(0,1)\]

uses a Bartlett HAC variance estimator with bandwidth h - 1 to account for serial correlation in multi-step forecasts.

Parameters:
  • p_real (ndarray) – Observed values, shape (n_days, n_steps) or (T,).

  • p_pred_1 (ndarray) – Forecast 1 predictions, same shape as p_real.

  • p_pred_2 (ndarray) – Forecast 2 predictions, same shape as p_real.

  • norm (int) – Loss function norm. 1 for MAE-based, 2 for MSE-based. Defaults to 1.

  • version (str) –

    Test variant.

    • "multivariate": single test on the mean loss differential averaged across steps (default).

    • "univariate": independent test per time step (column), returns an array of p-values of length n_steps.

  • h (int) – Forecast horizon used to set the HAC bandwidth. Set to the number of steps ahead you are forecasting. Defaults to 1.

Return type:

float | ndarray

Returns:

One-sided p-value (float) for "multivariate", or 1-D array of p-values (one per step) for "univariate".

References

Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253–263.

Examples

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> real = rng.normal(0, 1, (100, 24))
>>> pred1 = real + rng.normal(0, 0.5, real.shape)
>>> pred2 = real + rng.normal(0, 0.3, real.shape)  # better forecast
>>> p = diebold_mariano_test(real, pred1, pred2, version="multivariate")
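
The statistic itself is short to compute. The sketch below applies the definitions above to a pair of 1-D error series (flattening the per-step structure) and is an illustration, not the twiga implementation:

```python
import numpy as np
from scipy import stats

def dm_pvalue_sketch(e1, e2, norm=1, h=1):
    """One-sided DM p-value from 1-D error series (illustrative)."""
    d = np.abs(e1) ** norm - np.abs(e2) ** norm   # loss differential d_t
    T = d.size
    dc = d - d.mean()
    # Bartlett HAC variance with bandwidth h - 1
    var = dc @ dc / T
    for k in range(1, h):
        var += 2.0 * (1.0 - k / h) * (dc[:-k] @ dc[k:]) / T
    dm = d.mean() / np.sqrt(var / T)
    # Reject H0 for large positive DM (model 2 has lower loss)
    return 1.0 - stats.norm.cdf(dm)
```

With h=1 the loop is skipped and the HAC estimator reduces to the ordinary sample variance of the loss differential.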
twiga.core.metrics.stats.giacomini_white_test(p_real, p_pred_1, p_pred_2, norm=1, version='multivariate')#

One-sided Giacomini-White (GW) test for Conditional Predictive Accuracy (CPA).

Tests the null hypothesis H0 that forecast p_pred_1 has equal or lower conditional expected loss than p_pred_2 against the one-sided alternative H1 that p_pred_2 is conditionally more accurate. Rejecting H0 (small p-value) means p_pred_2 is significantly more accurate given the available information set.

Unlike DM, the GW test conditions on lagged loss differentials, making it more powerful when one model systematically exploits information that the other misses. The instruments are \(h_t = [1, d_{t-1}]\), and the test statistic is \(T \cdot R^2 \sim \chi^2(q)\) where \(q\) is the number of instruments.

Parameters:
  • p_real (ndarray) – Observed values, shape (n_days, n_steps) or (T,).

  • p_pred_1 (ndarray) – Forecast 1 predictions, same shape as p_real.

  • p_pred_2 (ndarray) – Forecast 2 predictions, same shape as p_real.

  • norm (int) – Loss function norm. 1 for MAE-based, 2 for MSE-based. Defaults to 1.

  • version (str) –

    Test variant.

    • "multivariate": single test on the mean loss differential averaged across steps (default).

    • "univariate": independent test per time step, returns an array of p-values of length n_steps.

Return type:

float | ndarray

Returns:

One-sided p-value (float) for "multivariate", or 1-D array of p-values (one per step) for "univariate".

Raises:
  • ValueError – If input shapes do not match or are 1-D (use (T, 1) for a single-step series).

  • ValueError – If version is not "univariate" or "multivariate".

  • ValueError – If norm is not 1 or 2.

References

Giacomini, R., & White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6), 1545–1578.

Examples

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> real = rng.normal(0, 1, (100, 24))
>>> pred1 = real + rng.normal(0, 0.5, real.shape)
>>> pred2 = real + rng.normal(0, 0.3, real.shape)  # better forecast
>>> p = giacomini_white_test(real, pred1, pred2, version="multivariate")
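
The regression form of the statistic can be sketched as follows for a 1-D loss-differential series. This is an illustration of the \(T \cdot R^2 \sim \chi^2(q)\) construction with instruments \([1, d_{t-1}]\), not the twiga implementation; in particular, the sign adjustment that turns the CPA statistic into a one-sided test is omitted for brevity:

```python
import numpy as np
from scipy import stats

def gw_pvalue_sketch(e1, e2, norm=1):
    """Conditional predictive ability p-value (illustrative, two-sided)."""
    d = np.abs(e1) ** norm - np.abs(e2) ** norm      # loss differential d_t
    y = d[1:]
    H = np.column_stack([np.ones(y.size), d[:-1]])   # instruments [1, d_{t-1}]
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    resid = y - H @ beta
    r2 = 1.0 - (resid @ resid) / (y @ y)             # uncentered R^2
    stat = y.size * r2                               # T * R^2 ~ chi2(q)
    return 1.0 - stats.chi2.cdf(stat, df=H.shape[1])
```

If the loss differential is predictable from its own past (or has a nonzero mean), the regression fits well, the statistic grows with T, and the null of equal conditional predictive ability is rejected.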