Evaluation Metrics#
Source Files
- twiga/core/metrics/point.py
- twiga/core/metrics/interval.py
- twiga/core/metrics/quantile.py
- twiga/core/metrics/parametric.py
- twiga/core/metrics/stats.py
Twiga provides a comprehensive set of metrics for evaluating point forecasts, prediction intervals, quantile forecasts, and full probabilistic forecasts. All metrics operate on NumPy arrays and are accessible via evaluate_forecast().
Point Forecast Metrics#
Defined in twiga/core/metrics/point.py. Used via evaluate_point_forecast().
| Metric | Key | Formula | Description |
|---|---|---|---|
| Mean Error | `res` | \(\frac{1}{T}\sum(y_t - \hat{y}_t)\) | Average signed error (bias indicator) |
| MAE | `mae` | \(\frac{1}{T}\sum\lvert y_t - \hat{y}_t\rvert\) | Mean Absolute Error |
| MedAE | `medae` | \(\text{median}(\lvert y_t - \hat{y}_t\rvert)\) | Median Absolute Error — robust to outliers |
| MSE | `mse` | \(\frac{1}{T}\sum(y_t - \hat{y}_t)^2\) | Mean Squared Error |
| RMSE | `rmse` | \(\sqrt{MSE}\) | Root Mean Squared Error |
| RMSLE | `rmsle` | \(\sqrt{\frac{1}{T}\sum(\log(1+\hat{y}_t)-\log(1+y_t))^2}\) | Root Mean Squared Log Error — scale-invariant, requires \(y,\hat{y}\geq 0\) |
| NRMSE | — | \(\frac{RMSE}{\text{scale}}\) | Normalized RMSE (scale: max, range, or mean) |
| MAPE | `mape` | \(\frac{100}{T}\sum\frac{\lvert y_t - \hat{y}_t\rvert}{\lvert y_t\rvert}\) | Mean Absolute Percentage Error |
| sMAPE | `smape` | \(\frac{200}{T}\sum\frac{\lvert y_t - \hat{y}_t\rvert}{\lvert y_t\rvert + \lvert\hat{y}_t\rvert}\) | Symmetric MAPE |
| WMAPE | `wmape` | \(100 \times \frac{\sum\lvert y_t - \hat{y}_t\rvert}{\sum\lvert y_t\rvert}\) | Weighted MAPE |
| OPE | `ope` | \(100 \times \frac{\lvert\sum y_t - \sum \hat{y}_t\rvert}{\sum y_t}\) | Overall Percentage Error |
| MASE | `mase` | \(\frac{MAE}{MAE_{\text{naive}}}\) | Mean Absolute Scaled Error (naive = lag-1) |
| MSSE | `msse` | \(\frac{MSE}{MSE_{\text{naive}}}\) | Mean Squared Scaled Error |
| RMSSE | `rmsse` | \(\frac{RMSE}{RMSE_{\text{naive}}}\) | Root Mean Squared Scaled Error |
| R² | `r2` | \(1 - \frac{\sum(y_t-\hat{y}_t)^2}{\sum(y_t-\bar{y})^2}\) | Coefficient of Determination — proportion of variance explained |
| Correlation | `corr` | \(\rho(y, \hat{y})\) | Pearson correlation coefficient |
| Normalized Bias | `nbias` | \(\sum\frac{y_t - \hat{y}_t}{y_t + \hat{y}_t}\) | Normalized bias error |
The most recently added metrics are aligned with standard libraries (NeuralForecast, Darts, scikit-learn).
Skill Score#
The Skill Score measures relative improvement over a naive baseline:

\(SS = 1 - \frac{\text{metric}(y, \hat{y})}{\text{metric}(y, y_{\text{naive}})}\)

It is a standalone function (not part of `get_pointwise_metrics`) because it requires a baseline forecast:
from twiga.core.metrics import skill_score
ss = skill_score(y, y_hat, y_naive=seasonal_naive_forecast, metric="mae")
# ss > 0: model beats the baseline; ss = 0: no improvement; ss < 0: worse
Using Point Metrics#
from twiga.core.metrics.point import evaluate_point_forecast
metrics_df = evaluate_point_forecast(result, metric_names=["mae", "rmse", "r2", "smape", "corr"])
Available metric keys: "res", "mae", "mase", "mse", "msse", "rmse", "rmsse", "mape", "wmape", "smape", "ope", "corr", "nbias", "r2", "rmsle", "medae".
If metric_names is None, all available metrics are computed.
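The table formulas map one-to-one onto NumPy operations. A minimal sanity check of a few of them on toy arrays (illustrative only; evaluate_point_forecast() is the supported entry point):

import numpy as np
y = np.array([10.0, 12.0, 9.0, 14.0])
y_hat = np.array([11.0, 11.5, 9.5, 13.0])
e = y - y_hat
mae = np.mean(np.abs(e))                                        # 0.75
rmse = np.sqrt(np.mean(e ** 2))                                 # ~0.79
smape = 200 * np.mean(np.abs(e) / (np.abs(y) + np.abs(y_hat)))  # ~6.65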
Interval Forecast Metrics#
Defined in twiga/core/metrics/interval.py. Used via evaluate_interval_forecast().
| Metric | Key | Formula | Description |
|---|---|---|---|
| Winkler Score | `winkler_score` | \(\frac{1}{n}\!\sum_t\!\left[(U_t\!-\!L_t) + \frac{2}{\alpha}(L_t\!-\!y_t)\mathbf{1}_{y_t < L_t} + \frac{2}{\alpha}(y_t\!-\!U_t)\mathbf{1}_{y_t > U_t}\right]\) | Penalises both interval width and missed coverage |
| Coverage (PICP) | `picp` | \(\frac{1}{n}\sum_t \mathbf{1}[L_t \leq y_t \leq U_t]\) | Proportion of true values within prediction intervals |
| ACE | `ace` | \(PICP - (1 - \alpha)\) | Average Coverage Error: signed deviation from the nominal level |
| NMPI | `nmpi` | \(\frac{\text{median}(U_t - L_t)}{\text{scale}}\) | Normalised Median Prediction Interval width |
| CWE | `cwe` | \(\lambda \cdot \frac{(1+\beta^2)\,\gamma_{\text{picp}}\,\gamma_{\text{nmpi}}}{\beta^2\,\gamma_{\text{nmpi}} + \gamma_{\text{picp}}}\) | Coverage–Width–Error: harmonic mean of coverage and width scores, weighted by forecast accuracy |
| γ_picp | — | \(\min\!\left(\frac{\text{PICP}}{1-\alpha},\, 1\right)\) | Coverage adequacy: linear ramp from 0 to 1 as PICP reaches the target; saturates at 1 for over-coverage |
| γ_nmpi | — | \(\min\!\left(\left(\frac{\kappa/R}{\text{NMPI}}\right)^k,\, 1\right)\) | Width efficiency: full credit when NMPI ≤ reference; inverse penalty for excess width |
| MSIS | `msis` | \(\frac{\frac{1}{n}\sum_t\!\left[(U_t\!-\!L_t) + \frac{2}{\alpha}(L_t\!-\!y_t)\mathbf{1}_{y_t < L_t} + \frac{2}{\alpha}(y_t\!-\!U_t)\mathbf{1}_{y_t > U_t}\right]}{\frac{1}{T-s}\sum_{i=s+1}^{T}\lvert y_i - y_{i-s}\rvert}\) | Mean Scaled Interval Score — M4/M5 competition standard |
MSIS#
The Mean Scaled Interval Score is the standard metric for prediction interval comparison in the M4/M5 competitions. It penalizes width and missed coverage, normalized by the naive seasonal forecast MAE:
from twiga.core.metrics import msis
score = msis(true, lower, upper, y_train=training_series, alpha=0.1, seasonality=24)
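For reference, the numerator/denominator structure of the formula can be written directly in NumPy. This msis_reference helper is a hypothetical sketch for intuition, not the library implementation:

import numpy as np

def msis_reference(true, lower, upper, y_train, alpha, seasonality=1):
    # Numerator: mean interval score (width plus penalties for observations outside the interval)
    width = upper - lower
    below = (2.0 / alpha) * (lower - true) * (true < lower)
    above = (2.0 / alpha) * (true - upper) * (true > upper)
    interval_score = np.mean(width + below + above)
    # Denominator: in-sample MAE of the seasonal naive forecast
    naive_mae = np.mean(np.abs(y_train[seasonality:] - y_train[:-seasonality]))
    return interval_score / naive_mae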
CWE — Coverage–Width–Error#
CWE is the primary interval quality metric in Twiga. It uses the F-β harmonic mean of two sub-scores, weighted by point-forecast accuracy λ:

\(\text{CWE} = \lambda \cdot \frac{(1+\beta^2)\,\gamma_{\text{picp}}\,\gamma_{\text{nmpi}}}{\beta^2\,\gamma_{\text{nmpi}} + \gamma_{\text{picp}}}\)

where \(\lambda = 1 - \text{error}\) (e.g. \(1 - \text{NRMSE}\)).
Sub-scores:

- γ_picp: linear ramp from 0 at PICP = 0 to 1 at PICP = 1 − α. Over-coverage (PICP > 1 − α) does not increase this score; it is captured instead by γ_nmpi penalising the wider intervals that caused it.
- γ_nmpi: asymmetric width score. Full credit (1) for any NMPI ≤ κ/R (sharp intervals are not penalised here). For NMPI > κ/R the score falls as the inverse ratio raised to power k: doubling the width halves the score (k = 1 default). Increase k for stricter penalisation of excess width.

The harmonic mean handles the tension naturally: a poor score on either dimension pulls CWE toward zero regardless of the other, so an over-wide interval (high NMPI) is penalised even when PICP is perfect (see the sketch below).
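The composition is easy to verify on scalars. A minimal sketch, assuming k = 1 and β = 1 as defaults (the library's actual defaults may differ):

def cwe_reference(picp, nmpi, kappa_over_r, lam, alpha=0.1, k=1.0, beta=1.0):
    # Coverage adequacy: linear ramp, saturating once PICP reaches the nominal level
    g_picp = min(picp / (1 - alpha), 1.0)
    # Width efficiency: full credit up to the reference width, inverse-ratio penalty beyond it
    g_nmpi = min((kappa_over_r / nmpi) ** k, 1.0)
    # F-beta harmonic mean of the two sub-scores, weighted by point accuracy lambda
    return lam * (1 + beta ** 2) * g_picp * g_nmpi / (beta ** 2 * g_nmpi + g_picp)

# A perfectly covered but over-wide forecast is still penalised:
# cwe_reference(picp=0.92, nmpi=0.4, kappa_over_r=0.2, lam=0.9) ~= 0.6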
Reference κ/R: computed from the target distribution as spread ÷ scale. Spread defaults to IQR (robust); scale defaults to range (max − min). Both are configurable:
from twiga.core.metrics.interval import get_interval_metrics
# Default: IQR spread, range scale — best for skewed/heavy-tailed energy data
df = get_interval_metrics(true, lower, upper, alpha=0.1)
# MAD spread (maximum robustness for bimodal data, e.g. solar with nights)
df = get_interval_metrics(true, lower, upper, alpha=0.1, spread="mad")
# std spread (only for approximately Gaussian targets)
df = get_interval_metrics(true, lower, upper, alpha=0.1, spread="std")
# Custom scale denominator
df = get_interval_metrics(true, lower, upper, alpha=0.1, nmpi_scale="mean")
# Pass your own κ directly (e.g. domain knowledge on expected width)
df = get_interval_metrics(true, lower, upper, alpha=0.1, true_nmpi=15.0)
| `spread` | κ formula | Best for |
|---|---|---|
| `"iqr"` (default) | q₇₅ − q₂₅ | General; robust to outliers and heavy tails |
| `"mad"` | median(\lvert y − median(y)\rvert) | Bimodal or heavily skewed targets |
| `"std"` | σ | Approximately Gaussian targets only |

| `nmpi_scale` | R formula | Best for |
|---|---|---|
| `"range"` (default) | max − min | General; translation-invariant |
| `"max"` | max | Non-negative data where min ≈ 0 (equivalent to range) |
| `"mean"` | mean | When NMPI should read as a fraction of mean load |
| `"median"` | median | Robust alternative to mean for skewed data |
Using Interval Metrics#
from twiga.core.metrics.interval import evaluate_interval_forecast
metrics_df = evaluate_interval_forecast(result, alpha=0.1)
# Returns: timestamp | target | res | mae | ... | winkler_score | picp | ace | nmpi | cwe
# Interval-metric tuning with MAD spread
metrics_df = evaluate_interval_forecast(result, alpha=0.1, spread="mad", nmpi_scale="range")
Note
evaluate_interval_forecast() computes both point metrics and interval metrics for each forecast day. The alpha parameter should match the significance level used during conformal calibration. CWE is always in [0, 1] — higher is better.
Quantile Forecast Metrics#
Defined in twiga/core/metrics/quantile.py. Used via evaluate_quantile_forecast().
| Metric | Key | Formula | Description |
|---|---|---|---|
| Pinball Loss | `pinball` | \(\frac{1}{KT}\sum_{\tau,t}\rho_\tau(y_t, \hat{q}_{\tau,t})\) where \(\rho_\tau(y,q)=\max(\tau(y-q),\,(\tau-1)(y-q))\) | Mean pinball (check) loss across all quantile levels |
| WQL | `wql` | \(\frac{2\sum_\tau\sum_t\rho_\tau(y_t, \hat{q}_{\tau,t})}{\lvert\mathcal{T}\rvert\sum_t\lvert y_t\rvert}\) | Weighted Quantile Loss — GluonTS/M5 standard, normalised by total observation volume |
| CRPS | `crps` | \(\int_{-\infty}^{\infty}(F(z) - \mathbf{1}[z \geq y])^2\,dz\) (piecewise-linear \(F\)) | CRPS approximated from quantile CDF |
| Calibration Error | `calibration_error` | \(\frac{1}{K}\sum_\tau\lvert\hat{F}(\tau) - \tau\rvert\) | Mean absolute deviation of empirical coverage from nominal level |
| Sharpness | `sharpness` | \(\frac{1}{\lvert\mathcal{P}\rvert}\sum_{(\tau_l,\tau_u)\in\mathcal{P}}\text{mean}_t(\hat{q}_{\tau_u,t} - \hat{q}_{\tau_l,t})\) | Mean width of the central intervals formed by symmetric quantile pairs \(\mathcal{P}\) |
| PIT | — | histogram of \(\hat{F}(y_t)\) | Probability Integral Transform — uniform shape indicates calibration |
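The pinball (check) loss ρ_τ that drives the first two rows is asymmetric: under-prediction is weighted by τ, over-prediction by 1 − τ. A minimal NumPy sketch of the formula above:

import numpy as np

def pinball(y, q, tau):
    # max(tau * d, (tau - 1) * d): under-prediction costs tau, over-prediction costs (1 - tau)
    d = y - q
    return np.mean(np.maximum(tau * d, (tau - 1) * d))

# At tau = 0.5 the pinball loss is half the MAE, which is why WQL carries a factor of 2.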
Weighted Quantile Loss (WQL)#
WQL is the GluonTS/M5 standard aggregate quantile metric, normalized by total observation volume for cross-series comparability:
from twiga.core.metrics import wql
score = wql(true, quantile_preds, quantile_levels=[0.1, 0.5, 0.9], quantile_axis=0)
Using Quantile Metrics#
from twiga.core.metrics.quantile import evaluate_quantile_forecast
metrics_df = evaluate_quantile_forecast(result, alpha=0.1)
# Columns: point metrics + picp | ace | nmpi | cwe | pinball | wql | crps | calibration_error | sharpness
Probabilistic Metrics#
Defined in twiga/core/metrics/parametric.py. Used via evaluate_parametric_forecast().
| Metric | Key | Formula | Description |
|---|---|---|---|
| CRPS | `crps` | \(\mathbb{E}_F\lvert X - y\rvert - \tfrac{1}{2}\mathbb{E}_F\lvert X - X'\rvert\) | Sample-based CRPS — proper scoring rule generalising MAE to distributions |
| CRPS (PWM) | `crps_pwm` | (PWM estimator of the same quantity) | CRPS via Probability Weighted Moments — more stable for small sample counts |
| DSS | `dss` | \(\frac{(y_t - \mu_t)^2}{\sigma_t^2} + 2\log\sigma_t\) | Dawid-Sebastiani Score — proper rule jointly rewarding calibration and sharpness |
| Energy Score | `energy_score` | \(\mathbb{E}_F\lVert\mathbf{X}-\mathbf{y}\rVert_2 - \tfrac{1}{2}\mathbb{E}_F\lVert\mathbf{X}-\mathbf{X}'\rVert_2\) | Multivariate generalisation of CRPS for joint forecast distributions |
| Brier Score | `brier_score` | \(\frac{1}{n}\sum_t(\hat{p}_t - o_t)^2\) | Probabilistic binary event forecast accuracy |
| NLL | `nll` | \(-\frac{1}{n}\sum_t\log f(y_t\mid\theta_t)\) | Negative log-likelihood under the predictive distribution |
Dawid-Sebastiani Score (DSS)#
DSS is a proper scoring rule for location-scale parametric forecasts, jointly rewarding sharpness and calibration:
from twiga.core.metrics.parametric import dawid_sebastiani_score
dss = dawid_sebastiani_score(true, loc=predicted_means, scale=predicted_stds)
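Since the DSS formula is elementwise, it can be reproduced in two lines of NumPy as a sanity check (reusing the array names from the snippet above):

import numpy as np
z2 = ((true - predicted_means) / predicted_stds) ** 2
dss_manual = np.mean(z2 + 2 * np.log(predicted_stds))  # should match dawid_sebastiani_score(...)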
Energy Score#
The Energy Score generalises CRPS to multivariate (joint horizon) forecast distributions:
from twiga.core.metrics.parametric import energy_score
# samples shape: (n_ensemble_members, forecast_horizon)
es = energy_score(true=true_vector, samples=ensemble_samples)
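A hedged reference sketch of the sample-based estimator (naive O(m²) pairwise version; the library may use a different estimator):

import numpy as np

def energy_score_reference(y, samples):
    # samples: (n_ensemble_members, horizon); y: (horizon,)
    term1 = np.mean(np.linalg.norm(samples - y, axis=1))
    # Pairwise member-to-member distances (includes the zero diagonal)
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = np.mean(np.linalg.norm(diffs, axis=-1))
    return term1 - 0.5 * term2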
Brier Score#
Measures accuracy of probabilistic binary event forecasts (e.g., exceedance probability):
from twiga.core.metrics.parametric import brier_score
bs = brier_score(true, samples=ensemble_samples, threshold=peak_demand)
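Under the hood the event probability comes from the ensemble. A minimal sketch of the computation, assuming the probability is estimated as the fraction of members at or below the threshold (see brier_score in the API reference; the Brier Score is identical whichever side of the threshold defines the event):

import numpy as np
p_hat = (ensemble_samples <= peak_demand).mean(axis=0)  # P(y_t <= threshold) per step
outcome = (true <= peak_demand).astype(float)           # observed event indicator
bs_manual = np.mean((p_hat - outcome) ** 2)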
Using Probabilistic Metrics#
from twiga.core.metrics.parametric import evaluate_parametric_forecast
# Sample-based forecasts (CRPS + Energy Score)
metrics_df = evaluate_parametric_forecast(samples_result)
# Parametric forecasts (NLL + DSS)
metrics_df = evaluate_parametric_forecast(parametric_result)
Or via the low-level helper for direct array inputs:
from twiga.core.metrics.parametric import get_probabilistic_metrics
# Parametric mode (loc + scale)
metrics_df = get_probabilistic_metrics(true, loc=mu, scale=sigma)
# Sample-based mode
metrics_df = get_probabilistic_metrics(true, samples=ensemble_samples)
Analytical Parametric Scoring#
Defined in twiga/core/metrics/parametric.py. Closed-form implementations for common parametric distributions — no Monte Carlo sampling needed.
| Metric | Key | Distribution | Formula | Description |
|---|---|---|---|---|
| Normal NLL | `normal_nll` | \(\mathcal{N}(\mu,\sigma^2)\) | \(\frac{1}{n}\sum_t\left[\frac{1}{2}\log(2\pi\sigma_t^2) + \frac{(y_t-\mu_t)^2}{2\sigma_t^2}\right]\) | Negative log-likelihood under a Normal predictive distribution |
| Normal CRPS | `normal_crps` | \(\mathcal{N}(\mu,\sigma^2)\) | \(\sigma\!\left[z(2\Phi(z)-1) + 2\phi(z) - \tfrac{1}{\sqrt{\pi}}\right],\; z=\tfrac{y-\mu}{\sigma}\) | Exact analytical CRPS — no quantile approximation needed |
| Laplace NLL | `laplace_nll` | \(\text{Laplace}(\mu,b)\) | \(\frac{1}{n}\sum_t\left[\log(2b_t) + \frac{\lvert y_t-\mu_t\rvert}{b_t}\right]\) | Negative log-likelihood under a Laplace predictive distribution |
from twiga.core.metrics.parametric import normal_nll, normal_crps, laplace_nll
nll = normal_nll(true, loc=mu, scale=sigma)
crps = normal_crps(true, loc=mu, scale=sigma) # exact closed form
The analytical `normal_crps` uses the closed form \(\text{CRPS}(\mathcal{N}(\mu,\sigma^2), y) = \sigma\left[z(2\Phi(z)-1) + 2\phi(z) - \tfrac{1}{\sqrt{\pi}}\right]\) with \(z = \tfrac{y-\mu}{\sigma}\), where \(\Phi\) and \(\phi\) are the standard Normal CDF and PDF.
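The closed form is cheap to check against a Monte Carlo estimate. A sketch using SciPy for Φ and φ (verification only, not part of Twiga):

import numpy as np
from scipy.stats import norm

def normal_crps_reference(y, mu, sigma):
    z = (y - mu) / sigma
    return np.mean(sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi)))

# Monte Carlo cross-check: CRPS = E|X - y| - 0.5 * E|X - X'|
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100_000)
mc = np.mean(np.abs(x - 1.0)) - 0.5 * np.mean(np.abs(x - rng.permutation(x)))
# mc ~= normal_crps_reference(1.0, 2.0, 1.5)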
Statistical Significance Testing#
Defined in twiga/core/metrics/stats.py.
Both tests share the same interface: they accept raw forecast arrays (p_real, p_pred_1, p_pred_2) of shape (n_days, n_steps_per_day) and return a one-sided p-value. A small p-value (< 0.05) means p_pred_2 is significantly more accurate than p_pred_1.
| Test | Statistic | Null distribution | What it tests |
|---|---|---|---|
| DM | \(DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{\text{HAC}} / T}}\) | \(\mathcal{N}(0,1)\) | Unconditional equal predictive accuracy |
| GW | \(T \cdot R^2\) from OLS on instruments \(h_t = [1, d_{t-1}]\) | \(\chi^2(q)\) | Conditional predictive accuracy (exploits information set) |
Diebold-Mariano (DM) Test#
Tests unconditional equal accuracy. The loss differential \(d_t = g(e_{1,t}) - g(e_{2,t})\) is tested against zero using a HAC Bartlett variance estimator to handle serial correlation in multi-step forecasts:
from twiga.core.metrics import diebold_mariano_test
# Multivariate (single test, averaging across steps)
p = diebold_mariano_test(p_real, p_pred_1, p_pred_2, norm=1, version="multivariate", h=24)
# Univariate (one test per step, e.g. per hour in a 24h horizon)
p_per_hour = diebold_mariano_test(p_real, p_pred_1, p_pred_2, norm=1, version="univariate", h=24)
# p_per_hour.shape == (24,)
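For intuition, the statistic itself is only a few lines of NumPy. A simplified reference sketch for the multivariate, norm=1 case on 1-D loss series (the library handles the array reshaping and both norms):

import numpy as np
from scipy.stats import norm

def dm_pvalue_reference(e1, e2, h=1):
    # Loss differential d_t = |e1_t| - |e2_t|; positive values favour forecast 2
    d = np.abs(e1) - np.abs(e2)
    T = d.size
    d_bar = d.mean()
    # Bartlett HAC variance with bandwidth h - 1
    var = np.mean((d - d_bar) ** 2)
    for lag in range(1, h):
        cov = np.mean((d[lag:] - d_bar) * (d[:-lag] - d_bar))
        var += 2 * (1 - lag / h) * cov
    dm = d_bar / np.sqrt(var / T)
    return 1 - norm.cdf(dm)  # one-sided: small p means forecast 2 significantly better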
Giacomini-White (GW) Test#
Tests conditional predictive accuracy — it checks whether one forecast systematically exploits information the other ignores. Uses instruments \(h_t = [1, d_{t-1}]\) in an OLS regression; the test statistic is \(T \cdot R^2 \sim \chi^2(q)\):
from twiga.core.metrics import giacomini_white_test
# Multivariate (single joint test)
p = giacomini_white_test(p_real, p_pred_1, p_pred_2, norm=1, version="multivariate")
# Univariate (per-step tests)
p_per_hour = giacomini_white_test(p_real, p_pred_1, p_pred_2, norm=1, version="univariate")
When to use which test:

- Use DM for a straightforward comparison of two forecasts without conditioning on past information.
- Use GW when you suspect one model exploits time-varying information (e.g., regime changes, calendar effects) that the other does not; GW is more powerful in these situations.
Visualising the test matrix#
Both functions return a single p-value when version="multivariate", so pairwise comparisons across a set of models assemble naturally into a comparison heatmap:
import pandas as pd
from twiga.core.metrics import diebold_mariano_test

models = {"Model A": pred_a, "Model B": pred_b, "Naive": pred_naive}
p_matrix = pd.DataFrame(index=list(models), columns=list(models), dtype=float)
for m1, p1 in models.items():
    for m2, p2 in models.items():
        if m1 == m2:
            p_matrix.loc[m1, m2] = 1.0
        else:
            # Small p-value: the column model (m2) significantly beats the row model (m1)
            p_matrix.loc[m1, m2] = diebold_mariano_test(p_real, p1, p2, version="multivariate")
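From there, any heatmap tool works; a minimal matplotlib rendering (not part of Twiga), in which cells below 0.05 mark column models that significantly beat the row model:

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
im = ax.imshow(p_matrix.values.astype(float), cmap="RdYlGn_r", vmin=0.0, vmax=0.1)
ax.set_xticks(range(len(models)))
ax.set_xticklabels(list(models))
ax.set_yticks(range(len(models)))
ax.set_yticklabels(list(models))
fig.colorbar(im, ax=ax, label="one-sided p-value")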
Unified Dispatch#
Use evaluate_forecast() as the single entry point — it dispatches automatically based on result.kind:
from twiga.core.metrics import evaluate_forecast
metrics_df = evaluate_forecast(result) # auto-dispatches
metrics_df = evaluate_forecast(result, alpha=0.1) # for interval/quantile kinds
Supported kind values:
| `kind` | Evaluate function | Key metrics |
|---|---|---|
| `POINT` | `evaluate_point_forecast()` | MAE, RMSE, R², MedAE, RMSLE, … |
| `INTERVAL` | `evaluate_interval_forecast()` | point metrics + PICP, ACE, NMPI, MSIS, CWE |
| `QUANTILE` | `evaluate_quantile_forecast()` | point + interval + pinball, WQL, CRPS, sharpness |
| `PARAMETRIC` (samples) | `evaluate_parametric_forecast()` | point metrics + CRPS, Energy Score |
| `PARAMETRIC` (loc/scale) | `evaluate_parametric_forecast()` | point metrics + NLL, DSS |
API Reference#
- twiga.core.metrics.evaluate_forecast(result, **kwargs)#
Dispatch forecast evaluation to the kind-appropriate function.
This is the single entry point for evaluation so callers do not need to know or import individual evaluate functions. The correct function is selected from an internal dispatch table keyed on `kind`.

- Parameters:
  - result (`ForecastResult`) – A `ForecastResult` whose `ground_truth` field is set. The `kind` attribute determines which evaluate function is called.
  - `**kwargs` (`object`) – Extra keyword arguments forwarded to the underlying evaluate function (e.g. `alpha` for interval or quantile forecasts).
- Return type: `DataFrame`
- Returns: DataFrame of per-day, per-target metrics indexed by daily timestamp.
- Raises: ValueError – If `result.ground_truth` is `None` or `result.kind` is not one of the supported values.
Examples
>>> result = ForecastResult(kind=ForecastKind.POINT, ground_truth=y, ...)
>>> metrics_df = evaluate_forecast(result)
- twiga.core.metrics.point.evaluate_point_forecast(result, metric_names=None, axis=1)#
Evaluate point forecasts by computing daily pointwise metrics.
- Parameters:
  - result (`ForecastResult`) – `ForecastResult` with `ground_truth` set, `kind=ForecastKind.POINT`.
  - metric_names (`list[str] | None`) – Metric names to compute. When `None` all supported point metrics are computed.
  - axis (`int | None`) – Axis along which to compute aggregate metrics. If `None`, metrics that require an axis use their default behavior.
- Return type: `DataFrame`
- Returns: DataFrame of per-day, per-target metrics indexed by daily timestamp.
- twiga.core.metrics.point.r_squared(y, y_hat, axis=0)#
Calculate the Coefficient of Determination (R²) between actual and predicted values.
R² measures the proportion of variance in the actual values explained by the predictions:

\(R^2 = 1 - \frac{\sum_{t=1}^{T} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2}\)

where \(\bar{y}\) is the mean of the actual values. A perfect forecast yields R² = 1; a constant mean prediction yields R² = 0; worse-than-mean predictions can yield negative values.

- Parameters:
  - y – Actual values.
  - y_hat – Predicted values, same shape as y.
  - axis (`int`) – Axis along which to compute R². Defaults to 0.
- Return type: `float | ndarray`
- Returns: R² coefficient, scalar if 1D input, array if multi-dimensional. Returns NaN if the variance of y is zero.
- Raises: ValueError – If y and y_hat have incompatible shapes.
Examples
>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> y_hat = np.array([1.1, 2.0, 2.9, 4.1, 5.0])
>>> r_squared(y, y_hat)
0.997
- twiga.core.metrics.point.rmsle(y, y_hat, axis=0)#
Calculate the Root Mean Squared Log Error (RMSLE).
The RMSLE penalises under-predictions more than over-predictions and is scale-invariant on a log scale:

\(RMSLE = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \bigl(\log(1 + \hat{y}_t) - \log(1 + y_t)\bigr)^2}\)

- Parameters:
  - y – Actual values (must be non-negative).
  - y_hat – Predicted values, same shape as y (must be non-negative).
  - axis (`int`) – Axis along which to compute the RMSLE. Defaults to 0.
- Return type: `float | ndarray`
- Returns: RMSLE, scalar if 1D input, array if multi-dimensional.
- Raises: ValueError – If any value in y or y_hat is negative.
Examples
>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0])
>>> y_hat = np.array([1.1, 1.9, 3.2])
>>> rmsle(y, y_hat)
0.04438...
- twiga.core.metrics.point.medae(y, y_hat, axis=0)#
Calculate the Median Absolute Error (MedAE) between actual and predicted values.
MedAE is robust to outliers, unlike MAE:

\(MedAE = \text{median}\bigl(\lvert y_t - \hat{y}_t \rvert\bigr)\)

- Parameters:
  - y – Actual values.
  - y_hat – Predicted values, same shape as y.
  - axis (`int`) – Axis along which to compute the median. Defaults to 0.
- Return type: `float | ndarray`
- Returns: Median absolute error, scalar if 1D input, array if multi-dimensional.
- Raises: ValueError – If y and y_hat have incompatible shapes.
Examples
>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 100.0])
>>> y_hat = np.array([1.1, 2.0, 2.9, 90.0])
>>> medae(y, y_hat)
0.1
- twiga.core.metrics.point.skill_score(y, y_hat, y_naive, metric='mae', axis=0)#
Calculate the Skill Score of a forecast relative to a naive baseline.
The Skill Score quantifies the relative improvement of a model over a naive baseline forecast:

\(SS = 1 - \frac{\text{metric}(y, \hat{y})}{\text{metric}(y, y_{\text{naive}})}\)

A positive score indicates improvement over the baseline; 0 means no improvement; negative values indicate the model is worse than the baseline.

- Parameters:
  - y (`TypeAliasType`) – Actual values, shape (n_samples,).
  - y_hat (`TypeAliasType`) – Model predictions, same shape as y.
  - y_naive (`TypeAliasType`) – Naive baseline predictions (e.g. seasonal naïve, persistence), same shape as y.
  - metric (`str`) – Error metric to use for comparison. One of `'mae'`, `'mse'`, `'rmse'`. Defaults to `'mae'`.
  - axis (`int | None`) – Axis along which to compute the metric. If `None`, flatten arrays. Defaults to 0.
- Return type: `float | ndarray`
- Returns: Skill score(s). Scalar if 1D input or axis=None, otherwise array.
- Raises:
  - ValueError – If metric is not one of the supported values.
  - ValueError – If the naive baseline metric is zero.
Examples
>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 4.0])
>>> y_hat = np.array([1.1, 2.1, 3.0, 4.0])
>>> y_naive = np.array([1.0, 1.0, 1.0, 1.0])
>>> skill_score(y, y_hat, y_naive, metric="mae")
0.9666...
- twiga.core.metrics.interval.evaluate_interval_forecast(result, alpha=0.01, true_nmpi=None, spread='iqr', nmpi_scale='range', axis=1, metric_names=None)#
Evaluate interval forecasts by computing daily point and interval metrics.
- Parameters:
  - result (`ForecastResult`) – `ForecastResult` with `ground_truth`, `lower`, and `upper` set, `kind=ForecastKind.INTERVAL`.
  - alpha (`float`) – Significance level used for Winkler score and coverage computations. Must be in (0, 1). Defaults to 0.01.
  - true_nmpi (`float | None`) – Override for κ, the absolute spread of the target used as the CWE reference numerator. When `None`, derived from `spread`.
  - spread (`Literal['iqr', 'mad', 'std']`) – Spread measure for the CWE reference κ. `"iqr"` (default), `"mad"`, or `"std"`. See `get_interval_metrics()`.
  - nmpi_scale (`Literal['range', 'max', 'mean', 'median']`) – Denominator R for NMPI and κ/R. `"range"` (default), `"max"`, `"mean"`, or `"median"`.
  - axis (`int | None`) – Axis along which to compute aggregate metrics.
  - metric_names (`list[str] | None`) – List of interval metric names to compute.
- Return type: `DataFrame`
- Returns: DataFrame of per-day, per-target point and interval metrics indexed by daily timestamp.
- twiga.core.metrics.interval.msis(true, lower, upper, y_train, alpha, seasonality=1, axis=None)#
Calculate the Mean Scaled Interval Score (MSIS).
\(\text{MSIS} = \frac{\text{Mean}(W_t)}{\text{NaiveMAE}_{\text{seasonal}}}\)

where \(W_t\) is the per-step Winkler interval score and the denominator is the in-sample MAE of the seasonal naive forecast.

- Parameters:
  - true – Observed values over the evaluation window.
  - lower – Lower interval bounds, same shape as true.
  - upper – Upper interval bounds, same shape as true.
  - y_train – Training series used to compute the seasonal naive MAE.
  - alpha (`float`) – Significance level of the interval.
  - seasonality (`int`) – Seasonal lag for the naive forecast. Defaults to 1.
  - axis (`int | None`) – Axis along which to aggregate. Defaults to `None`.
- Return type: `float | ndarray`
- Returns: MSIS (scalar or array).
- twiga.core.metrics.quantile.evaluate_quantile_forecast(result, alpha=0.01, true_nmpi=None, metric_names=None, quantile_axis=1, axis=1)#
Evaluate quantile forecasts by computing daily point, interval, and quantile metrics.
This function loops over days and targets (using `_evaluate_loop`) and aggregates point metrics (from `get_pointwise_metrics()`), interval metrics (from `get_interval_metrics()`), and quantile metrics (from `get_quantile_metrics()`) into a single DataFrame.

- Parameters:
  - result (`ForecastResult`) – `ForecastResult` with `ground_truth`, `quantiles`, and `quantile_levels` set, and `kind=ForecastKind.QUANTILE`.
  - alpha (`float`) – Significance level for interval metrics; the outermost quantile pair (first and last) is used as lower/upper bounds. Defaults to 0.01.
  - true_nmpi (`float | None`) – Optional reference value for NMPI normalisation. When `None` the standard deviation of the true values is used.
  - metric_names (`list[str] | None`) – Point metric names to compute. If `None`, all supported point metrics are computed.
  - quantile_axis (`int`) – Axis of `quantile_preds` that holds the quantiles. If `None`, it is inferred (first axis if its length matches, otherwise second).
  - axis (`int`) – Axis or axes passed to `get_quantile_metrics()` (aggregation over the forecast horizon). Defaults to 1.
- Return type: `DataFrame`
- Returns: DataFrame of per-day, per-target metrics indexed by daily timestamp. Columns include point metrics, interval metrics, and quantile metrics.
- twiga.core.metrics.quantile.wql(true, quantile_preds, quantile_levels, quantile_axis=None, axis=1, double_factor=True, normalize_by_volume=True)#
Weighted Quantile Loss (WQL), also known as the scaled pinball loss.
WQL is the GluonTS / M5 standard aggregate quantile metric:
\[\text{WQL} = \frac{2 \mathbf{1}_{\text{double}} \sum_{\tau,t} \rho_\tau(y_t, \hat{q}_{\tau,t})} {|\mathcal{T}| \cdot \sum_t |y_t| \mathbf{1}_{\text{normalize}}}\]

where \(\rho_\tau\) is the pinball loss. The factor \(2\) is often included to make the loss at \(\tau=0.5\) equal to the MAE.

- Parameters:
  - true (`ndarray`) – Ground truth values, any shape.
  - quantile_preds (`ndarray`) – Predicted quantiles. Must have one axis holding the quantiles; the remaining dimensions must match `true.shape` after axis movement.
  - quantile_levels (`ndarray | list[float] | tuple[float, ...]`) – Quantile levels \(\tau \in (0,1)\), strictly increasing.
  - quantile_axis (`int | None`) – Axis of `quantile_preds` that holds the quantiles. If `None`, it is inferred (first axis if its length matches, otherwise second).
  - axis (`int`) – Axis or axes along which to aggregate the WQL. The total volume (sum of \(|y|\)) is computed over this axis, and the pinball loss is summed over the same axis. If `None`, aggregate over all dimensions and return a scalar.
  - double_factor (`bool`) – Whether to multiply by \(2\) (default `True`).
  - normalize_by_volume (`bool`) – Whether to divide by the total absolute volume \(\sum_t |y_t|\) (default `True`).
- Return type: `float | ndarray`
- Returns: If axis is `None`, a scalar WQL. Otherwise, an array with the same shape as `true` after removing the dimensions specified by axis.
- Raises: ValueError – If inputs are empty, levels out of range, shape mismatch, or total observation volume is zero when `normalize_by_volume` is `True`.
References
Alexandrov et al. (2020). GluonTS: Probabilistic and neural time series modeling in Python. Journal of Machine Learning Research, 21(116), 1–6.
- twiga.core.metrics.parametric.evaluate_parametric_forecast(result, metric_names=None)#
Evaluate parametric forecasts by computing daily point and NLL metrics.
The NLL is computed under a Normal distribution assumption when `result.log_likelihood` is absent. Pass pre-computed log-likelihood values via that attribute to override this behaviour.

- Parameters:
  - result (`ForecastResult`) – `ForecastResult` with `ground_truth` and `scale` set, `kind=ForecastKind.PARAMETRIC`.
  - metric_names (`list[str] | None`) – List of probabilistic metric names to compute.
- Return type: `DataFrame`
- Returns: DataFrame of per-day, per-target point metrics and NLL indexed by daily timestamp.
- Raises:
ValueError – If any scale value is non-positive.
- twiga.core.metrics.parametric.get_probabilistic_metrics(true, loc=None, scale=None, samples=None, log_likelihood=None, metric_names=None, axis=1)#
Compute probabilistic and pointwise metrics for forecasts.
Supports two modes:

- Parametric (Normal distribution): provide loc and scale.
- Sample-based (ensemble): provide samples.

If loc is provided, pointwise metrics (MAE, RMSE, etc.) are also computed via `get_pointwise_metrics()`.

- Parameters:
  - true (`TypeAliasType`) – Ground-truth values, any shape.
  - loc (`TypeAliasType | None`) – Point predictions (e.g., mean of the predictive distribution). Required for parametric mode; if given, pointwise metrics are added.
  - scale (`TypeAliasType | None`) – Predicted standard deviations (parametric Normal). Must be > 0. Required for parametric mode.
  - samples (`TypeAliasType | None`) – Ensemble/posterior samples, shape (n_samples, *true.shape). Required for sample-based mode.
  - log_likelihood (`TypeAliasType | None`) – Pre-computed log-likelihood values (optional). If provided, overrides NLL computation in parametric mode.
  - metric_names (`list[str] | None`) – List of probabilistic metric names to compute. If `None`, all available metrics for the chosen mode are computed. Supported names — parametric: `'nll'`, `'crps'`, `'dss'`; sample-based: `'crps'`, `'energy_score'`, `'crps_pwm'`, `'brier_score'`.
  - axis (`int | None`) – Axis along which to aggregate the metrics (e.g., over the forecast horizon). If `None`, aggregate over all dimensions (scalar result). Defaults to 1.
- Return type: `DataFrame`
- Returns: Single-row `DataFrame` whose columns are all pointwise metrics (if loc was provided) plus the selected probabilistic metrics.
- Raises: ValueError – If neither scale nor samples is provided, or if required inputs are missing for a mode, or if scale contains non-positive values.
- twiga.core.metrics.parametric.dawid_sebastiani_score(true, loc, scale, axis=None)#
Compute the Dawid-Sebastiani Score (DSS) for parametric Normal forecasts.
The Dawid-Sebastiani Score under a Normal distribution is:
\[\text{DSS}(y_t, \mu_t, \sigma_t) = \frac{(y_t - \mu_t)^2}{\sigma_t^2} + 2 \log \sigma_t\]

Lower values indicate better forecasts.

- Parameters:
  - true – Observed values.
  - loc – Predicted means \(\mu_t\), same shape as true.
  - scale – Predicted standard deviations \(\sigma_t\), same shape as true. Must be positive.
  - axis (`int | None`) – Axis along which to average. Defaults to `None`.
- Return type: `ndarray`
- Returns: Mean DSS as a NumPy array (scalar when `axis=None`).
- Raises: ValueError – If any `scale` value is non-positive.
References
Dawid & Sebastiani (1999). Coherent dispersion criteria for optimal experimental design. The Annals of Statistics.
- twiga.core.metrics.parametric.energy_score(true, samples, axis=None)#
Compute the Energy Score (ES) for ensemble forecasts.
The Energy Score is the multivariate generalization of CRPS:
\[\text{ES}(F, \mathbf{y}) = \mathbb{E}_F\|\mathbf{X} - \mathbf{y}\|_2 - \tfrac{1}{2} \mathbb{E}_F\|\mathbf{X} - \mathbf{X}'\|_2\]

- Parameters:
  - true – Observed vector \(\mathbf{y}\) (e.g., the forecast horizon).
  - samples – Ensemble samples, shape (n_samples, *true.shape).
  - axis (`int | None`) – Axis along which to aggregate. Defaults to `None`.
- Return type: `ndarray`
- Returns: Energy Score as a NumPy array (scalar when `axis=None`). Lower is better.
References
Gneiting & Raftery (2007). Strictly proper scoring rules. JASA.
- twiga.core.metrics.parametric.brier_score(true, samples, threshold, axis=None)#
Compute the Brier Score for a probabilistic binary event forecast.
The Brier Score is:
\[\text{BS} = \frac{1}{n} \sum_{t=1}^{n} \left( \hat{p}_t - o_t \right)^2\]

where \(\hat{p}_t\) is the predicted probability that \(y_t \leq\) threshold.

- Parameters:
  - true – Observed values.
  - samples – Ensemble samples, shape (n_samples, *true.shape).
  - threshold – Event threshold defining the binary outcome.
  - axis (`int | None`) – Axis along which to average. Defaults to `None`.
- Return type: `ndarray`
- Returns: Brier Score as a NumPy array (scalar when `axis=None`). Lower is better.
References
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review.
- twiga.core.metrics.parametric.normal_nll(true, loc, scale, axis=None)#
Negative log-likelihood under a Normal predictive distribution.
The negative log-likelihood (NLL) is:
\[\text{NLL} = \frac{1}{n} \sum_{t=1}^{n} \left[ \frac{1}{2} \log(2\pi \sigma_t^2) + \frac{(y_t - \mu_t)^2}{2\sigma_t^2} \right]\]

- Parameters:
  - true – Observed values.
  - loc – Predicted means \(\mu_t\), same shape as true.
  - scale – Predicted standard deviations \(\sigma_t\), same shape as true. Must be positive.
  - axis (`int | None`) – Axis along which to average. Defaults to `None`.
- Return type: `ndarray`
- Returns: Negative log-likelihood as a NumPy array (scalar when `axis=None`).
- Raises: ValueError – If any `scale` value is non-positive.
Examples
>>> import numpy as np
>>> normal_nll(np.array([1.0, 2.0]), np.array([1.1, 1.9]), np.array([0.5, 0.5]))
array(0.2457...)
- twiga.core.metrics.parametric.normal_crps(true, loc, scale, axis=None)#
Analytical Continuous Ranked Probability Score (CRPS) for a Normal predictive distribution.
The closed-form CRPS under \(\mathcal{N}(\mu, \sigma^2)\) is:
\[\text{CRPS}(\mathcal{N}(\mu, \sigma^2), y) = \sigma \left[ z \left( 2\Phi(z) - 1 \right) + 2\phi(z) - \frac{1}{\sqrt{\pi}} \right]\]

where \(z = \frac{y - \mu}{\sigma}\), and \(\Phi\), \(\phi\) are the standard Normal CDF and PDF respectively.

- Parameters:
  - true – Observed values.
  - loc – Predicted means \(\mu\), same shape as true.
  - scale – Predicted standard deviations \(\sigma\), same shape as true. Must be positive.
  - axis (`int | None`) – Axis along which to average. Defaults to `None`.
- Return type: `ndarray`
- Returns: Mean CRPS as a NumPy array (scalar when `axis=None`). Lower is better.
- Raises: ValueError – If any `scale` value is non-positive.
References
Gneiting, T., Raftery, A. E., Westveld, A. H., & Goldman, T. (2005). Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133(5), 1098–1118.
Examples
>>> import numpy as np
>>> normal_crps(np.array([1.0, 2.0]), np.array([1.0, 2.0]), np.array([1.0, 1.0]))
array(0.2336...)  # σ(√2 − 1)/√π for a perfectly calibrated forecast
- twiga.core.metrics.parametric.laplace_nll(true, loc, scale, axis=None)#
Negative log-likelihood under a Laplace predictive distribution.
The negative log-likelihood (NLL) is:
\[\text{NLL} = \frac{1}{n} \sum_{t=1}^{n} \left[ \log(2b_t) + \frac{|y_t - \mu_t|}{b_t} \right]\]

where \(b_t\) is the scale parameter of the Laplace distribution.

- Parameters:
  - true – Observed values.
  - loc – Predicted location parameters \(\mu_t\), same shape as true.
  - scale – Predicted scale parameters \(b_t\), same shape as true. Must be positive.
  - axis (`int | None`) – Axis along which to average. Defaults to `None`.
- Return type: `ndarray`
- Returns: Mean NLL as a NumPy array (scalar when `axis=None`).
- Raises: ValueError – If any `scale` value is non-positive.
Examples
>>> import numpy as np
>>> laplace_nll(np.array([1.0, 2.0]), np.array([1.1, 1.9]), np.array([0.5, 0.5]))
array(0.2)
- twiga.core.metrics.stats.diebold_mariano_test(p_real, p_pred_1, p_pred_2, norm=1, version='multivariate', h=1)#
One-sided Diebold-Mariano (DM) test for equal predictive accuracy.
Tests the null hypothesis H0 that forecast `p_pred_1` has equal or lower loss than forecast `p_pred_2` against the one-sided alternative H1 that `p_pred_2` is strictly more accurate. Rejecting H0 (small p-value) means `p_pred_2` is significantly more accurate.

The loss differential is \(d_t = g(e_{1,t}) - g(e_{2,t})\) where \(g(e) = |e|^{norm}\) (`norm=1` → MAE-based, `norm=2` → MSE-based). The test statistic

\[DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{\text{HAC}} / T}} \xrightarrow{d} \mathcal{N}(0,1)\]

uses a Bartlett HAC variance estimator with bandwidth `h - 1` to account for serial correlation in multi-step forecasts.

- Parameters:
  - p_real (`ndarray`) – Observed values, shape (n_days, n_steps) or (T,).
  - p_pred_1 (`ndarray`) – Forecast 1 predictions, same shape as `p_real`.
  - p_pred_2 (`ndarray`) – Forecast 2 predictions, same shape as `p_real`.
  - norm (`int`) – Loss function norm. 1 for MAE-based, 2 for MSE-based. Defaults to 1.
  - version (`str`) – Test variant. `"multivariate"`: single test on the mean loss differential averaged across steps (default). `"univariate"`: independent test per time step (column), returns an array of p-values of length n_steps.
  - h (`int`) – Forecast horizon used to set the HAC bandwidth. Set to the number of steps ahead you are forecasting. Defaults to 1.
- Return type: `float | ndarray`
- Returns: One-sided p-value (float) for `"multivariate"`, or 1-D array of p-values (one per step) for `"univariate"`.
- Raises:
  - ValueError – If input shapes do not match.
  - ValueError – If `version` is not `"univariate"` or `"multivariate"`.
  - ValueError – If `norm` is not 1 or 2.
References
Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253–263.
Examples
>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> real = rng.normal(0, 1, (100, 24))
>>> pred1 = real + rng.normal(0, 0.5, real.shape)
>>> pred2 = real + rng.normal(0, 0.3, real.shape)  # better forecast
>>> p = diebold_mariano_test(real, pred1, pred2, version="multivariate")
- twiga.core.metrics.stats.giacomini_white_test(p_real, p_pred_1, p_pred_2, norm=1, version='multivariate')#
One-sided Giacomini-White (GW) test for Conditional Predictive Accuracy (CPA).
Tests the null hypothesis H0 that forecast `p_pred_1` has equal or lower conditional expected loss than `p_pred_2` against the one-sided alternative H1 that `p_pred_2` is conditionally more accurate. Rejecting H0 (small p-value) means `p_pred_2` is significantly more accurate given the available information set.

Unlike DM, the GW test conditions on lagged loss differentials, making it more powerful when one model systematically exploits information that the other misses. The instruments are \(h_t = [1, d_{t-1}]\), and the test statistic is \(T \cdot R^2 \sim \chi^2(q)\) where \(q\) is the number of instruments.
- Parameters:
  - p_real (`ndarray`) – Observed values, shape (n_days, n_steps) or (T,).
  - p_pred_1 (`ndarray`) – Forecast 1 predictions, same shape as `p_real`.
  - p_pred_2 (`ndarray`) – Forecast 2 predictions, same shape as `p_real`.
  - norm (`int`) – Loss function norm. 1 for MAE-based, 2 for MSE-based. Defaults to 1.
  - version (`str`) – Test variant. `"multivariate"`: single test on the mean loss differential averaged across steps (default). `"univariate"`: independent test per time step, returns an array of p-values of length n_steps.
- Return type: `float | ndarray`
- Returns: One-sided p-value (float) for `"multivariate"`, or 1-D array of p-values (one per step) for `"univariate"`.
- Raises:
  - ValueError – If input shapes do not match or are 1-D (use (T, 1) for a single-step series).
  - ValueError – If `version` is not `"univariate"` or `"multivariate"`.
  - ValueError – If `norm` is not 1 or 2.
References
Giacomini, R., & White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6), 1545–1578.
Examples
>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> real = rng.normal(0, 1, (100, 24))
>>> pred1 = real + rng.normal(0, 0.5, real.shape)
>>> pred2 = real + rng.normal(0, 0.3, real.shape)  # better forecast
>>> p = giacomini_white_test(real, pred1, pred2, version="multivariate")