Evaluation Metrics#

Source Files
  • twiga/core/metrics/point.py

  • twiga/core/metrics/interval.py

  • twiga/core/metrics/quantile.py

  • twiga/core/metrics/parametric.py

  • twiga/core/metrics/stats.py

Twiga provides a comprehensive set of metrics for evaluating point forecasts, prediction intervals, quantile forecasts, and full probabilistic forecasts. All metrics operate on NumPy arrays and are accessible via evaluate_forecast().

Point Forecast Metrics#

Defined in twiga/core/metrics/point.py. Used via evaluate_point_forecast().

| Metric | Key | Formula | Description |
|---|---|---|---|
| Mean Error | res | \(\frac{1}{T}\sum(y_t - \hat{y}_t)\) | Average signed error (bias indicator) |
| MAE | mae | \(\frac{1}{T}\sum\lvert y_t - \hat{y}_t\rvert\) | Mean Absolute Error |
| MedAE | medae | \(\text{median}(\lvert y_t - \hat{y}_t\rvert)\) | Median Absolute Error — robust to outliers |
| MSE | mse | \(\frac{1}{T}\sum(y_t - \hat{y}_t)^2\) | Mean Squared Error |
| RMSE | rmse | \(\sqrt{MSE}\) | Root Mean Squared Error |
| RMSLE | rmsle | \(\sqrt{\frac{1}{T}\sum(\log(1+\hat{y}_t)-\log(1+y_t))^2}\) | Root Mean Squared Log Error — scale-invariant, requires \(y,\hat{y}\geq 0\) |
| NRMSE | nrmse | \(\frac{RMSE}{\text{scale}}\) | Normalized RMSE (scale: max, range, or mean) |
| MAPE | mape | \(\frac{100}{T}\sum\frac{\lvert y_t - \hat{y}_t\rvert}{\lvert y_t\rvert}\) | Mean Absolute Percentage Error |
| sMAPE | smape | \(\frac{200}{T}\sum\frac{\lvert y_t - \hat{y}_t\rvert}{\lvert y_t\rvert + \lvert\hat{y}_t\rvert}\) | Symmetric MAPE |
| WMAPE | wmape | \(100 \times \frac{\sum\lvert y_t - \hat{y}_t\rvert}{\sum\lvert y_t\rvert}\) | Weighted MAPE |
| OPE | ope | \(100 \times \frac{\lvert\sum y_t - \sum \hat{y}_t\rvert}{\sum y_t}\) | Overall Percentage Error |
| MASE | mase | \(\frac{MAE}{MAE_{\text{naive}}}\) | Mean Absolute Scaled Error (naive = lag-1) |
| MSSE | msse | \(\frac{MSE}{MSE_{\text{naive}}}\) | Mean Squared Scaled Error |
| RMSSE | rmsse | \(\frac{RMSE}{RMSE_{\text{naive}}}\) | Root Mean Squared Scaled Error |
| R² | r2 | \(1 - \frac{\sum(y_t-\hat{y}_t)^2}{\sum(y_t-\bar{y})^2}\) | Coefficient of Determination — proportion of variance explained |
| Correlation | corr | \(\rho(y, \hat{y})\) | Pearson correlation coefficient |
| Normalized Bias | nbias | \(\sum\frac{y_t - \hat{y}_t}{y_t + \hat{y}_t}\) | Normalized bias error |

Several of these metrics are recent additions aligned with standard libraries (NeuralForecast, Darts, scikit-learn).

Skill Score#

The Skill Score measures relative improvement over a naive baseline:

\[SS = 1 - \frac{\text{metric}(y, \hat{y})}{\text{metric}(y, y_{\text{naive}})}\]

It is a standalone function (not part of get_pointwise_metrics) because it requires a baseline forecast:

from twiga.core.metrics import skill_score

ss = skill_score(y, y_hat, y_naive=seasonal_naive_forecast, metric="mae")
# ss > 0: model beats the baseline; ss = 0: no improvement; ss < 0: worse

Using Point Metrics#

from twiga.core.metrics.point import evaluate_point_forecast

metrics_df = evaluate_point_forecast(result, metric_names=["mae", "rmse", "r2", "smape", "corr"])

Available metric keys: "res", "mae", "mase", "mse", "msse", "rmse", "rmsse", "mape", "wmape", "smape", "ope", "corr", "nbias", "r2", "rmsle", "medae".

If metric_names is None, all available metrics are computed.
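For intuition, a few of these keys can be reproduced directly in NumPy. This is a minimal sketch of the formulas in the table above, not the library's implementation — the functions in twiga/core/metrics/point.py additionally handle axes, multi-target arrays, and edge cases:

```python
import numpy as np

def point_metrics(y, y_hat):
    """Illustrative NumPy versions of a few point-metric keys."""
    err = y - y_hat
    return {
        "res": err.mean(),                   # mean (signed) error
        "mae": np.abs(err).mean(),           # mean absolute error
        "rmse": np.sqrt((err ** 2).mean()),  # root mean squared error
        "smape": 200.0 * np.mean(np.abs(err) / (np.abs(y) + np.abs(y_hat))),
        "r2": 1.0 - (err ** 2).sum() / ((y - y.mean()) ** 2).sum(),
    }

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])
m = point_metrics(y, y_hat)
```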

Interval Forecast Metrics#

Defined in twiga/core/metrics/interval.py. Used via evaluate_interval_forecast().

| Metric | Key | Formula | Description |
|---|---|---|---|
| Winkler Score | winkler_score | \(\frac{1}{n}\sum_t\left[(U_t - L_t) + \frac{2}{\alpha}(L_t - y_t)\mathbf{1}_{y_t < L_t} + \frac{2}{\alpha}(y_t - U_t)\mathbf{1}_{y_t > U_t}\right]\) | Penalises both interval width and missed coverage |
| Coverage (PICP) | picp | \(\frac{1}{n}\sum_t \mathbf{1}[L_t \leq y_t \leq U_t]\) | Proportion of true values within prediction intervals |
| ACE | ace | \(PICP - (1 - \alpha)\) | Average Coverage Error: signed deviation from the nominal level |
| NMPI | nmpi | \(\frac{\text{median}(U_t - L_t)}{\text{scale}}\) | Normalised Median Prediction Interval width |
| CWE | cwe | \(\lambda \cdot \frac{(1+\beta^2)\,\gamma_{\text{picp}}\,\gamma_{\text{nmpi}}}{\beta^2\,\gamma_{\text{nmpi}} + \gamma_{\text{picp}}}\) | Coverage–Width–Error: harmonic mean of coverage and width scores, weighted by forecast accuracy |
| γ_picp | — | \(\min\left(\frac{\text{PICP}}{1-\alpha},\, 1\right)\) | Coverage adequacy: linear ramp from 0 to 1 as PICP reaches the target; saturates at 1 for over-coverage |
| γ_nmpi | — | \(\min\left(\left(\frac{\kappa/R}{\text{NMPI}}\right)^k,\, 1\right)\) | Width efficiency: full credit when NMPI ≤ reference; inverse penalty for excess width |
| MSIS | msis | \(\frac{\frac{1}{n}\sum_t\left[(U_t - L_t) + \frac{2}{\alpha}(L_t - y_t)\mathbf{1}_{y_t < L_t} + \frac{2}{\alpha}(y_t - U_t)\mathbf{1}_{y_t > U_t}\right]}{\frac{1}{T-s}\sum_{i=s+1}^{T}\lvert y_i - y_{i-s}\rvert}\) | Mean Scaled Interval Score — M4/M5 competition standard |
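The Winkler score and PICP formulas reduce to a few lines of NumPy. The sketch below assumes 1-D arrays and is illustrative only — the library versions also support the axis and multi-target cases:

```python
import numpy as np

def winkler_and_picp(y, lower, upper, alpha):
    """Sketch of the Winkler score and PICP formulas from the table above."""
    width = upper - lower
    # 2/alpha penalty for each observation that falls outside the interval
    winkler = width \
        + (2.0 / alpha) * np.where(y < lower, lower - y, 0.0) \
        + (2.0 / alpha) * np.where(y > upper, y - upper, 0.0)
    picp = np.mean((y >= lower) & (y <= upper))
    return winkler.mean(), picp

y = np.array([10.0, 12.0, 20.0])
lower = np.array([9.0, 11.0, 13.0])
upper = np.array([11.0, 13.0, 15.0])
w, picp = winkler_and_picp(y, lower, upper, alpha=0.1)
```

The third observation misses the interval by 5, so its Winkler term is the width (2) plus 2/0.1 × 5 = 100 — a single bad miss dominates the average.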

MSIS#

The Mean Scaled Interval Score is the standard metric for prediction interval comparison in the M4/M5 competitions. It penalizes width and missed coverage, normalized by the naive seasonal forecast MAE:

from twiga.core.metrics import msis

score = msis(true, lower, upper, y_train=training_series, alpha=0.1, seasonality=24)
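The scaling step can be sketched in a few lines: the mean Winkler-style interval score divided by the in-sample seasonal-naive MAE. This is an illustrative reduction of the formula, not the library's msis implementation:

```python
import numpy as np

def msis_sketch(y, lower, upper, y_train, alpha, m=1):
    """Minimal MSIS: mean interval score over the forecast period,
    scaled by the lag-m naive MAE of the training series."""
    width = upper - lower
    penalty = (2.0 / alpha) * (np.maximum(lower - y, 0.0) + np.maximum(y - upper, 0.0))
    interval_score = np.mean(width + penalty)
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return interval_score / naive_mae

score = msis_sketch(
    y=np.array([10.0]), lower=np.array([9.0]), upper=np.array([11.0]),
    y_train=np.array([1.0, 2.0, 3.0, 4.0]), alpha=0.5,
)
```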

CWE — Coverage–Width–Error#

CWE is the primary interval quality metric in Twiga. It uses the F-β harmonic mean of two sub-scores, weighted by point-forecast accuracy λ:

\[\text{CWE} = \lambda \cdot \frac{(1+\beta^2)\,\gamma_{\text{picp}}\,\gamma_{\text{nmpi}}}{\beta^2\,\gamma_{\text{nmpi}} + \gamma_{\text{picp}}}\]

where \(\lambda = 1 - \text{error}\) (e.g. \(1 - \text{NRMSE}\)).

Sub-scores:

\[\gamma_{\text{picp}} = \min\!\left(\frac{\text{PICP}}{1-\alpha},\; 1\right)\]

Linear ramp from 0 at PICP = 0 to 1 at PICP = 1 − α. Over-coverage (PICP > 1 − α) does not increase this score — it is captured instead by γ_nmpi penalising the wider intervals that caused it.

\[\gamma_{\text{nmpi}} = \min\!\left(\left(\frac{\kappa/R}{\text{NMPI}}\right)^k,\; 1\right)\]

Asymmetric width score. Full credit (1) for any NMPI ≤ κ/R (sharp intervals are not penalised here). For NMPI > κ/R the score falls as the inverse ratio raised to power k — doubling the width halves the score (k = 1 default). Increase k for stricter penalisation of excess width.

Harmonic mean handles the tension naturally: a poor score on either dimension pulls CWE toward zero regardless of the other. An over-wide interval (high NMPI) is penalised even when PICP is perfect.

Reference κ/R: computed from the target distribution as spread ÷ scale. Spread defaults to IQR (robust); scale defaults to range (max − min). Both are configurable:

from twiga.core.metrics.interval import get_interval_metrics

# Default: IQR spread, range scale — best for skewed/heavy-tailed energy data
df = get_interval_metrics(true, lower, upper, alpha=0.1)

# MAD spread (maximum robustness for bimodal data, e.g. solar with nights)
df = get_interval_metrics(true, lower, upper, alpha=0.1, spread="mad")

# std spread (only for approximately Gaussian targets)
df = get_interval_metrics(true, lower, upper, alpha=0.1, spread="std")

# Custom scale denominator
df = get_interval_metrics(true, lower, upper, alpha=0.1, nmpi_scale="mean")

# Pass your own κ directly (e.g. domain knowledge on expected width)
df = get_interval_metrics(true, lower, upper, alpha=0.1, true_nmpi=15.0)

| spread | κ formula | Best for |
|---|---|---|
| "iqr" (default) | q₇₅ − q₂₅ | General; robust to outliers and heavy tails |
| "mad" | median(\|y − median(y)\|) | Bimodal or heavily skewed targets |
| "std" | σ | Approximately Gaussian targets only |

| nmpi_scale | R formula | Best for |
|---|---|---|
| "range" (default) | max − min | General; translation-invariant |
| "max" | max | Non-negative data where min ≈ 0 (equivalent to range) |
| "mean" | mean | When NMPI should read as a fraction of mean load |
| "median" | median | Robust alternative to mean for skewed data |
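Putting the pieces together, the CWE composition can be sketched with the documented defaults (κ from the IQR, R from the range). The `error` argument stands in for the point-forecast error term λ = 1 − error; this is a sketch of the formulas above, not the library's internals:

```python
import numpy as np

def cwe_sketch(y, lower, upper, alpha, beta=1.0, k=1.0, error=0.0):
    """Illustrative CWE: F-beta harmonic mean of coverage and width scores."""
    picp = np.mean((y >= lower) & (y <= upper))
    q25, q75 = np.quantile(y, [0.25, 0.75])
    kappa = q75 - q25                        # spread: IQR (default)
    R = y.max() - y.min()                    # scale: range (default)
    nmpi = np.median(upper - lower) / R
    g_picp = min(picp / (1.0 - alpha), 1.0)              # coverage adequacy
    g_nmpi = min(((kappa / R) / nmpi) ** k, 1.0)         # width efficiency
    lam = 1.0 - error
    return lam * (1 + beta**2) * g_picp * g_nmpi / (beta**2 * g_nmpi + g_picp)

y = np.arange(10.0)
val = cwe_sketch(y, y - 1.0, y + 1.0, alpha=0.1)   # sharp, fully covering
val_wide = cwe_sketch(y, y - 6.0, y + 6.0, alpha=0.1)  # over-wide intervals
```

With perfect coverage and sharp intervals both sub-scores saturate at 1 and CWE equals λ; widening the intervals leaves coverage perfect but pulls γ_nmpi, and hence CWE, down.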

Using Interval Metrics#

from twiga.core.metrics.interval import evaluate_interval_forecast

metrics_df = evaluate_interval_forecast(result, alpha=0.1)
# Returns: timestamp | target | res | mae | ... | winkler_score | picp | ace | nmpi | cwe

# Interval-metric tuning with MAD spread
metrics_df = evaluate_interval_forecast(result, alpha=0.1, spread="mad", nmpi_scale="range")

Note

evaluate_interval_forecast() computes both point metrics and interval metrics for each forecast day. The alpha parameter should match the significance level used during conformal calibration. CWE is always in [0, 1] — higher is better.

Quantile Forecast Metrics#

Defined in twiga/core/metrics/quantile.py. Used via evaluate_quantile_forecast().

| Metric | Key | Formula | Description |
|---|---|---|---|
| Pinball Loss | pinball | \(\frac{1}{KT}\sum_{\tau,t}\rho_\tau(y_t, \hat{q}_{\tau,t})\), where \(\rho_\tau(y,q)=\max(\tau(y-q),\,(\tau-1)(y-q))\) | Mean pinball (check) loss across all quantile levels |
| WQL | wql | \(\frac{2\sum_\tau\sum_t\rho_\tau(y_t, \hat{q}_{\tau,t})}{\lvert\mathcal{T}\rvert \cdot \sum_t\lvert y_t\rvert}\) | Weighted Quantile Loss — pinball loss normalised by total observation volume |
| CRPS | crps | \(\int_{-\infty}^{\infty}(F(z) - \mathbf{1}[z \geq y])^2\,dz\) (piecewise-linear \(F\)) | CRPS approximated from quantile CDF |
| Calibration Error | calibration_error | \(\frac{1}{K}\sum_\tau\lvert\hat{F}(\tau) - \tau\rvert\) | Mean absolute deviation of empirical coverage from nominal level |
| Sharpness | sharpness | \(\frac{1}{\lvert\mathcal{P}\rvert}\sum_{(\tau_l,\tau_u)\in\mathcal{P}}\frac{1}{T}\sum_t(\hat{q}_{\tau_u,t} - \hat{q}_{\tau_l,t})\) | Mean width of symmetric quantile-pair intervals \(\mathcal{P}\) — lower is sharper |
| PIT | — | histogram of \(\hat{F}(y_t)\) | Probability Integral Transform — uniform shape indicates calibration |

Weighted Quantile Loss (WQL)#

WQL is the GluonTS/M5 standard aggregate quantile metric, normalized by total observation volume for cross-series comparability:

\[WQL = \frac{2 \sum_{\tau} \sum_{t} \rho_\tau(y_t, \hat{q}_{\tau,t})}{|\mathcal{T}| \cdot \sum_t |y_t|}\]
from twiga.core.metrics import wql

score = wql(true, quantile_preds, quantile_levels=[0.1, 0.5, 0.9], quantile_axis=0)
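The factor of 2 has a concrete interpretation: at τ = 0.5 the pinball loss is exactly half the absolute error, so doubling it makes the median-quantile term of WQL comparable to the MAE. A quick NumPy check (illustrative only):

```python
import numpy as np

def pinball(y, q, tau):
    """Pinball (check) loss at level tau — the building block of WQL."""
    d = y - q
    return np.mean(np.maximum(tau * d, (tau - 1.0) * d))

y = np.array([1.0, 3.0, 5.0])
q = np.array([2.0, 3.0, 3.0])
mae = np.mean(np.abs(y - q))
# 2 * pinball at the median equals the MAE
check = 2.0 * pinball(y, q, 0.5)
```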

Using Quantile Metrics#

from twiga.core.metrics.quantile import evaluate_quantile_forecast

metrics_df = evaluate_quantile_forecast(result, alpha=0.1)
# Columns: point metrics + picp | ace | nmpi | cwe | pinball | wql | crps | calibration_error | sharpness
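The calibration_error key compares the empirical coverage of each quantile level against its nominal level. A sketch of that computation, assuming the quantile predictions are stacked as (K, T) — the library infers the quantile axis from the result object:

```python
import numpy as np

def calibration_error(y, quantile_preds, levels):
    """Mean |empirical coverage - tau| over the K quantile levels."""
    empirical = np.array([np.mean(y <= q) for q in quantile_preds])
    return np.mean(np.abs(empirical - np.asarray(levels)))

y = np.arange(1.0, 11.0)
# a constant median forecast at 5.5 covers exactly half of y -> error 0
quantile_preds = np.full((1, 10), 5.5)
ce = calibration_error(y, quantile_preds, [0.5])
```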

Probabilistic Metrics#

Defined in twiga/core/metrics/parametric.py. Used via evaluate_parametric_forecast().

Metric

Key

Formula

Description

CRPS

crps

\(\mathbb{E}_F|X - y| - \tfrac{1}{2}\mathbb{E}_F|X - X'|\)

Sample-based CRPS — proper scoring rule generalising MAE to distributions

CRPS (PWM)

crps_pwm

(PWM estimator of the same quantity)

CRPS via Probability Weighted Moments — more stable for small sample counts

DSS

dss

\(\frac{(y_t - \mu_t)^2}{\sigma_t^2} + 2\log\sigma_t\)

Dawid-Sebastiani Score — proper rule jointly rewarding calibration and sharpness

Energy Score

energy_score

\(\mathbb{E}_F|\mathbf{X}-\mathbf{y}|_2 - \tfrac{1}{2}\mathbb{E}_F|\mathbf{X}-\mathbf{X}'|_2\)

Multivariate generalisation of CRPS for joint forecast distributions

Brier Score

brier_score

\(\frac{1}{n}\sum_t(\hat{p}_t - o_t)^2\)

Probabilistic binary event forecast accuracy

NLL

nll

\(-\frac{1}{n}\sum_t\log f(y_t\mid\theta_t)\)

Negative log-likelihood under the predictive distribution
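The sample-based CRPS estimator in the table is short enough to write out directly. A sketch for a single scalar observation (the library versions vectorise over the horizon and use the more stable PWM estimator when requested):

```python
import numpy as np

def crps_samples(y, samples):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| from an ensemble."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    # pairwise absolute differences between ensemble members
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

c_perfect = crps_samples(0.0, np.zeros(5))   # degenerate ensemble at y
c_spread = crps_samples(1.0, [0.0, 2.0])     # symmetric two-member ensemble
```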

Dawid-Sebastiani Score (DSS)#

DSS is a proper scoring rule for location-scale parametric forecasts, jointly rewarding sharpness and calibration:

\[DSS(y_t, \mu_t, \sigma_t) = \frac{(y_t - \mu_t)^2}{\sigma_t^2} + 2\log\sigma_t\]
from twiga.core.metrics.parametric import dawid_sebastiani_score

dss = dawid_sebastiani_score(true, loc=predicted_means, scale=predicted_stds)

Energy Score#

The Energy Score generalises CRPS to multivariate (joint horizon) forecast distributions:

\[ES(F, \mathbf{y}) = \mathbb{E}_F\|\mathbf{X} - \mathbf{y}\|_2 - \tfrac{1}{2}\mathbb{E}_F\|\mathbf{X} - \mathbf{X}'\|_2\]
from twiga.core.metrics.parametric import energy_score

# samples shape: (n_ensemble_members, forecast_horizon)
es = energy_score(true=true_vector, samples=ensemble_samples)

Brier Score#

Measures accuracy of probabilistic binary event forecasts (e.g., exceedance probability):

\[BS = \frac{1}{n}\sum_t(\hat{p}_t - o_t)^2 \quad \hat{p}_t = \frac{1}{M}\sum_i \mathbb{1}(x_{i,t} \leq \text{threshold})\]
from twiga.core.metrics.parametric import brier_score

bs = brier_score(true, samples=ensemble_samples, threshold=peak_demand)
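The event probability is estimated from the ensemble as the fraction of members on the event side of the threshold. A sketch following the formula above (using the \(y_t \leq\) threshold convention; the library's brier_score handles axes and validation):

```python
import numpy as np

def brier_from_samples(y, samples, threshold):
    """Brier score with event probabilities estimated from an ensemble."""
    p_hat = np.mean(samples <= threshold, axis=0)    # shape (n_obs,)
    o = (y <= threshold).astype(float)               # observed event indicator
    return np.mean((p_hat - o) ** 2)

y = np.array([1.0, 5.0])
# an ensemble whose members all equal the truth scores a perfect 0
samples = np.tile(y, (4, 1))
bs = brier_from_samples(y, samples, threshold=3.0)
```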

Using Probabilistic Metrics#

from twiga.core.metrics.parametric import evaluate_parametric_forecast

# Sample-based forecasts (CRPS + Energy Score)
metrics_df = evaluate_parametric_forecast(samples_result)

# Parametric forecasts (NLL + DSS)
metrics_df = evaluate_parametric_forecast(parametric_result)

Or via the low-level helper for direct array inputs:

from twiga.core.metrics.parametric import get_probabilistic_metrics

# Parametric mode (loc + scale)
metrics_df = get_probabilistic_metrics(true, loc=mu, scale=sigma)

# Sample-based mode
metrics_df = get_probabilistic_metrics(true, samples=ensemble_samples)

Analytical Parametric Scoring#

Defined in twiga/core/metrics/parametric.py. Closed-form implementations for common parametric distributions — no Monte Carlo sampling needed.

| Metric | Key | Distribution | Formula | Description |
|---|---|---|---|---|
| Normal NLL | normal_nll | \(\mathcal{N}(\mu,\sigma^2)\) | \(\frac{1}{n}\sum_t\left[\frac{1}{2}\log(2\pi\sigma_t^2) + \frac{(y_t-\mu_t)^2}{2\sigma_t^2}\right]\) | Negative log-likelihood under a Normal predictive distribution |
| Normal CRPS | normal_crps | \(\mathcal{N}(\mu,\sigma^2)\) | \(\sigma\left[z(2\Phi(z)-1) + 2\phi(z) - \tfrac{1}{\sqrt{\pi}}\right],\; z=\tfrac{y-\mu}{\sigma}\) | Exact analytical CRPS — no quantile approximation needed |
| Laplace NLL | laplace_nll | \(\text{Laplace}(\mu,b)\) | \(\frac{1}{n}\sum_t\left[\log(2b_t) + \frac{\lvert y_t-\mu_t\rvert}{b_t}\right]\) | Negative log-likelihood under a Laplace predictive distribution |

from twiga.core.metrics.parametric import normal_nll, normal_crps, laplace_nll

nll = normal_nll(true, loc=mu, scale=sigma)
crps = normal_crps(true, loc=mu, scale=sigma)   # exact closed form

The analytical normal_crps uses:

\[CRPS(\mathcal{N}(\mu,\sigma^2), y) = \sigma\left[z(2\Phi(z)-1) + 2\phi(z) - \frac{1}{\sqrt{\pi}}\right], \quad z = \frac{y-\mu}{\sigma}\]
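For a scalar case, the closed form needs only the standard Normal pdf and cdf, which the standard library's erf provides. A sketch of what normal_crps evaluates (the real function is vectorised over arrays):

```python
import math

def normal_crps_closed_form(y, mu, sigma):
    """Analytical CRPS of a Normal(mu, sigma^2) forecast at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(z)
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

value = normal_crps_closed_form(0.0, 0.0, 1.0)
```

At \(y = \mu\) this reduces to \(\sigma(2\phi(0) - 1/\sqrt{\pi}) \approx 0.2337\sigma\), and CRPS scales linearly in \(\sigma\) — sharper forecasts score strictly better when centred correctly.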

Statistical Significance Testing#

Defined in twiga/core/metrics/stats.py.

Both tests share the same interface: they accept raw forecast arrays (p_real, p_pred_1, p_pred_2) of shape (n_days, n_steps_per_day) and return a one-sided p-value. A small p-value (< 0.05) means p_pred_2 is significantly more accurate than p_pred_1.

| Test | Key | Statistic | Formula | What it tests |
|---|---|---|---|---|
| DM | diebold_mariano_test | \(\mathcal{N}(0,1)\) | \(DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{\text{HAC}} / T}}\) | Unconditional equal predictive accuracy |
| GW | giacomini_white_test | \(\chi^2(q)\) | \(T \cdot R^2\) from OLS on instruments \(h_t = [1, d_{t-1}]\) | Conditional predictive accuracy (exploits information set) |

Diebold-Mariano (DM) Test#

Tests unconditional equal accuracy. The loss differential \(d_t = g(e_{1,t}) - g(e_{2,t})\) is tested against zero using a HAC Bartlett variance estimator to handle serial correlation in multi-step forecasts:

\[DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{\text{HAC}} / T}} \xrightarrow{d} \mathcal{N}(0,1)\]
from twiga.core.metrics import diebold_mariano_test

# Multivariate (single test, averaging across steps)
p = diebold_mariano_test(p_real, p_pred_1, p_pred_2, norm=1, version="multivariate", h=24)

# Univariate (one test per step, e.g. per hour in a 24h horizon)
p_per_hour = diebold_mariano_test(p_real, p_pred_1, p_pred_2, norm=1, version="univariate", h=24)
# p_per_hour.shape == (24,)
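The mechanics of the test fit in a short sketch: compute the loss differential, estimate the HAC variance of its mean with a Bartlett kernel over h − 1 lags, and read a one-sided p-value off the Normal tail. This simplified version (absolute-error loss, i.e. norm=1) is illustrative, not the library implementation:

```python
import math
import numpy as np

def dm_sketch(y, pred1, pred2, h=1):
    """Simplified one-sided Diebold-Mariano test on |error| loss."""
    d = np.abs(y - pred1) - np.abs(y - pred2)   # loss differential d_t
    T = d.size
    d_bar = d.mean()
    # Bartlett-kernel HAC variance of d (h-1 autocovariance lags)
    var = np.mean((d - d_bar) ** 2)
    for lag in range(1, h):
        w = 1.0 - lag / h
        var += 2.0 * w * np.mean((d[lag:] - d_bar) * (d[:-lag] - d_bar))
    dm = d_bar / math.sqrt(var / T)
    # one-sided p-value: small when pred2 is significantly more accurate
    return 1.0 - 0.5 * (1.0 + math.erf(dm / math.sqrt(2.0)))

rng = np.random.default_rng(0)
y = rng.normal(size=500)
good = y + rng.normal(scale=0.1, size=500)
bad = y + rng.normal(scale=1.0, size=500)
p_better = dm_sketch(y, bad, good)   # good beats bad -> p near 0
p_worse = dm_sketch(y, good, bad)    # reversed ordering -> p near 1
```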

Giacomini-White (GW) Test#

Tests conditional predictive accuracy — it checks whether one forecast systematically exploits information the other ignores. Uses instruments \(h_t = [1, d_{t-1}]\) in an OLS regression; the test statistic is \(T \cdot R^2 \sim \chi^2(q)\):

from twiga.core.metrics import giacomini_white_test

# Multivariate (single joint test)
p = giacomini_white_test(p_real, p_pred_1, p_pred_2, norm=1, version="multivariate")

# Univariate (per-step tests)
p_per_hour = giacomini_white_test(p_real, p_pred_1, p_pred_2, norm=1, version="univariate")

When to use which test:

  • Use DM for a straightforward comparison of two forecasts without conditioning on past information.

  • Use GW when you suspect one model exploits time-varying information (e.g., regime changes, calendar effects) that the other does not — GW is more powerful in these situations.

Visualising the test matrix#

Both functions return arrays when version="univariate", making it straightforward to build a model comparison heatmap:

import pandas as pd
from twiga.core.metrics import diebold_mariano_test

models = {"Model A": pred_a, "Model B": pred_b, "Naive": pred_naive}
p_matrix = pd.DataFrame(index=list(models), columns=list(models), dtype=float)

for m1, p1 in models.items():
    for m2, p2 in models.items():
        if m1 == m2:
            p_matrix.loc[m1, m2] = 1.0
        else:
            p_matrix.loc[m1, m2] = diebold_mariano_test(p_real, p1, p2, version="multivariate")

Unified Dispatch#

Use evaluate_forecast() as the single entry point — it dispatches automatically based on result.kind:

from twiga.core.metrics import evaluate_forecast

metrics_df = evaluate_forecast(result)          # auto-dispatches
metrics_df = evaluate_forecast(result, alpha=0.1)  # for interval/quantile kinds

Supported kind values:

| kind | Evaluate function | Key metrics |
|---|---|---|
| "point" | evaluate_point_forecast | MAE, RMSE, R², MedAE, RMSLE, … |
| "interval" | evaluate_interval_forecast | point metrics + PICP, ACE, NMPI, MSIS, CWE |
| "quantile" | evaluate_quantile_forecast | point + interval + pinball, WQL, CRPS, sharpness |
| "samples" | evaluate_parametric_forecast | point metrics + CRPS, Energy Score |
| "parametric" | evaluate_parametric_forecast | point metrics + NLL, DSS |
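The dispatch-table pattern described above can be sketched in a few lines. The names here are illustrative, not the library's actual internals — evaluate_forecast() documents only that an internal table keyed on kind selects the evaluate function and raises ValueError otherwise:

```python
# Hypothetical sketch of a dispatch table keyed on forecast kind.
_DISPATCH = {
    "point": lambda result, **kw: f"point metrics for {result}",
    "interval": lambda result, **kw: f"interval metrics for {result}",
}

def evaluate(kind, result, **kwargs):
    try:
        return _DISPATCH[kind](result, **kwargs)
    except KeyError:
        raise ValueError(f"unsupported kind: {kind!r}") from None
```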

API Reference#

twiga.core.metrics.evaluate_forecast(result, **kwargs)#

Dispatch forecast evaluation to the kind-appropriate function.

This is the single entry point for evaluation so callers do not need to know or import individual evaluate functions. The correct function is selected from an internal dispatch table keyed on kind.

Parameters:
  • result (ForecastResult) – A ForecastResult whose ground_truth field is set. The kind attribute determines which evaluate function is called.

  • **kwargs (object) – Extra keyword arguments forwarded to the underlying evaluate function (e.g. alpha for interval or quantile forecasts).

Return type:

DataFrame

Returns:

DataFrame of per-day, per-target metrics indexed by daily timestamp.

Raises:

ValueError – If result.ground_truth is None or result.kind is not one of the supported values.

Examples

>>> result = ForecastResult(kind=ForecastKind.POINT, ground_truth=y, ...)
>>> metrics_df = evaluate_forecast(result)
twiga.core.metrics.point.evaluate_point_forecast(result, metric_names=None, axis=1)#

Evaluate point forecasts by computing daily pointwise metrics.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth set, kind=ForecastKind.POINT.

  • metric_names (list[str] | None) – Metric names to compute. When None all supported point metrics are computed.

  • axis (int | None) – Axis along which to compute aggregate metrics. If None, metrics that require an axis will use their default behavior.

Return type:

DataFrame

Returns:

DataFrame of per-day, per-target metrics indexed by daily timestamp.

twiga.core.metrics.point.r_squared(y, y_hat, axis=0)#

Calculate the Coefficient of Determination (R²) between actual and predicted values.

R² measures the proportion of variance in the actual values explained by the predictions: \[R^2 = 1 - \frac{\sum_{t=1}^{T} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2}\] where \(\bar{y}\) is the mean of the actual values. A perfect forecast yields \(R^2 = 1\); a constant mean prediction yields \(R^2 = 0\); worse-than-mean predictions can yield negative values.

Parameters:
  • y (TypeAliasType) – Actual values, shape (n_samples,) or (n_samples, n_features).

  • y_hat (TypeAliasType) – Predicted values, same shape as y.

  • axis (int | None) – Axis along which to compute R². If None, flatten arrays. Defaults to 0.

Return type:

float | ndarray

Returns:

R² coefficient, scalar if 1D input, array if multi-dimensional. Returns NaN if the variance of y is zero.

Raises:

ValueError – If y and y_hat have incompatible shapes.

Examples

>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> y_hat = np.array([1.1, 2.0, 2.9, 4.1, 5.0])
>>> r_squared(y, y_hat)
0.997
twiga.core.metrics.point.rmsle(y, y_hat, axis=0)#

Calculate the Root Mean Squared Log Error (RMSLE).

The RMSLE penalises under-predictions more than over-predictions and is scale-invariant on a log scale: \[RMSLE = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \bigl(\log(1 + \hat{y}_t) - \log(1 + y_t)\bigr)^2}\]

Parameters:
  • y (TypeAliasType) – Actual values, must be non-negative, shape (n_samples,) or (n_samples, n_features).

  • y_hat (TypeAliasType) – Predicted values, must be non-negative, same shape as y.

  • axis (int | None) – Axis along which to compute the mean. If None, flatten arrays. Defaults to 0.

Return type:

float | ndarray

Returns:

RMSLE, scalar if 1D input, array if multi-dimensional.

Raises:

ValueError – If any value in y or y_hat is negative.

Examples

>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0])
>>> y_hat = np.array([1.1, 1.9, 3.2])
>>> rmsle(y, y_hat)
0.0443...
twiga.core.metrics.point.medae(y, y_hat, axis=0)#

Calculate the Median Absolute Error (MedAE) between actual and predicted values.

MedAE is robust to outliers, unlike MAE: \[MedAE = \text{median}\bigl(\lvert y_t - \hat{y}_t \rvert\bigr)\]

Parameters:
  • y (TypeAliasType) – Actual values, shape (n_samples,) or (n_samples, n_features).

  • y_hat (TypeAliasType) – Predicted values, same shape as y.

  • axis (int | None) – Axis along which to compute the median. If None, flatten arrays. Defaults to 0.

Return type:

float | ndarray

Returns:

Median absolute error, scalar if 1D input, array if multi-dimensional.

Raises:

ValueError – If y and y_hat have incompatible shapes.

Examples

>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 100.0])
>>> y_hat = np.array([1.1, 2.0, 2.9, 90.0])
>>> medae(y, y_hat)
0.1
twiga.core.metrics.point.skill_score(y, y_hat, y_naive, metric='mae', axis=0)#

Calculate the Skill Score of a forecast relative to a naive baseline.

The Skill Score quantifies the relative improvement of a model over a naive baseline forecast: \[SS = 1 - \frac{\text{metric}(y, \hat{y})}{\text{metric}(y, y_{\text{naive}})}\] A positive score indicates improvement over the baseline; 0 means no improvement; negative values indicate the model is worse than the baseline.

Parameters:
  • y (TypeAliasType) – Actual values, shape (n_samples,).

  • y_hat (TypeAliasType) – Model predictions, same shape as y.

  • y_naive (TypeAliasType) – Naive baseline predictions (e.g. seasonal naïve, persistence), same shape as y.

  • metric (str) – Error metric to use for comparison. One of 'mae', 'mse', 'rmse'. Defaults to 'mae'.

  • axis (int | None) – Axis along which to compute the metric. If None, flatten arrays. Defaults to 0.

Return type:

float | ndarray

Returns:

Skill score(s). Scalar if 1D input or axis=None, otherwise array.

Raises:
  • ValueError – If metric is not one of the supported values.

  • ValueError – If the naive baseline metric is zero.

Examples

>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 4.0])
>>> y_hat = np.array([1.1, 2.1, 3.0, 4.0])
>>> y_naive = np.array([1.0, 1.0, 1.0, 1.0])
>>> skill_score(y, y_hat, y_naive, metric="mae")
0.9666...
twiga.core.metrics.interval.evaluate_interval_forecast(result, alpha=0.01, true_nmpi=None, spread='iqr', nmpi_scale='range', axis=1, metric_names=None)#

Evaluate interval forecasts by computing daily point and interval metrics.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth, lower, and upper set, kind=ForecastKind.INTERVAL.

  • alpha (float) – Significance level used for Winkler score and coverage computations. Must be in (0, 1). Defaults to 0.01.

  • true_nmpi (float | None) – Override for κ — absolute spread of the target used as the CWE reference numerator. When None, derived from spread.

  • spread (Literal['iqr', 'mad', 'std']) – Spread measure for the CWE reference κ. "iqr" (default), "mad", or "std". See get_interval_metrics().

  • nmpi_scale (Literal['range', 'max', 'mean', 'median']) – Denominator R for NMPI and κ/R. "range" (default), "max", "mean", or "median".

  • axis (int | None) – Axis along which to compute aggregate metrics.

  • metric_names (list[str] | None) – List of interval metric names to compute.

Return type:

DataFrame

Returns:

DataFrame of per-day, per-target point and interval metrics indexed by daily timestamp.

twiga.core.metrics.interval.msis(true, lower, upper, y_train, alpha, seasonality=1, axis=None)#

Calculate the Mean Scaled Interval Score (MSIS).

\[\text{MSIS} = \frac{\text{Mean}(W_t)}{\text{NaiveMAE}_{\text{seasonal}}}\]

Parameters:
  • true (TypeAliasType) – True values for forecast period.

  • lower (TypeAliasType) – Lower bounds.

  • upper (TypeAliasType) – Upper bounds.

  • y_train (TypeAliasType) – Historical training data.

  • alpha (float) – Significance level.

  • seasonality (int) – Seasonal period.

  • axis (int | None) – Axis for Winkler score averaging.

Return type:

float | ndarray

Returns:

MSIS (scalar or array).

twiga.core.metrics.quantile.evaluate_quantile_forecast(result, alpha=0.01, true_nmpi=None, metric_names=None, quantile_axis=1, axis=1)#

Evaluate quantile forecasts by computing daily point, interval, and quantile metrics.

This function loops over days and targets (using _evaluate_loop) and aggregates point metrics (from get_pointwise_metrics()), interval metrics (from get_interval_metrics()), and quantile metrics (from get_quantile_metrics()) into a single DataFrame.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth, quantiles, and quantile_levels set, and kind=ForecastKind.QUANTILE.

  • alpha (float) – Significance level for interval metrics; the outermost quantile pair (first and last) is used as lower/upper bounds. Defaults to 0.01.

  • true_nmpi (float | None) – Optional reference value for NMPI normalisation. When None the standard deviation of the true values is used.

  • metric_names (list[str] | None) – Point metric names to compute. If None, all supported point metrics are computed.

  • quantile_axis (int) – Axis of the quantile predictions that holds the quantiles. Defaults to 1.

  • axis (int) – Axis or axes passed to get_quantile_metrics() (aggregation over the forecast horizon). Defaults to 1.

Return type:

DataFrame

Returns:

DataFrame of per‑day, per‑target metrics indexed by daily timestamp. Columns include point metrics, interval metrics, and quantile metrics.

twiga.core.metrics.quantile.wql(true, quantile_preds, quantile_levels, quantile_axis=None, axis=1, double_factor=True, normalize_by_volume=True)#

Weighted Quantile Loss (WQL), also known as the scaled pinball loss.

WQL is the GluonTS / M5 standard aggregate quantile metric:

\[\text{WQL} = \frac{2 \mathbf{1}_{\text{double}} \sum_{\tau,t} \rho_\tau(y_t, \hat{q}_{\tau,t})} {|\mathcal{T}| \cdot \sum_t |y_t| \mathbf{1}_{\text{normalize}}}\]

where \(\rho_\tau\) is the pinball loss. The factor \(2\) is often included to make the loss at \(\tau=0.5\) equal to the MAE.

Parameters:
  • true (ndarray) – Ground truth values, any shape.

  • quantile_preds (ndarray) – Predicted quantiles. Must have one axis holding the quantiles; the remaining dimensions must match true.shape after axis movement.

  • quantile_levels (ndarray | list[float] | tuple[float, ...]) – Quantile levels \(\tau \in (0,1)\), strictly increasing.

  • quantile_axis (int | None) – Axis of quantile_preds that holds the quantiles. If None, it is inferred (first axis if its length matches, otherwise second).

  • axis (int) – Axis or axes along which to aggregate the WQL. The total volume (sum of \(|y|\)) is computed over this axis, and the pinball loss is summed over the same axis. If None, aggregate over all dimensions and return a scalar.

  • double_factor (bool) – Whether to multiply by \(2\) (default True).

  • normalize_by_volume (bool) – Whether to divide by the total absolute volume \(\sum_t |y_t|\) (default True).

Return type:

float | ndarray

Returns:

If axis is None, a scalar WQL. Otherwise, an array with the same shape as true after removing the dimensions specified by axis.

Raises:

ValueError – If inputs are empty, levels out of range, shape mismatch, or total observation volume is zero when normalize_by_volume is True.

References

Alexandrov et al. (2020). GluonTS: Probabilistic and neural time series modeling in Python. Journal of Machine Learning Research, 21(116), 1–6.

twiga.core.metrics.parametric.evaluate_parametric_forecast(result, metric_names=None)#

Evaluate parametric forecasts by computing daily point and NLL metrics.

The NLL is computed under a Normal distribution assumption when result.log_likelihood is absent. Pass pre-computed log-likelihood values via that attribute to override this behaviour.

Parameters:
  • result (ForecastResult) – ForecastResult with ground_truth set and kind=ForecastKind.SAMPLES or ForecastKind.PARAMETRIC.

  • metric_names (list[str] | None) – Probabilistic metric names to compute. When None, all metrics available for the forecast kind are computed.
Return type:

DataFrame

Returns:

DataFrame of per-day, per-target point metrics and NLL indexed by daily timestamp.

Raises:

ValueError – If any scale value is non-positive.

twiga.core.metrics.parametric.get_probabilistic_metrics(true, loc=None, scale=None, samples=None, log_likelihood=None, metric_names=None, axis=1)#

Compute probabilistic and pointwise metrics for forecasts.

Supports two modes:

  • Parametric (Normal distribution): provide loc and scale.

  • Sample‑based (ensemble): provide samples.

If loc is provided, pointwise metrics (MAE, RMSE, etc.) are also computed via get_pointwise_metrics().

Parameters:
  • true (TypeAliasType) – Ground‑truth values, any shape.

  • loc (TypeAliasType | None) – Point predictions (e.g., mean of the predictive distribution). Required for parametric mode; if given, pointwise metrics are added.

  • scale (TypeAliasType | None) – Predicted standard deviations (parametric Normal). Must be >0. Required for parametric mode.

  • samples (TypeAliasType | None) – Ensemble/posterior samples, shape (n_samples, *true.shape). Required for sample‑based mode.

  • log_likelihood (TypeAliasType | None) – Pre‑computed log‑likelihood values (optional). If provided, overrides NLL computation in parametric mode.

  • metric_names (list[str] | None) – List of probabilistic metric names to compute. If None, all available metrics for the chosen mode are computed. Supported names — parametric: 'nll', 'crps', 'dss'; sample‑based: 'crps', 'energy_score', 'crps_pwm', 'brier_score'.

  • axis (int | None) – Axis along which to aggregate the metrics (e.g., over the forecast horizon). If None, aggregate over all dimensions (scalar result). Defaults to 1.

Return type:

DataFrame

Returns:

Single‑row DataFrame containing all pointwise metrics (if loc was provided) plus the selected probabilistic metrics.

Raises:

ValueError – If neither scale nor samples is provided, or if required inputs are missing for a mode, or if scale contains non‑positive values.

twiga.core.metrics.parametric.dawid_sebastiani_score(true, loc, scale, axis=None)#

Compute the Dawid-Sebastiani Score (DSS) for parametric Normal forecasts.

The Dawid-Sebastiani Score under a Normal distribution is:

\[\text{DSS}(y_t, \mu_t, \sigma_t) = \frac{(y_t - \mu_t)^2}{\sigma_t^2} + 2 \log \sigma_t\]

Lower values indicate better forecasts.

Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • loc (TypeAliasType) – Predicted means.

  • scale (TypeAliasType) – Predicted standard deviations (\(\sigma > 0\)).

  • axis (int | None) – Axis along which to compute the mean DSS. If None (default), the mean is taken over the entire array.

Return type:

ndarray

Returns:

Mean DSS as a NumPy array (scalar when axis=None).

Raises:

ValueError – If any scale value is non-positive.

References

Dawid & Sebastiani (1999). Coherent dispersion criteria for optimal experimental design. The Annals of Statistics.

twiga.core.metrics.parametric.energy_score(true, samples, axis=None)#

Compute the Energy Score (ES) for ensemble forecasts.

The Energy Score is the multivariate generalization of CRPS:

\[\text{ES}(F, \mathbf{y}) = \mathbb{E}_F\|\mathbf{X} - \mathbf{y}\|_2 - \tfrac{1}{2} \mathbb{E}_F\|\mathbf{X} - \mathbf{X}'\|_2\]
Parameters:
  • true (TypeAliasType) – Ground-truth values, shape (n_obs,) or higher.

  • samples (TypeAliasType) – Ensemble samples, shape (n_samples, n_obs) or compatible.

  • axis (int | None) – Axis along which to average (usually the observation axis). If None (default), computes globally.

Return type:

ndarray

Returns:

Energy Score as a NumPy array (scalar when axis=None). Lower is better.

References

Gneiting & Raftery (2007). Strictly proper scoring rules. JASA.
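
The two expectations can be estimated directly from the ensemble. The sketch below is an illustrative \(O(m^2)\) estimator for samples of shape (n_samples, d), not the twiga implementation:

```python
import numpy as np

def energy_score_sketch(true, samples):
    """Energy Score from an ensemble (illustrative O(m^2) estimator)."""
    true = np.atleast_1d(true)                       # shape (d,)
    # E_F ||X - y||: mean distance from each member to the observation
    term1 = np.linalg.norm(samples - true, axis=-1).mean()
    # E_F ||X - X'||: mean pairwise distance between members
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = np.linalg.norm(diffs, axis=-1).mean()
    return term1 - 0.5 * term2
```

A degenerate ensemble whose members all equal the observation scores exactly 0, the minimum of this strictly proper rule.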

twiga.core.metrics.parametric.brier_score(true, samples, threshold, axis=None)#

Compute the Brier Score for a probabilistic binary event forecast.

The Brier Score is:

\[\text{BS} = \frac{1}{n} \sum_{t=1}^{n} \left( \hat{p}_t - o_t \right)^2\]

where \(\hat{p}_t\) is the predicted probability that \(y_t \leq\) threshold (the fraction of ensemble members at or below the threshold) and \(o_t = \mathbb{1}\{y_t \leq \text{threshold}\}\) is the observed binary outcome.

Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • samples (TypeAliasType) – Ensemble samples, shape (n_samples, n_obs).

  • threshold (float) – Event threshold.

  • axis (int | None) – Axis along which to compute the mean Brier Score. If None (default), the mean is taken over all observations.

Return type:

ndarray

Returns:

Brier Score as a NumPy array (scalar when axis=None). Lower is better.

References

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review.
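
The sample-based estimator is a one-liner per term; the sketch below illustrates the formula above and is not the twiga implementation:

```python
import numpy as np

def brier_score_sketch(true, samples, threshold):
    """Brier Score for the event {y <= threshold} (illustrative)."""
    # Forecast probability: fraction of ensemble members at/below threshold
    p_hat = (samples <= threshold).mean(axis=0)
    # Observed outcome: 1 if the event occurred, else 0
    o = (true <= threshold).astype(float)
    return np.mean((p_hat - o) ** 2)
```

For example, an ensemble split evenly around the threshold gives \(\hat{p} = 0.5\), so the score is 0.25 regardless of the outcome.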

twiga.core.metrics.parametric.normal_nll(true, loc, scale, axis=None)#

Negative log-likelihood under a Normal predictive distribution.

The negative log-likelihood (NLL) is:

\[\text{NLL} = \frac{1}{n} \sum_{t=1}^{n} \left[ \frac{1}{2} \log(2\pi \sigma_t^2) + \frac{(y_t - \mu_t)^2}{2\sigma_t^2} \right]\]
Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • loc (TypeAliasType) – Predicted means (\(\mu\)).

  • scale (TypeAliasType) – Predicted standard deviations (\(\sigma > 0\)).

  • axis (int | None) – Axis along which to compute the mean. If None (default), the mean is taken over the entire array.

Return type:

ndarray

Returns:

Negative log-likelihood as a NumPy array (scalar when axis=None).

Raises:

ValueError – If any scale value is non-positive.

Examples

>>> import numpy as np
>>> normal_nll(np.array([1.0, 2.0]), np.array([1.1, 1.9]), np.array([0.5, 0.5]))
array(0.2457...)
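
The value can be checked term by term against the formula above. The sketch below is an illustrative re-derivation, not the twiga implementation:

```python
import numpy as np

def normal_nll_sketch(true, loc, scale):
    """Mean Gaussian NLL, term by term from the formula above (illustrative)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * scale ** 2)
                   + (true - loc) ** 2 / (2.0 * scale ** 2))
```
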
twiga.core.metrics.parametric.normal_crps(true, loc, scale, axis=None)#

Analytical Continuous Ranked Probability Score (CRPS) for a Normal predictive distribution.

The closed-form CRPS under \(\mathcal{N}(\mu, \sigma^2)\) is:

\[\text{CRPS}(\mathcal{N}(\mu, \sigma^2), y) = \sigma \left[ z \left( 2\Phi(z) - 1 \right) + 2\phi(z) - \frac{1}{\sqrt{\pi}} \right]\]

where \(z = \frac{y - \mu}{\sigma}\), and \(\Phi\), \(\phi\) are the standard Normal CDF and PDF respectively.

Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • loc (TypeAliasType) – Predicted means (\(\mu\)).

  • scale (TypeAliasType) – Predicted standard deviations (\(\sigma > 0\)).

  • axis (int | None) – Axis along which to compute the mean CRPS. If None (default), the mean is taken over the entire array.

Return type:

ndarray

Returns:

Mean CRPS as a NumPy array (scalar when axis=None). Lower is better.

Raises:

ValueError – If any scale value is non-positive.

References

Gneiting, T., Raftery, A. E., Westveld, A. H., & Goldman, T. (2005). Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133(5), 1098–1118.

Examples

>>> import numpy as np
>>> normal_crps(np.array([1.0, 2.0]), np.array([1.0, 2.0]), np.array([1.0, 1.0]))
array(0.2336...)   # σ(√2 − 1)/√π when y = μ
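
The closed form is straightforward to evaluate with SciPy's standard Normal CDF and PDF; the sketch below illustrates the formula above and is not the twiga implementation:

```python
import numpy as np
from scipy import stats

def normal_crps_sketch(true, loc, scale):
    """Mean closed-form Gaussian CRPS (illustrative)."""
    z = (true - loc) / scale
    crps = scale * (z * (2.0 * stats.norm.cdf(z) - 1.0)
                    + 2.0 * stats.norm.pdf(z)
                    - 1.0 / np.sqrt(np.pi))
    return np.mean(crps)
```

When \(y = \mu\) the expression reduces to \(\sigma(\sqrt{2} - 1)/\sqrt{\pi} \approx 0.2337\,\sigma\), the irreducible CRPS of a correctly centred Normal forecast.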
twiga.core.metrics.parametric.laplace_nll(true, loc, scale, axis=None)#

Negative log-likelihood under a Laplace predictive distribution.

The negative log-likelihood (NLL) is:

\[\text{NLL} = \frac{1}{n} \sum_{t=1}^{n} \left[ \log(2b_t) + \frac{|y_t - \mu_t|}{b_t} \right]\]

where \(b_t\) is the scale parameter of the Laplace distribution.

Parameters:
  • true (TypeAliasType) – Ground-truth values.

  • loc (TypeAliasType) – Predicted means (\(\mu\)).

  • scale (TypeAliasType) – Predicted Laplace scale (\(b > 0\)).

  • axis (int | None) – Axis along which to compute the mean NLL. If None (default), the mean is taken over the entire array.

Return type:

ndarray

Returns:

Mean NLL as a NumPy array (scalar when axis=None).

Raises:

ValueError – If any scale value is non-positive.

Examples

>>> import numpy as np
>>> laplace_nll(np.array([1.0, 2.0]), np.array([1.1, 1.9]), np.array([0.5, 0.5]))
array(0.2)
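
As with the Gaussian case, the value follows directly from the formula above. The sketch below is an illustrative re-derivation, not the twiga implementation:

```python
import numpy as np

def laplace_nll_sketch(true, loc, scale):
    """Mean Laplace NLL, term by term from the formula above (illustrative)."""
    return np.mean(np.log(2.0 * scale) + np.abs(true - loc) / scale)
```

Note that with \(b = 0.5\) the \(\log(2b)\) term vanishes, leaving only the scaled absolute error.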
twiga.core.metrics.stats.diebold_mariano_test(p_real, p_pred_1, p_pred_2, norm=1, version='multivariate', h=1)#

One-sided Diebold-Mariano (DM) test for equal predictive accuracy.

Tests the null hypothesis H0 that forecast p_pred_1 has equal or lower loss than forecast p_pred_2 against the one-sided alternative H1 that p_pred_2 is strictly more accurate. Rejecting H0 (small p-value) means p_pred_2 is significantly more accurate.

The loss differential is \(d_t = g(e_{1,t}) - g(e_{2,t})\) where \(g(e) = |e|^{norm}\) (norm=1 → MAE-based, norm=2 → MSE-based). The test statistic

\[DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{\text{HAC}} / T}} \xrightarrow{d} \mathcal{N}(0,1)\]

uses a Bartlett HAC variance estimator with bandwidth h - 1 to account for serial correlation in multi-step forecasts.

Parameters:
  • p_real (ndarray) – Observed values, shape (n_days, n_steps) or (T,).

  • p_pred_1 (ndarray) – Forecast 1 predictions, same shape as p_real.

  • p_pred_2 (ndarray) – Forecast 2 predictions, same shape as p_real.

  • norm (int) – Loss function norm. 1 for MAE-based, 2 for MSE-based. Defaults to 1.

  • version (str) –

    Test variant.

    • "multivariate": single test on the mean loss differential averaged across steps (default).

    • "univariate": independent test per time step (column), returns an array of p-values of length n_steps.

  • h (int) – Forecast horizon used to set the HAC bandwidth. Set to the number of steps ahead you are forecasting. Defaults to 1.

Return type:

float | ndarray

Returns:

One-sided p-value (float) for "multivariate", or 1-D array of p-values (one per step) for "univariate".

References

Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253–263.

Examples

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> real = rng.normal(0, 1, (100, 24))
>>> pred1 = real + rng.normal(0, 0.5, real.shape)
>>> pred2 = real + rng.normal(0, 0.3, real.shape)  # better forecast
>>> p = diebold_mariano_test(real, pred1, pred2, version="multivariate")
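
The statistic itself is short to compute. The sketch below applies the definitions above to a pair of 1-D error series (flattening the per-step structure) and is an illustration, not the twiga implementation:

```python
import numpy as np
from scipy import stats

def dm_pvalue_sketch(e1, e2, norm=1, h=1):
    """One-sided DM p-value from 1-D error series (illustrative)."""
    d = np.abs(e1) ** norm - np.abs(e2) ** norm   # loss differential d_t
    T = d.size
    dc = d - d.mean()
    # Bartlett HAC variance with bandwidth h - 1
    var = dc @ dc / T
    for k in range(1, h):
        var += 2.0 * (1.0 - k / h) * (dc[:-k] @ dc[k:]) / T
    dm = d.mean() / np.sqrt(var / T)
    # Reject H0 for large positive DM (model 2 has lower loss)
    return 1.0 - stats.norm.cdf(dm)
```

With h=1 the loop is skipped and the HAC estimator reduces to the ordinary sample variance of the loss differential.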
twiga.core.metrics.stats.giacomini_white_test(p_real, p_pred_1, p_pred_2, norm=1, version='multivariate')#

One-sided Giacomini-White (GW) test for Conditional Predictive Accuracy (CPA).

Tests the null hypothesis H0 that forecast p_pred_1 has equal or lower conditional expected loss than p_pred_2 against the one-sided alternative H1 that p_pred_2 is conditionally more accurate. Rejecting H0 (small p-value) means p_pred_2 is significantly more accurate given the available information set.

Unlike DM, the GW test conditions on lagged loss differentials, making it more powerful when one model systematically exploits information that the other misses. The instruments are \(h_t = [1, d_{t-1}]\), and the test statistic is \(T \cdot R^2 \sim \chi^2(q)\) where \(q\) is the number of instruments.

Parameters:
  • p_real (ndarray) – Observed values, shape (n_days, n_steps) or (T,).

  • p_pred_1 (ndarray) – Forecast 1 predictions, same shape as p_real.

  • p_pred_2 (ndarray) – Forecast 2 predictions, same shape as p_real.

  • norm (int) – Loss function norm. 1 for MAE-based, 2 for MSE-based. Defaults to 1.

  • version (str) –

    Test variant.

    • "multivariate": single test on the mean loss differential averaged across steps (default).

    • "univariate": independent test per time step, returns an array of p-values of length n_steps.

Return type:

float | ndarray

Returns:

One-sided p-value (float) for "multivariate", or 1-D array of p-values (one per step) for "univariate".

Raises:
  • ValueError – If input shapes do not match or are 1-D (use (T, 1) for a single-step series).

  • ValueError – If version is not "univariate" or "multivariate".

  • ValueError – If norm is not 1 or 2.

References

Giacomini, R., & White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6), 1545–1578.

Examples

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> real = rng.normal(0, 1, (100, 24))
>>> pred1 = real + rng.normal(0, 0.5, real.shape)
>>> pred2 = real + rng.normal(0, 0.3, real.shape)  # better forecast
>>> p = giacomini_white_test(real, pred1, pred2, version="multivariate")
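
The regression form of the statistic can be sketched as follows for a 1-D loss-differential series. This is an illustration of the \(T \cdot R^2 \sim \chi^2(q)\) construction with instruments \([1, d_{t-1}]\), not the twiga implementation; in particular, the sign adjustment that turns the CPA statistic into a one-sided test is omitted for brevity:

```python
import numpy as np
from scipy import stats

def gw_pvalue_sketch(e1, e2, norm=1):
    """Conditional predictive ability p-value (illustrative, two-sided)."""
    d = np.abs(e1) ** norm - np.abs(e2) ** norm      # loss differential d_t
    y = d[1:]
    H = np.column_stack([np.ones(y.size), d[:-1]])   # instruments [1, d_{t-1}]
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    resid = y - H @ beta
    r2 = 1.0 - (resid @ resid) / (y @ y)             # uncentered R^2
    stat = y.size * r2                               # T * R^2 ~ chi2(q)
    return 1.0 - stats.chi2.cdf(stat, df=H.shape[1])
```

If the loss differential is predictable from its own past (or has a nonzero mean), the regression fits well, the statistic grows with T, and the null of equal conditional predictive ability is rejected.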