Feature Analysis & Statistical Toolkit#

Source Files
  • twiga/core/data/relevance.py - AssociationAnalyzer class

  • twiga/core/data/selection.py - select_top_features, compute_rf_importance

  • twiga/core/stats/association.py - rank_correlation, compute_anova_f, compute_chi2

  • twiga/core/stats/autocorr.py - get_pacf_values, estimate_ar_order, calculate_durbin_watson

  • twiga/core/stats/entropy.py - entropy & complexity measures

  • twiga/core/stats/mutual_information.py - compute_mutual_information

  • twiga/core/stats/ppscore.py - compute_ppscore

  • twiga/core/stats/residual.py - cooks_distance, leverage_statistic, get_standardized_residual

  • twiga/core/stats/seasonality.py - get_acf_values, get_trend_seasonality

  • twiga/core/stats/stationarity.py - adf_test, kpss_test

  • twiga/core/stats/utils.py - get_optimal_lags, normalise_series, get_entropy_parameters

  • twiga/core/stats/xicorr.py - compute_xicorr

Twiga provides a comprehensive statistical analysis toolkit for understanding and preparing time series data before modelling. All functions are accessible through a flat import API:

from twiga.core.stats import rank_correlation, adf_test, get_acf_values, compute_xicorr
from twiga.core.data import AssociationAnalyzer, select_top_features, compute_rf_importance

Association Analysis#

AssociationAnalyzer#

The AssociationAnalyzer class provides a single interface to compute feature-target association using eight methods spanning linear, rank-based, information-theoretic, and model-based approaches. All methods return a tidy DataFrame with columns feature, target, and a score column, making results easy to visualise and compare.

from twiga.core.data import AssociationAnalyzer
import pandas as pd

features = ["temperature", "ghi", "hour_sin", "hour_cos", "wind_speed"]

# Single method
pearson_df = AssociationAnalyzer.compute(
    data=df,
    target_col="load_mw",
    variable_cols=features,
    method="pearson",
    task="regression",   # "regression" or "classification"
)

# All eight methods at once
all_df = pd.concat(
    AssociationAnalyzer.compute(df, "load_mw", features, method=m, task="regression")
    for m in ("pearson", "spearman", "kendall", "xicor", "pps", "mi", "anova", "chi2")
)

# Heatmap of scores
fig = AssociationAnalyzer.plot_heatmap(pearson_df, score_col="score")

Supported Methods#

| Method | Key | Strength |
|---|---|---|
| Pearson correlation | "pearson" | Fast, interpretable for linear relationships |
| Spearman rank correlation | "spearman" | Robust to outliers, monotonic relationships |
| Kendall tau | "kendall" | Small-sample stability, concordance-based |
| Xi-correlation (Chatterjee) | "xicor" | Detects arbitrary functional relationships |
| Predictive Power Score | "pps" | Asymmetric, model-based, captures non-linearity |
| Mutual Information | "mi" | Non-parametric, captures any statistical dependency |
| ANOVA F-score | "anova" | Regression - linear effect of continuous feature on target |
| Chi-squared | "chi2" | Classification - independence test for categorical/binned features |

Choosing a method

For energy forecasting with continuous features, Spearman is a safe default (robust, interpretable). Add xi-correlation or PPS when you suspect non-monotonic relationships (e.g. solar generation vs. temperature in high-irradiance regimes). Use MI to catch interactions that linear methods miss.
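
A quick way to see the difference: a parabolic relationship has near-zero Spearman correlation but a high xi score. A minimal sketch (the synthetic columns x and y are illustrative, not part of Twiga):

import numpy as np
import pandas as pd

from twiga.core.stats import compute_xicorr, rank_correlation

# Non-monotonic dependence: y = x^2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
demo = pd.DataFrame({"x": x, "y": x**2 + 0.05 * rng.standard_normal(1000)})

# Spearman sees almost nothing - the relationship is not monotonic
print(rank_correlation(demo, target_col="y", variable_cols=["x"], method="spearman"))

# Xi-correlation picks up the functional dependence
print(compute_xicorr(demo, target_col="y", variable_cols=["x"]))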


rank_correlation#

Computes pairwise rank-based correlation (Pearson, Spearman, or Kendall) between features and a target, returning a sorted DataFrame.

from twiga.core.stats import rank_correlation

scores = rank_correlation(
    data=df,
    target_col="load_mw",
    variable_cols=["temperature", "ghi", "wind_speed"],
    method="spearman",   # "pearson" | "spearman" | "kendall"
)
# Returns DataFrame with columns: target, feature, correlation,
# sorted by absolute correlation, descending

compute_anova_f#

Computes ANOVA F-scores and p-values between features and a target - f_regression for regression tasks, f_classif for classification.

from twiga.core.stats import compute_anova_f

f_scores, p_values = compute_anova_f(
    data=df,
    target="load_mw",
    features=feature_columns,
    task="regression",
    normalize=True,  # scale F-scores to [0, 1]
)
# Returns a tuple: (f_scores_df, p_values_df)

compute_chi2#

Computes Chi-squared statistics between (non-negative) features and a categorical target. Useful as a fast filter before applying heavier methods.

from twiga.core.stats import compute_chi2

scores = compute_chi2(
    data=df,
    target="category",
    features=feature_columns,
)
# Returns DataFrame with columns: feature, target, chi2, p_value

compute_xicorr#

Computes Chatterjee’s Xi-correlation - a non-parametric measure that is zero for independent variables and tends to one for functional (deterministic) relationships. Unlike Pearson/Spearman, it detects non-monotonic dependencies.

from twiga.core.stats import compute_xicorr

scores = compute_xicorr(
    data=df,
    target_col="load_mw",
    variable_cols=feature_columns,
)
# Returns DataFrame with columns: col1, col2, xicor, p_value - sorted by xicor descending

compute_mutual_information#

Estimates mutual information between features and a target using k-NN density estimation. Works for both regression and classification.

from twiga.core.stats import compute_mutual_information

scores = compute_mutual_information(
    data=df,
    target="load_mw",
    features=feature_columns,
    task="regression",
    threshold=0.01,   # optional: drop near-zero MI scores
    pivot=False,      # True → wide matrix; False → long DataFrame
)

compute_ppscore#

Computes the Predictive Power Score (PPS) for a single feature-target pair - an asymmetric, normalised score in [0, 1] that measures how well a feature predicts a target using a decision tree. A score of 0 means no predictive value; 1 means perfect prediction.

from twiga.core.stats import compute_ppscore

result = compute_ppscore(
    df,
    x="temperature",    # predictor column
    y="load_mw",        # target column
    catch_errors=True,  # return an error dict instead of raising on failure
)
# Returns a dict with the PPS and underlying metrics, e.g. {"ppscore": ..., "model_score": ...}

Feature Selection#

select_top_features#

Ensemble feature ranking that combines six association signals into a single ranked list using Borda count (or other aggregation strategies). Robust to the noise and bias of any individual metric.

from twiga.core.data.selection import select_top_features

top_features, scores_df = select_top_features(
    data=df,
    features=feature_columns,
    target="load_mw",
    task="regression",
    top_k=15,            # return the top-k ranked features
    alpha=0.05,          # p-value threshold (applies to ANOVA / chi-square only)
    return_scores=True,  # also return the per-method score columns
    rank_aggregation="borda_count",
    include_chi2=False,  # chi-square requires non-negative features; enable for classification
)

The six signals combined:

| Signal | Method |
|---|---|
| Rank correlation | Spearman (regression) / ANOVA (classification) |
| Mutual information | k-NN density estimation |
| Xi-correlation | Chatterjee’s ξ |
| Predictive Power Score | Decision-tree PPS |
| RF feature importance | Mean Decrease in Impurity |
| Chi-squared (optional) | χ² independence test |
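
For intuition, Borda-count rank aggregation can be sketched in a few lines of pandas. This illustrates the idea only - it is not Twiga's internal implementation, and the scores are made up:

import pandas as pd

# Per-method association scores for three features (illustrative values)
scores = pd.DataFrame(
    {
        "spearman": {"temperature": 0.80, "ghi": 0.60, "wind_speed": 0.10},
        "mi":       {"temperature": 0.50, "ghi": 0.70, "wind_speed": 0.20},
        "xicor":    {"temperature": 0.60, "ghi": 0.65, "wind_speed": 0.05},
    }
)

# Borda count: each method ranks the features; a better rank earns more points,
# and summing points across methods gives the ensemble ranking.
borda_points = scores.rank(ascending=True)  # highest score -> most points
ensemble = borda_points.sum(axis=1).sort_values(ascending=False)
print(ensemble)  # temperature and ghi outrank wind_speed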

compute_rf_importance#

Computes Random Forest mean-decrease-in-impurity (MDI) importance as a long-format DataFrame.

from twiga.core.data.selection import compute_rf_importance

importance_df = compute_rf_importance(
    data=df,
    target="load_mw",
    features=feature_columns,
    task="regression",   # "regression" or "classification"
    pivot=False,         # long format; the default (True) pivots features to the index
)
# Returns DataFrame with columns: feature, rf_importance (MDI scores sum to ~1.0)

Time Series Diagnostics#

Stationarity Tests#

from twiga.core.stats import adf_test, kpss_test

# Augmented Dickey-Fuller — H0: unit root present (non-stationary)
# Returns the raw statsmodels result tuple; the test statistic and p-value come first.
# With autolag set, statsmodels appends icbest, so index rather than unpack the tuple.
adf_result = adf_test(series, maxlag=None, regression="c", autolag="BIC")
stat, p_val = adf_result[0], adf_result[1]
print(f"ADF stat={stat:.4f}  p={p_val:.4f}")
# p < 0.05 → reject H0 → evidence of stationarity

# KPSS — H0: series IS stationary (level or trend)
# Returns: (stat, p_value, n_lags, critical_values)
kpss_result = kpss_test(series, regression="c", maxlag=None)
stat, p_val, n_lags, crit = kpss_result
print(f"KPSS stat={stat:.4f}  p={p_val:.4f}")
# p < 0.05 → reject H0 → evidence of non-stationarity

Both functions validate input length (≥ 10 observations) and compute sensible lag bounds automatically; kpss_test additionally suppresses the statsmodels InterpolationWarning by default.

Combining ADF and KPSS

Use both tests together for a stronger conclusion:

| ADF | KPSS | Conclusion |
|---|---|---|
| Reject H0 (p < 0.05) | Fail to reject H0 (p ≥ 0.05) | Stationary |
| Fail to reject H0 | Reject H0 | Non-stationary (unit root) |
| Reject H0 | Reject H0 | Trend-stationary or structural break |
| Fail to reject H0 | Fail to reject H0 | Inconclusive — more data needed |
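
The decision table is easy to wire into a small helper. A minimal sketch - the 0.05 threshold and the label strings are choices for illustration, not Twiga API:

from twiga.core.stats import adf_test, kpss_test

def classify_stationarity(series, alpha=0.05):
    """Combine ADF and KPSS p-values according to the table above."""
    adf_p = adf_test(series)[1]    # p-value is the second tuple element
    kpss_p = kpss_test(series)[1]
    adf_rejects = adf_p < alpha    # evidence of stationarity
    kpss_rejects = kpss_p < alpha  # evidence of non-stationarity
    if adf_rejects and not kpss_rejects:
        return "stationary"
    if not adf_rejects and kpss_rejects:
        return "non-stationary (unit root)"
    if adf_rejects and kpss_rejects:
        return "trend-stationary or structural break"
    return "inconclusive"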

Seasonal Decomposition (MSTL)#

get_trend_seasonality uses MSTL (Multiple Seasonal-Trend decomposition using Loess) to separate a time series into trend, one or more seasonal components, and a residual. It handles multiple simultaneous seasonalities (e.g. daily + weekly in a 30-minute series).

from twiga.core.stats import get_trend_seasonality

# Single seasonality — daily cycle at 30-min resolution (48 samples/day)
result = get_trend_seasonality(
    series,
    periods=[48],
    component_names=["Daily"],
)
# result.columns → ["Daily", "Trend", "Residual"]

# Multiple seasonalities — daily + weekly
result = get_trend_seasonality(
    series,
    periods=[48, 336],          # 336 = 48 * 7
    component_names=["Daily", "Weekly"],
)
# result.columns → ["Daily", "Weekly", "Trend", "Residual"]

The returned DataFrame has the same DatetimeIndex as the input series.
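
A useful follow-up is the strength-of-seasonality statistic F_s = max(0, 1 - Var(Residual) / Var(Seasonal + Residual)) (following Hyndman & Athanasopoulos), computed directly from the returned components. A sketch, reusing the single-seasonality result above:

# Strength of the daily seasonal component: 0 = none, close to 1 = strong
resid = result["Residual"]
f_daily = max(0.0, 1.0 - resid.var() / (result["Daily"] + resid).var())
print(f"Daily seasonal strength: {f_daily:.2f}")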

Period sizing

statsmodels silently drops any period that exceeds half the series length. Twiga logs a warning when this happens and still returns the remaining components, so decompositions never fail silently.

Tuning parameters:

| Parameter | Default | Effect |
|---|---|---|
| iterate | 3 | Outer MSTL iterations — more = smoother but slower |
| seasonal_deg | 0 | LOESS degree for seasonal smoother (0 = constant, 1 = linear) |
| inner_iter | 2 | STL inner loop iterations |
| outer_iter | 0 | STL robustness iterations (> 0 downweights outliers) |

ACF / PACF#

from twiga.core.stats import get_acf_values, get_pacf_values, estimate_ar_order

# Autocorrelation function
acf_vals, acf_confint = get_acf_values(series, maxlag=48, alpha=0.05)
# Returns a tuple of numpy arrays (acf_values, acf_confint), excluding lag 0

# Partial autocorrelation function
pacf_vals, pacf_confint = get_pacf_values(series, maxlag=48, alpha=0.05)
# Returns a tuple of numpy arrays (pacf_values, pacf_confint), excluding lag 0

# Suggest AR order from the PACF cutoff
sig_lags, p = estimate_ar_order(series, maxlag=48)

Durbin-Watson#

from twiga.core.stats import calculate_durbin_watson

dw = calculate_durbin_watson(residuals)
# Returns float in [0, 4]; ~2 means no autocorrelation
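
The statistic itself is DW = Σ(e_t - e_{t-1})² / Σ e_t², so a numpy one-liner serves as a sanity check against the Twiga result:

import numpy as np

dw_manual = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)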

Entropy & Complexity Measures#

The entropy module characterises time series complexity, which is useful for diagnosing predictability and selecting model types.

from twiga.core.stats import (
    get_permutation_entropy,
    get_approx_entropy,
    get_sample_entropy,
    get_hurst_exponent,
    get_dfa_exponent,
    get_entropy_parameters,
)

# Permutation entropy  -  fast, robust to noise
pe = get_permutation_entropy(series, order=3, delay=1, normalize=True)

# Approximate entropy  -  regularity measure
ae = get_approx_entropy(series, order=2, metric="chebyshev")

# Sample entropy  -  less biased than ApEn for short series
se = get_sample_entropy(series, order=2, metric="chebyshev")

# Hurst exponent  -  long-range dependence (0.5 = random walk, >0.5 = trending)
h = get_hurst_exponent(series)

# Detrended Fluctuation Analysis exponent
dfa = get_dfa_exponent(series)

# Auto-select appropriate entropy parameters for a given series and method
params = get_entropy_parameters(series, method="permutation")

| Measure | Range | Interpretation |
|---|---|---|
| Permutation entropy | [0, 1] | 0 = fully regular, 1 = maximally complex |
| Approximate entropy | ≥ 0 | Higher = less regular |
| Sample entropy | ≥ 0 | Higher = less regular, less biased than ApEn |
| Hurst exponent | (0, 1) | < 0.5 anti-persistent, 0.5 random, > 0.5 persistent |
| DFA exponent | ≥ 0 | Similar to Hurst; robust to non-stationarity |
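
Rather than hand-picking order and delay, get_entropy_parameters can feed the entropy functions directly. A sketch that assumes the returned dictionary keys match the entropy function's keyword arguments (see get_entropy_parameters in the API reference):

from twiga.core.stats import get_entropy_parameters, get_sample_entropy

# Recommend embedding/tolerance parameters from the series itself
params = get_entropy_parameters(series, method="sample")
se = get_sample_entropy(series, **params)  # assumes keys match the kwargs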


Residual Diagnostics#

After model fitting, residual diagnostics help identify influential observations and model misspecification.

from twiga.core.stats import (
    cooks_distance,
    leverage_statistic,
    get_standardized_residual,
)

# Leverage  -  how far each point is from the design-matrix centroid
lev = leverage_statistic(X)
# Returns np.ndarray of hat-matrix diagonals

# Standardised residuals
std_resid = get_standardized_residual(residuals)

# Cook's distance  -  influence of each observation on fitted coefficients
cd = cooks_distance(std_resid, lev, n_features=n_params)
# Returns np.ndarray; values > 4/n are typically flagged
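
Flagging influential observations from these arrays is then a one-liner (4/n is the rule of thumb noted above):

import numpy as np

influential_idx = np.flatnonzero(cd > 4 / len(cd))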

Utilities#

from twiga.core.stats import normalise_series, get_optimal_lags, prepare_data

# Standardise a series to zero mean / unit variance
norm = normalise_series(series)

# Select the optimal lag order via information criteria (AIC / BIC / HQIC)
lag_info = get_optimal_lags(series, max_lags=48)
# Returns a dict with the optimal lag per criterion plus diagnostics

# Prepare a feature matrix and target vector from a DataFrame
X, y = prepare_data(df, target="load_mw", features=feature_columns)

API Reference#

Feature selection & association#

class twiga.core.data.relevance.AssociationAnalyzer(method='spearman', task='regression', **kwargs)#

Bases: object

A unified interface for calculating and visualizing feature associations.

static compute(data, target_col, variable_cols, method='spearman', task='regression', **kwargs)#

Calculate association between target and features using the chosen method.

Parameters:
  • data (DataFrame) – Input DataFrame.

  • target_col (str) – The dependent variable.

  • variable_cols (list[str]) – List of independent variables.

  • method (Literal['pearson', 'spearman', 'kendall', 'xicor', 'pps', 'mi', 'anova', 'chi2']) – The statistical method to use.

  • task (Literal['regression', 'classification']) – Task type (regression/classification) for PPS, MI, and ANOVA.

  • **kwargs (Any) – Method-specific args:
      – pps: sample, cross_validation, random_seed
      – xicor: ties
      – mi: random_state, n_neighbors
      – rank: N/A (standard pandas implementation)

Return type:

DataFrame

Returns:

DataFrame with columns [‘target’, ‘feature’, ‘score’] and optionally [‘p_value’].

static plot_heatmap(assoc_df, score_col='score', target_col_name='target', feature_col_name='feature')#

Visualize association results using a heatmap.

Parameters:
  • assoc_df (DataFrame) – DataFrame returned by .compute().

  • score_col (str) – Column name to use for the heatmap values.

  • target_col_name (str) – Name of the column representing the Y axis (target).

  • feature_col_name (str) – Name of the column representing the X axis (feature).

twiga.core.data.selection.select_top_features(data, features, target, task='regression', top_k=5, alpha=0.05, rank_aggregation='borda_count', include_pps=True, include_xi=True, include_rf=True, include_chi2=True, random_state=42, return_scores=False)#

Selects top-k features using a robust ensemble of statistical metrics.

Combines linear, non-linear, and model-based relevance scores. Note: Alpha filtering applies only to ANOVA and Chi-square p-values.

Return type:

list[str] | tuple[list[str], DataFrame]

twiga.core.data.selection.compute_rf_importance(data, target, features, task='regression', n_estimators=50, random_state=42, n_jobs=-1, pivot=True)#

Compute Random Forest feature importance using SelectFromModel.

Uses mean impurity decrease (MDI) across trees, selected via a mean-importance threshold internally. MDI importance is a model-based measure - unlike statistical tests it captures non-linear interactions automatically.

Parameters:
  • data (DataFrame) – Input DataFrame.

  • target (str) – Target column.

  • features (str | list[str]) – Feature column(s).

  • task (str) – “regression” or “classification”.

  • n_estimators (int) – Number of trees in the forest. Defaults to 50.

  • random_state (int | None) – Random seed. Defaults to 42.

  • n_jobs (int) – Number of jobs for parallelisation. Defaults to -1 (all cores).

  • pivot (bool) – If True, pivot result with features as index. Defaults to True.

Return type:

DataFrame

Returns:

DataFrame with [“feature”, “rf_importance”] in long format, or pivoted with features as index.

Example

>>> df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
>>> compute_rf_importance(df, "y", ["x1", "x2"], task="classification")
twiga.core.stats.association.rank_correlation(data, target_col, variable_cols, method='spearman')#

Compute correlation between target and each variable.

Parameters:
  • data (DataFrame) – Input DataFrame.

  • target_col (str) – Name of target column.

  • variable_cols (list[str]) – List of feature column names.

  • method (str) – ‘pearson’, ‘kendall’, or ‘spearman’.

Return type:

DataFrame

Returns:

DataFrame with columns [‘target’, ‘feature’, ‘correlation’] sorted by absolute correlation descending.

twiga.core.stats.association.compute_anova_f(data, target, features, task='regression', normalize=False, pivot=False)#

Compute ANOVA F-scores and p-values for features.

Parameters:
  • data (DataFrame) – Input DataFrame.

  • target (str) – Target column name.

  • features (str | list[str]) – Feature column(s).

  • task (str) – “regression” (f_regression) or “classification” (f_classif).

  • normalize (bool) – If True, scale F-scores to [0, 1].

  • pivot (bool) – If True, return with features as index and target as column.

Return type:

tuple[DataFrame, DataFrame]

Returns:

Tuple of (f_scores_df, p_values_df).

twiga.core.stats.association.compute_chi2(data, target, features, pivot=False)#

Compute Chi-square statistic for categorical features.

Requires non-negative feature values (e.g., counts, binary flags).

Return type:

DataFrame

twiga.core.stats.xicorr.compute_xicorr(data, target_col, variable_cols, ties='auto')#

Compute Xi correlation between target and multiple features.

Parameters:
  • data (DataFrame) – Input DataFrame.

  • target_col (str) – Name of the dependent variable (y).

  • variable_cols (list[str]) – List of independent variables (x) to test against the target.

  • ties (bool | str) – Whether to account for ties in the target variable y. ‘auto’ is recommended (detects ties automatically).

Return type:

DataFrame

Returns:

DataFrame with columns [‘col1’, ‘col2’, ‘xicor’, ‘p_value’] sorted by xicor descending.

twiga.core.stats.mutual_information.compute_mutual_information(data, target, features, task='regression', threshold=None, random_state=200, pivot=True)#

Compute mutual information between features and target.

Parameters:
  • data (DataFrame) – Input DataFrame.

  • target (str) – Target column name.

  • features (str | list[str]) – Feature column(s).

  • task (str) – “regression” or “classification”.

  • threshold (float | None) – Minimum MI score to keep.

  • random_state (int | None) – Seed for reproducibility.

  • pivot (bool) – If True, return with features as index.

Return type:

DataFrame

Returns:

DataFrame containing MI scores.

twiga.core.stats.selection.compute_mutual_information(data, target, features, task='regression', threshold=None, random_state=200, pivot=True, n_jobs=-1)#

Compute mutual information in parallel per feature.

Parameters:
  • data (DataFrame) – Input DataFrame.

  • target (str) – Target column name.

  • features (str | list[str]) – Feature column(s).

  • task (Literal['regression', 'classification']) – “regression” or “classification”.

  • threshold (float | None) – Minimum MI score to keep.

  • random_state (int | None) – Random seed.

  • pivot (bool) – If True, pivot result.

  • n_jobs (int) – Parallel jobs (-1 = all cores).

Return type:

DataFrame

Returns:

DataFrame with [“target”, “feature”, “score”] or pivoted.

Example

>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [1, 2, 3]})
>>> compute_mutual_information(df, "y", "x", task="regression")
twiga.core.stats.ppscore.compute_ppscore(df, x, y, sample=5000, cross_validation=4, random_seed=123, catch_errors=True)#

Calculates the Predictive Power Score for a single (x, y) pair.

Parameters:
  • df (DataFrame) – The input pandas DataFrame.

  • x (str) – The name of the predictor (feature) column.

  • y (str) – The name of the target column to be predicted.

  • sample (int | None) – Max rows to sample. If None, uses all rows.

  • cross_validation (int) – Number of folds for cross-validation.

  • random_seed (int) – Seed for reproducibility.

  • catch_errors (bool) – If True, returns error dict instead of raising Exception.

Return type:

dict[str, Any]

Returns:

A dictionary containing the PPS and underlying metrics.

Lag & embedding selection#

twiga.core.stats.ami.get_ami_profile(series, n_neighbors=8, max_horizons=None, random_state=42)#

Compute the horizon-specific Auto-Mutual Information (AMI) profile.

AMI(h) = I(Δx_t ; Δx_{t+h}) is estimated via the k-nearest-neighbour (kNN) estimator of Kraskov et al. (2004) as implemented in sklearn.feature_selection.mutual_info_regression(). The kNN estimator is invariant to monotone rescaling, so no standardisation of the differenced series is applied.

The profile is computed on the first-differenced series to remove level persistence and trend. The raw (level) series is not modified by this function.

Parameters:
  • series (ndarray | Series) – 1-D target series as a numpy array or pandas Series. Must be free of NaN values and have length >= n_neighbors + 2 after differencing.

  • n_neighbors (int) – Number of nearest neighbours for the kNN MI estimator. Higher values reduce variance but increase bias. Defaults to 8, following the convention of Catt (2024).

  • max_horizons (int | None) – Maximum horizon h to include in the profile. When None the maximum feasible horizon n_diff - (n_neighbors + 1) is used. Provide an explicit value to cap the profile for long series and reduce runtime. Defaults to None.

  • random_state (int) – Random seed for the kNN MI estimator. Defaults to 42.

Return type:

tuple[ndarray, ndarray]

Returns:

Tuple (horizons, ami_values) where –

  • horizons is a 1-D integer array [1, 2, ..., max_h].

  • ami_values is a 1-D float array of non-negative AMI values (in nats) at each horizon, clipped to [0, inf).

Both arrays are empty when the differenced series is too short to compute at least one horizon.

Raises:

ValueError – If series is not 1-D after squeezing.

Examples

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> signal = np.sin(2 * np.pi * np.arange(500) / 48) + 0.1 * rng.standard_normal(500)
>>> horizons, ami = get_ami_profile(signal, max_horizons=96)
>>> print(f"AMI(h=1)={ami[0]:.3f}  peak_h={horizons[ami.argmax()]}")
twiga.core.stats.utils.get_optimal_lags(series, max_lags=None, min_data_points=30, period=None, lb_test=True)#

Select optimal autoregressive lags using multiple information criteria.

The selection process evaluates fitted autoregressive models and compares:

  • AIC (Akaike Information Criterion)

  • BIC (Bayesian Information Criterion)

  • HQIC (Hannan-Quinn Information Criterion)

Optionally, Ljung-Box residual diagnostics are also computed.

Parameters:
  • series (Series | ndarray) – Time-series data.

  • max_lags (int | None) – Maximum lag count to evaluate. If None, inferred from size.

  • min_data_points (int) – Minimum required series length.

  • period (str | None) – Optional seasonal period string (for example, "D" or "H").

  • lb_test (bool) – Whether to include Ljung-Box residual diagnostics.

Return type:

dict[str, Any]

Returns:

Dictionary containing optimal lag by criterion and diagnostics.

Raises:

ValueError – If the series is too short or no model can be fitted.
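
Example (a usage sketch - the series variable and the result keys shown are assumptions based on the criteria listed above):

>>> info = get_optimal_lags(series, max_lags=48)
>>> info["aic"], info["bic"], info["hqic"]  # optimal lag per criterion (assumed keys)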

twiga.core.stats.utils.get_optimal_embedding(period)#

Determine an embedding dimension from a pandas-style frequency period.

Parameters:

period (str) – Sampling period string (for example, "15min" or "1h").

Return type:

int

Returns:

Embedding dimension inferred from configured period buckets.

twiga.core.stats.utils.get_entropy_parameters(series, method, period=None, data_driven_delay=True)#

Recommend entropy embedding/tolerance parameters from theory-informed rules.

Rules implemented:

  • Sample/Approximate entropy: m in {2, 3}, r in [0.1*sigma, 0.25*sigma].

  • Permutation entropy: m in [3, 7] with sample-size constraint n >> m!.

  • Optional data-driven delay from autocorrelation first crossing.

Parameters:
  • series (Series | ndarray) – Input series values.

  • method (str) – One of "sample", "approximate", or "permutation".

  • period (str | Timedelta | None) – Optional frequency hint used as a weak prior for embedding.

  • data_driven_delay (bool) – Whether to estimate delay from autocorrelation.

Return type:

dict[str, float | int]

Returns:

Parameter dictionary suitable for entropy functions.

Raises:

ValueError – If method is unsupported or series is invalid.

twiga.core.stats.utils.get_sample_per_day(frequency)#

Calculate the number of samples per day from a sampling frequency.

Supports pandas frequency strings and unit-only abbreviations like h, min, and s.

Parameters:

frequency (str | Timedelta) – Sampling frequency as a string or pandas Timedelta.

Return type:

int

Returns:

Number of samples per day for the given frequency.

Raises:

ValueError – If the frequency cannot be parsed or is invalid.
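
Example (for instance, a 30-minute sampling frequency yields 48 samples per day):

>>> get_sample_per_day("30min")
48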

twiga.core.stats.utils.normalise_series(series)#

Normalize a series by centering and scaling to unit variance.

Parameters:

series (Series | ndarray) – Input time-series values.

Return type:

Series

Returns:

Normalized series values.

Raises:

ValueError – If the series is empty or has zero/invalid variance.

Stationarity & seasonality#

twiga.core.stats.stationarity.adf_test(values, maxlag=None, regression='c', autolag='BIC')#

Run the Augmented Dickey-Fuller (ADF) test.

Parameters:
  • values (Series | ndarray) – Input time series values.

  • maxlag (int | None) – Maximum lag order. If None, inferred from sample size.

  • regression (str) – Regression component (“c”, “ct”, “ctt”, or “n”).

  • autolag (str) – Method for automatic lag selection.

Return type:

tuple[Any, ...]

Returns:

Statsmodels ADF result tuple.

Raises:

ValueError – If input validation fails or test execution fails.

twiga.core.stats.stationarity.kpss_test(values, regression='c', maxlag=None, suppress_interpolation_warning=True)#

Run the KPSS stationarity test.

Parameters:
  • values (Series | ndarray) – Input time series values.

  • regression (str) – Regression component (“c” or “ct”).

  • maxlag (int | None) – Number of lags. If None, inferred from sample size.

  • suppress_interpolation_warning (bool) – Whether to silence boundary p-value interpolation warnings from statsmodels.

Return type:

tuple[Any, ...]

Returns:

Statsmodels KPSS result tuple.

Raises:

ValueError – If input validation fails or test execution fails.

twiga.core.stats.seasonality.get_acf_values(series, maxlag=None, alpha=0.05, seasonal_length=None, fft=True, adjusted=False, missing='none')#

Compute ACF values and confidence intervals excluding lag 0.

Parameters:
  • series (Series | ndarray) – Input time series.

  • maxlag (int | None) – Maximum lag to compute. If None, inferred from series length.

  • alpha (float) – Significance level for confidence intervals.

  • seasonal_length (int | None) – Optional seasonal period used to ensure enough lag coverage.

  • fft (bool) – Whether to compute ACF using FFT.

  • adjusted (bool) – Whether to use n-k denominator in autocovariance.

  • missing (str) – Missing-value handling mode for statsmodels ACF.

Return type:

tuple[ndarray, ndarray]

Returns:

Tuple of (acf_values, acf_confint), both excluding lag 0.

Raises:

ValueError – If ACF calculation fails.

twiga.core.stats.seasonality.get_trend_seasonality(series, periods, component_names=None, iterate=3, seasonal_deg=0, inner_iter=2, outer_iter=0)#

Decompose a time series into trend, seasonal components, and residuals using MSTL.

MSTL (Multiple Seasonal-Trend decomposition using Loess) extends STL to handle multiple seasonality periods simultaneously — e.g. daily (48) and weekly (48 * 7) patterns in a 30-minute series.

Parameters:
  • series (Series) – Input time series as a pandas Series with a DatetimeIndex. Must not contain NaN values.

  • periods (list[int]) – List of seasonal period lengths in number of observations. For example, [48] for a single daily cycle at 30-minute resolution, or [48, 336] for daily + weekly.

  • component_names (list[str] | None) – Names for the seasonal components, one per period. If None, defaults to ["Seasonal_<p>" for p in periods].

  • iterate (int) – Number of outer MSTL iterations. Defaults to 3.

  • seasonal_deg (int) – Degree of the seasonal LOESS smoother (0 = constant, 1 = linear). Defaults to 0.

  • inner_iter (int) – Number of inner loop iterations in STL. Defaults to 2.

  • outer_iter (int) – Number of outer (robustness) iterations in STL. Defaults to 0.

Return type:

DataFrame

Returns:

DataFrame with the same index as series containing one column per seasonal component (named from component_names), plus "Trend" and "Residual" columns.

Raises:
  • ValueError – If len(periods) != len(component_names) when component_names is provided.

  • ValueError – If series contains NaN values.

  • ValueError – If any period is less than 2.

Examples

>>> import pandas as pd, numpy as np
>>> idx = pd.date_range("2024-01-01", periods=1000, freq="30min")
>>> ts = pd.Series(np.sin(np.linspace(0, 40 * np.pi, 1000)), index=idx)
>>> result = get_trend_seasonality(ts, periods=[48], component_names=["Daily"])
>>> result.columns.tolist()
['Daily', 'Trend', 'Residual']

twiga.core.stats.autocorr.get_pacf_values(series, maxlag, alpha=0.05, method='ywm')#

Calculate PACF values and confidence intervals.

Computes the Partial Autocorrelation Function using the Yule-Walker method and removes lag 0 (which is always 1.0) for practical use.

Parameters:
  • series – Time series data as numpy array or pandas Series.

  • maxlag – Maximum lag order to calculate.

  • alpha – Significance level for confidence intervals. Default is 0.05 for 95% confidence.

  • method – PACF estimation method. Default is ‘ywm’ (Yule-Walker). Other options include ‘ols’, ‘ywunbiased’, ‘ld’.

Returns:

Tuple of (pacf_values, pacf_confint) where –

  • pacf_values: PACF coefficients excluding lag 0 (numpy array)

  • pacf_confint: Confidence intervals for each lag (numpy array of tuples)

Examples

>>> import numpy as np
>>> series = np.random.randn(100)
>>> pacf_vals, conf_int = get_pacf_values(series, maxlag=10)
>>> print(f"PACF at lag 1: {pacf_vals[0]:.4f}")
twiga.core.stats.autocorr.estimate_ar_order(series, maxlag, alpha=0.05, method='ywm')#

Estimate optimal AR model order using PACF analysis.

Implements the Box-Jenkins methodology for AR order selection by analyzing the Partial Autocorrelation Function cutoff pattern. An AR(p) process should have significant PACF values up to lag p, then cut off to zero.

Parameters:
  • series (ndarray | Series) – Time series data as numpy array or pandas Series.

  • maxlag (int) – Maximum lag order to evaluate.

  • alpha (float) – Significance level for confidence intervals. Default is 0.05 for 95% confidence bands.

  • method (str) – PACF estimation method. Default is ‘ywm’ (Yule-Walker).

Returns:

Tuple of (significant_lags, suggested_ar_order)

  • significant_lags: List of integer lag values where PACF is statistically significant (outside confidence interval)

  • suggested_ar_order: Integer AR order where PACF cuts off. Represents the last consecutive significant lag.

Examples

>>> import numpy as np
>>> # Simulate an AR(2) process: x_t = 0.6 x_{t-1} - 0.3 x_{t-2} + e_t
>>> rng = np.random.default_rng(0)
>>> e = rng.standard_normal(200)
>>> x = np.zeros(200)
>>> for t in range(2, 200):
...     x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + e[t]
>>> sig_lags, ar_order = estimate_ar_order(x, maxlag=10)
>>> print(f"Suggested AR order: {ar_order}")
>>> print(f"All significant lags: {sig_lags}")

Note

If PACF shows non-consecutive significant lags (e.g., lags 1, 2, 5 are significant but 3, 4 are not), this suggests a more complex process (ARMA or seasonal) rather than pure AR.

Complexity & entropy#

twiga.core.stats.entropy.get_permutation_entropy(series, order=None, delay=1, normalize=True)#

Calculate permutation entropy for a time series.

Permutation entropy (PE) measures the complexity of a time series by analyzing the order relations between values. It captures the diversity of patterns in the data, with higher values indicating more complex and less predictable behavior.

Parameters:
  • series (ndarray | Series) – One-dimensional time series data.

  • order (int | None) – Embedding dimension (pattern length). Determines the length of sequences to analyze. If None, defaults to 3.

  • delay (int) – Time delay between elements in the sequence. Default is 1.

  • normalize (bool) – Whether to normalize the entropy value to [0, 1]. Default is True.

Return type:

float

Returns:

Permutation entropy value. Returns np.nan if calculation fails.

Examples

>>> import numpy as np
>>> regular_series = np.array([1, 2, 3, 1, 2, 3])
>>> get_permutation_entropy(regular_series)
0.123  # Low entropy - regular pattern
>>> random_series = np.random.rand(100)
>>> get_permutation_entropy(random_series)
0.876  # High entropy - random pattern

Notes

The choice of order and delay can significantly affect the PE value. Common practice is to use order=3 and delay=1 for many applications.

twiga.core.stats.entropy.get_hurst_exponent(series)#

Calculate Hurst exponent for a time series.

The Hurst exponent (H) measures the long-term memory of a time series. It quantifies the tendency of a time series to either regress strongly to the mean (H < 0.5), behave like a random walk (H ≈ 0.5), or exhibit long-term positive autocorrelation (H > 0.5).

Parameters:

series (ndarray | Series) – One-dimensional time series data.

Return type:

float

Returns:

Hurst exponent value. Returns np.nan if calculation fails.

Examples

>>> import numpy as np
>>> random_series = np.random.rand(100)
>>> get_hurst_exponent(random_series)
0.5  # Random walk - no long-term memory

Residual diagnostics#

twiga.core.stats.residual.cooks_distance(standardized_residuals, leverage, n_features)#

Compute Cook’s distance influence of each observation.

Parameters:
  • standardized_residuals (ndarray) – Standardised residual for each observation.

  • leverage (ndarray) – Hat-matrix diagonal (leverage) for each observation.

  • n_features (int) – Number of predictors p in the model.

Return type:

ndarray

Returns:

Cook’s distance \(D_i\) for each observation.

Notes

Cook’s distance measures how much the regression coefficients would change if the \(i\)-th point were deleted:

\[D_i = \frac{\tilde{r}_i^2}{p + 1} \cdot \frac{h_{ii}}{1 - h_{ii}}\]

where \(p\) is the number of predictors.

Rule of thumb:

  • \(D_i > 1\): highly influential

  • \(D_i > 4/n\): potentially problematic

Examples

>>> import numpy as np
>>> std_res = np.array([2.0, 0.5, 3.0])
>>> lev = np.array([0.8, 0.3, 0.9])
>>> cooks_distance(std_res, lev, n_features=2)
array([5.333..., 0.0357..., 27.0])

Next: Data Pipeline | Metrics | Visualization