Feature Analysis & Statistical Toolkit#
Source Files
- twiga/core/data/relevance.py - AssociationAnalyzer class
- twiga/core/data/selection.py - select_top_features, compute_rf_importance
- twiga/core/stats/association.py - rank_correlation, compute_anova_f, compute_chi2
- twiga/core/stats/autocorr.py - get_pacf_values, estimate_ar_order, calculate_durbin_watson
- twiga/core/stats/entropy.py - entropy & complexity measures
- twiga/core/stats/mutual_information.py - compute_mutual_information
- twiga/core/stats/ppscore.py - compute_ppscore
- twiga/core/stats/residual.py - cooks_distance, leverage_statistic, get_standardized_residual
- twiga/core/stats/seasonality.py - get_acf_values, get_trend_seasonality
- twiga/core/stats/stationarity.py - adf_test, kpss_test
- twiga/core/stats/utils.py - get_optimal_lags, normalise_series, get_entropy_parameters
- twiga/core/stats/xicorr.py - compute_xicorr
Twiga provides a comprehensive statistical analysis toolkit for understanding and preparing time series data before modelling. All functions are accessible through a flat import API:
from twiga.core.stats import rank_correlation, adf_test, get_acf_values, compute_xicorr
from twiga.core.data import AssociationAnalyzer, select_top_features, compute_rf_importance
Association Analysis#
AssociationAnalyzer#
The AssociationAnalyzer class provides a single interface to compute feature-target association using eight methods spanning linear, rank-based, information-theoretic, and model-based approaches. All methods return a tidy DataFrame with columns feature, target, and a score column, making results easy to visualise and compare.
from twiga.core.data import AssociationAnalyzer
analyzer = AssociationAnalyzer(
    data=df,
    features=["temperature", "ghi", "hour_sin", "hour_cos", "wind_speed"],
    target="load_mw",
    task="regression",  # "regression" or "classification"
)
# Single method
pearson_df = analyzer.compute(method="pearson")
# All eight methods at once
all_df = analyzer.compute_all()
# Heatmap of scores
fig = analyzer.plot_heatmap(method="spearman", top_k=20)
Supported Methods#
| Method | Key | Strength |
|---|---|---|
| Pearson correlation | pearson | Fast, interpretable for linear relationships |
| Spearman rank correlation | spearman | Robust to outliers, monotonic relationships |
| Kendall tau | kendall | Small-sample stability, concordance-based |
| Xi-correlation (Chatterjee) | xicor | Detects arbitrary functional relationships |
| Predictive Power Score | pps | Asymmetric, model-based, captures non-linearity |
| Mutual Information | mi | Non-parametric, captures any statistical dependency |
| ANOVA F-score | anova | Regression - linear effect of continuous feature on target |
| Chi-squared | chi2 | Classification - independence test for categorical/binned features |
Choosing a method
For energy forecasting with continuous features, Spearman is a safe default (robust, interpretable). Add xi-correlation or PPS when you suspect non-monotonic relationships (e.g. solar generation vs. temperature in high-irradiance regimes). Use MI to catch interactions that linear methods miss.
rank_correlation#
Computes pairwise rank-based correlation (Pearson, Spearman, or Kendall) between features and a target, returning a sorted DataFrame.
from twiga.core.stats import rank_correlation
scores = rank_correlation(
    data=df,
    target="load_mw",
    features=["temperature", "ghi", "wind_speed"],
    method="spearman",  # "pearson" | "spearman" | "kendall"
    threshold=0.2,      # optional: drop features below this absolute score
)
# Returns DataFrame with columns: feature, target, correlation - sorted descending
compute_anova_f#
Computes the ANOVA F-score between continuous features and a target for regression tasks.
from twiga.core.stats import compute_anova_f
scores = compute_anova_f(
    data=df,
    target="load_mw",
    features=feature_columns,
    task="regression",
    normalize=True,  # normalise scores to [0, 1]
)
compute_chi2#
Computes Chi-squared statistics between (non-negative) features and a categorical target. Useful as a fast filter before applying heavier methods.
from twiga.core.stats import compute_chi2
scores = compute_chi2(
    data=df,
    target="category",
    features=feature_columns,
)
# Returns DataFrame with columns: feature, target, chi2, p_value
compute_xicorr#
Computes Chatterjee’s Xi-correlation - a non-parametric measure that is zero for independent variables and tends to one for functional (deterministic) relationships. Unlike Pearson/Spearman, it detects non-monotonic dependencies.
from twiga.core.stats import compute_xicorr
scores = compute_xicorr(
    data=df,
    target="load_mw",
    features=feature_columns,
)
# Returns DataFrame sorted descending by xi score
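Chatterjee's estimator is simple enough to sketch directly. The following is an illustrative numpy implementation of the no-ties formula, not twiga's internal code: sort the pairs by x, rank the reordered y values, and sum the jumps between consecutive ranks. A functional relationship keeps consecutive ranks close, pushing the score toward one.

```python
import numpy as np

def xi_sketch(x: np.ndarray, y: np.ndarray) -> float:
    """Illustrative Chatterjee xi for continuous data (no-ties formula).

    xi tends to 1 when y is a deterministic function of x and to 0
    under independence, even for non-monotonic relationships.
    """
    n = len(x)
    order = np.argsort(x, kind="stable")       # arrange pairs by x
    y_sorted = y[order]
    r = np.argsort(np.argsort(y_sorted)) + 1   # rank of each y among all y
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

x = np.linspace(-3, 3, 500)
y = x**2  # non-monotonic but fully functional
print(f"xi = {xi_sketch(x, y):.3f}")  # well above 0.9, while Spearman here is ~0
```

This is exactly the case where rank correlations fail: the parabola is symmetric, so Spearman sees no monotone trend, but xi still detects the functional dependence.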
compute_mutual_information#
Estimates mutual information between features and a target using k-NN density estimation. Works for both regression and classification.
from twiga.core.stats import compute_mutual_information
scores = compute_mutual_information(
    data=df,
    target="load_mw",
    features=feature_columns,
    task="regression",
    threshold=0.01,  # optional: drop near-zero MI scores
    pivot=False,     # True → wide matrix; False → long DataFrame
)
compute_ppscore#
Computes the Predictive Power Score (PPS) - an asymmetric, normalised score in [0, 1] that measures how well a feature predicts a target using a decision tree. A score of 0 means no predictive value; 1 means perfect prediction.
from twiga.core.stats import compute_ppscore
scores = compute_ppscore(
    data=df,
    target="load_mw",
    features=feature_columns,
    catch_errors=True,  # return error dict instead of raising on failure
)
# Returns a dict: {feature_name: {"ppscore": float, "model_score": float, ...}}
Feature Selection#
select_top_features#
Ensemble feature ranking that combines six association signals into a single ranked list using Borda count (or other aggregation strategies). Robust to the noise and bias of any individual metric.
from twiga.core.data.selection import select_top_features
top_features = select_top_features(
    data=df,
    features=feature_columns,
    target="load_mw",
    task="regression",
    top_k=15,                  # return the top-k ranked features
    alpha=0.05,                # optional p-value threshold (spearman/anova)
    return_scores=True,        # include per-method score columns in output
    rank_method="borda_rank",  # "borda_rank" | "geom_rank" | "arith_rank" | "med_rank" | "sum_rank"
    include_chi2=False,        # set True for classification tasks
)
The six signals combined:
| Signal | Method |
|---|---|
| Rank correlation | Spearman (regression) / ANOVA (classification) |
| Mutual information | k-NN density estimation |
| Xi-correlation | Chatterjee's ξ |
| Predictive Power Score | Decision-tree PPS |
| RF feature importance | Mean Decrease in Impurity |
| Chi-squared (optional) | χ² independence test |
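Borda-count aggregation itself takes only a few lines of plain Python. This is an illustrative toy, not twiga's internals, and the function name and example rankings are hypothetical: each method's ranking awards points by position, and the point totals decide the final order, so no single noisy metric can dominate.

```python
from collections import defaultdict

def borda_aggregate(rankings: dict) -> list:
    """Combine per-method feature rankings with a Borda count.

    In a list of k features, the top feature earns k - 1 points,
    the next k - 2, and so on; features are sorted by total points.
    """
    points = defaultdict(int)
    for ranked in rankings.values():
        k = len(ranked)
        for position, feature in enumerate(ranked):
            points[feature] += k - 1 - position
    return sorted(points, key=lambda f: points[f], reverse=True)

rankings = {
    "spearman": ["ghi", "temperature", "hour_sin", "wind_speed"],
    "mi":       ["ghi", "hour_sin", "temperature", "wind_speed"],
    "xicor":    ["temperature", "ghi", "hour_sin", "wind_speed"],
}
print(borda_aggregate(rankings))  # "ghi" first: top-ranked by 2 of 3 methods
```

Here "ghi" wins despite xi-correlation ranking it second, because two of the three methods place it first; this robustness to individual-metric disagreement is the point of the ensemble.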
compute_rf_importance#
Computes Random Forest mean-decrease-in-impurity (MDI) importance as a long-format DataFrame.
from twiga.core.data.selection import compute_rf_importance
importance_df = compute_rf_importance(
    data=df,
    features=feature_columns,
    target="load_mw",
    task="regression",  # "regression" or "classification"
)
# Returns DataFrame with columns: feature, importance (sums to ~1.0)
Time Series Diagnostics#
Stationarity Tests#
from twiga.core.stats import adf_test, kpss_test
# Augmented Dickey-Fuller — H0: unit root present (non-stationary)
# Returns the raw statsmodels result tuple: (stat, p_value, n_lags, n_obs, critical_values)
adf_result = adf_test(series, maxlag=None, regression="c", autolag="BIC")
stat, p_val, n_lags, n_obs, crit = adf_result
print(f"ADF stat={stat:.4f} p={p_val:.4f}")
# p < 0.05 → reject H0 → evidence of stationarity
# KPSS — H0: series IS stationary (level or trend)
# Returns: (stat, p_value, n_lags, critical_values)
kpss_result = kpss_test(series, regression="c", maxlag=None)
stat, p_val, n_lags, crit = kpss_result
print(f"KPSS stat={stat:.4f} p={p_val:.4f}")
# p < 0.05 → reject H0 → evidence of non-stationarity
Both functions validate input length (≥ 10 observations), compute sensible lag bounds automatically, and suppress statsmodels InterpolationWarning by default.
Combining ADF and KPSS
Use both tests together for a stronger conclusion:
| ADF | KPSS | Conclusion |
|---|---|---|
| Reject H0 (p < 0.05) | Fail to reject H0 (p ≥ 0.05) | Stationary |
| Fail to reject H0 | Reject H0 | Non-stationary (unit root) |
| Reject H0 | Reject H0 | Trend-stationary or structural break |
| Fail to reject H0 | Fail to reject H0 | Inconclusive — more data needed |
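The decision table can be wrapped in a small helper. The function below is our own sketch, not part of twiga; it simply maps the two p-values to the table's conclusions, taking care with the opposite null hypotheses of the two tests.

```python
def stationarity_verdict(adf_p: float, kpss_p: float, alpha: float = 0.05) -> str:
    """Combine ADF and KPSS p-values into a single verdict.

    ADF rejects (p < alpha) when there is evidence *against* a unit
    root; KPSS rejects (p < alpha) when there is evidence *against*
    stationarity - note the reversed null hypotheses.
    """
    adf_rejects = adf_p < alpha    # evidence of stationarity
    kpss_rejects = kpss_p < alpha  # evidence of non-stationarity
    if adf_rejects and not kpss_rejects:
        return "stationary"
    if not adf_rejects and kpss_rejects:
        return "non-stationary (unit root)"
    if adf_rejects and kpss_rejects:
        return "trend-stationary or structural break"
    return "inconclusive - more data needed"

print(stationarity_verdict(adf_p=0.01, kpss_p=0.30))  # stationary
```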
Seasonal Decomposition (MSTL)#
get_trend_seasonality uses MSTL (Multiple Seasonal-Trend decomposition using Loess) to separate a time series into trend, one or more seasonal components, and a residual. It handles multiple simultaneous seasonalities (e.g. daily + weekly in a 30-minute series).
from twiga.core.stats import get_trend_seasonality
# Single seasonality — daily cycle at 30-min resolution (48 samples/day)
result = get_trend_seasonality(
    series,
    periods=[48],
    component_names=["Daily"],
)
# result.columns → ["Daily", "Trend", "Residual"]
# Multiple seasonalities — daily + weekly
result = get_trend_seasonality(
    series,
    periods=[48, 336],  # 336 = 48 * 7
    component_names=["Daily", "Weekly"],
)
# result.columns → ["Daily", "Weekly", "Trend", "Residual"]
The returned DataFrame has the same DatetimeIndex as the input series.
Period sizing
statsmodels silently drops any period that exceeds half the series length. Twiga logs a warning when this happens and still returns the remaining components, so decompositions never fail silently.
Tuning parameters:
| Parameter | Default | Effect |
|---|---|---|
| iterate | 3 | Outer MSTL iterations — more = smoother but slower |
| seasonal_deg | 0 | LOESS degree for seasonal smoother (0 = constant, 1 = linear) |
| inner_iter | 2 | STL inner loop iterations |
| outer_iter | 0 | STL robustness iterations (> 0 downweights outliers) |
ACF / PACF#
from twiga.core.stats import get_acf_values, get_pacf_values, estimate_ar_order
# Autocorrelation function
acf_df = get_acf_values(series, max_lag=48, alpha=0.05)
# Returns DataFrame: lag, acf, lower_ci, upper_ci
# Partial autocorrelation function
pacf_df = get_pacf_values(series, max_lag=48, alpha=0.05)
# Returns DataFrame: lag, pacf, lower_ci, upper_ci
# Suggest AR order from PACF
p = estimate_ar_order(series, max_lag=48)
Durbin-Watson#
from twiga.core.stats import calculate_durbin_watson
dw = calculate_durbin_watson(residuals)
# Returns float in [0, 4]; ~2 means no autocorrelation
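The statistic is simple to compute by hand. This numpy sketch, not twiga's implementation, shows the formula DW = Σ(e_t - e_{t-1})² / Σe_t², which is approximately 2(1 - ρ₁) for lag-1 autocorrelation ρ₁.

```python
import numpy as np

def durbin_watson_sketch(residuals) -> float:
    """DW = sum of squared successive differences over sum of squares.

    ~2 means no lag-1 autocorrelation; values toward 0 indicate
    positive autocorrelation, toward 4 negative autocorrelation.
    """
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e**2))

rng = np.random.default_rng(42)
white = rng.standard_normal(1000)
print(f"white-noise DW = {durbin_watson_sketch(white):.2f}")  # close to 2
```

A random walk (strongly positively autocorrelated residuals) drives the statistic toward 0, which is why DW well below 2 on model residuals signals structure the model has not captured.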
Entropy & Complexity Measures#
The entropy module characterises time series complexity, which is useful for diagnosing predictability and selecting model types.
from twiga.core.stats import (
    get_permutation_entropy,
    get_approx_entropy,
    get_sample_entropy,
    get_hurst_exponent,
    get_dfa_exponent,
    get_entropy_parameters,
)
# Permutation entropy - fast, robust to noise
pe = get_permutation_entropy(series, order=3, delay=1, normalize=True)
# Approximate entropy - regularity measure
ae = get_approx_entropy(series, order=2, metric="chebyshev")
# Sample entropy - less biased than ApEn for short series
se = get_sample_entropy(series, order=2, metric="chebyshev")
# Hurst exponent - long-range dependence (0.5 = random walk, >0.5 = trending)
h = get_hurst_exponent(series)
# Detrended Fluctuation Analysis exponent
dfa = get_dfa_exponent(series)
# Auto-select appropriate entropy parameters for a given series
params = get_entropy_parameters(series)
| Measure | Range | Interpretation |
|---|---|---|
| Permutation entropy | [0, 1] | 0 = fully regular, 1 = maximally complex |
| Approximate entropy | ≥ 0 | Higher = less regular |
| Sample entropy | ≥ 0 | Higher = less regular, less biased than ApEn |
| Hurst exponent | (0, 1) | < 0.5 anti-persistent, 0.5 random, > 0.5 persistent |
| DFA exponent | ≥ 0 | Similar to Hurst; robust to non-stationarity |
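To make the permutation-entropy row concrete, here is a minimal numpy sketch (illustrative only, not twiga's implementation): each window of `order` points is reduced to the permutation that sorts it, and the Shannon entropy of the pattern distribution is normalised by log(order!).

```python
import math
from collections import Counter
import numpy as np

def permutation_entropy_sketch(series, order=3, delay=1, normalize=True):
    """Shannon entropy of the distribution of ordinal patterns.

    Regular signals reuse a handful of patterns (entropy near 0);
    noisy signals spread mass over all order! patterns (entropy near 1
    when normalised).
    """
    x = np.asarray(series, dtype=float)
    n = len(x) - (order - 1) * delay
    patterns = Counter(
        tuple(np.argsort(x[i : i + order * delay : delay])) for i in range(n)
    )
    probs = np.array(list(patterns.values()), dtype=float) / n
    h = float(-np.sum(probs * np.log(probs)))
    return h / math.log(math.factorial(order)) if normalize else h

rng = np.random.default_rng(0)
regular = np.sin(np.linspace(0, 20 * np.pi, 500))  # smooth, few patterns
noisy = rng.standard_normal(500)                   # all patterns equally likely
print(permutation_entropy_sketch(regular) < permutation_entropy_sketch(noisy))  # True
```

The sine wave spends almost all its time in monotone windows (two patterns), so its normalised entropy sits well below that of white noise, which approaches 1.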
Residual Diagnostics#
After model fitting, residual diagnostics help identify influential observations and model misspecification.
from twiga.core.stats import (
    cooks_distance,
    leverage_statistic,
    get_standardized_residual,
)
# Cook's distance - influence of each observation on fitted coefficients
cd = cooks_distance(fitted_values, residuals, n_params)
# Returns np.ndarray; values > 4/n are typically flagged
# Leverage - how far each point is from the design-matrix centroid
lev = leverage_statistic(X)
# Returns np.ndarray of hat-matrix diagonals
# Standardised residuals
std_resid = get_standardized_residual(residuals)
Utilities#
from twiga.core.stats import normalise_series, get_optimal_lags, prepare_data
# Normalise a series to [0, 1] or standardise to zero mean / unit variance
norm = normalise_series(series, method="minmax") # "minmax" | "zscore"
# Estimate optimal lag order (AIC/BIC information criterion)
lag = get_optimal_lags(series, max_lag=48, criterion="aic")
# Prepare a feature matrix and target vector from a DataFrame
X, y = prepare_data(df, target="load_mw", features=feature_columns)
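For reference, the two normalisation modes reduce to a few lines of numpy. This sketch is illustrative and mirrors the minmax/zscore behaviour shown above; it is not twiga's implementation, and the zero-range guard is our assumption.

```python
import numpy as np

def normalise_sketch(values, method="minmax"):
    """Min-max scale to [0, 1] or standardise to zero mean / unit variance."""
    x = np.asarray(values, dtype=float)
    if method == "minmax":
        span = x.max() - x.min()
        if span == 0:
            raise ValueError("series has zero range")
        return (x - x.min()) / span
    if method == "zscore":
        sd = x.std()
        if sd == 0:
            raise ValueError("series has zero variance")
        return (x - x.mean()) / sd
    raise ValueError(f"unknown method: {method}")

x = np.array([10.0, 20.0, 30.0])
print(normalise_sketch(x))  # [0.  0.5 1. ]
```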
API Reference#
Feature selection & association#
- class twiga.core.data.relevance.AssociationAnalyzer(method='spearman', task='regression', **kwargs)#
Bases: object
A unified interface for calculating and visualizing feature associations.
- static compute(data, target_col, variable_cols, method='spearman', task='regression', **kwargs)#
Calculate association between target and features using the chosen method.
- Parameters:
data (DataFrame) – Input DataFrame.
target_col (str) – The dependent variable.
method (Literal['pearson', 'spearman', 'kendall', 'xicor', 'pps', 'mi', 'anova', 'chi2']) – The statistical method to use.
task (Literal['regression', 'classification']) – Task type (regression/classification) for PPS, MI, and ANOVA.
**kwargs (Any) – Method-specific args: pps: sample, cross_validation, random_seed; xicor: ties; mi: random_state, n_neighbors; rank: N/A (standard pandas implementation).
- Return type:
- Returns:
DataFrame with columns [‘target’, ‘feature’, ‘score’] and optionally [‘p_value’].
- static plot_heatmap(assoc_df, score_col='score', target_col_name='target', feature_col_name='feature')#
Visualize association results using a heatmap.
- twiga.core.data.selection.select_top_features(data, features, target, task='regression', top_k=5, alpha=0.05, rank_aggregation='borda_count', include_pps=True, include_xi=True, include_rf=True, include_chi2=True, random_state=42, return_scores=False)#
Selects top-k features using a robust ensemble of statistical metrics.
Combines linear, non-linear, and model-based relevance scores. Note: Alpha filtering applies only to ANOVA and Chi-square p-values.
- twiga.core.data.selection.compute_rf_importance(data, target, features, task='regression', n_estimators=50, random_state=42, n_jobs=-1, pivot=True)#
Compute Random Forest feature importance using SelectFromModel.
Uses mean impurity decrease (MDI) across trees, selected via a mean-importance threshold internally. MDI importance is a model-based measure - unlike statistical tests it captures non-linear interactions automatically.
- Parameters:
data (DataFrame) – Input DataFrame.
target (str) – Target column.
task (str) – “regression” or “classification”.
n_estimators (int) – Number of trees in the forest. Defaults to 50.
n_jobs (int) – Number of jobs for parallelisation. Defaults to -1 (all cores).
pivot (bool) – If True, pivot result with features as index. Defaults to True.
- Return type:
- Returns:
DataFrame with [“feature”, “rf_importance”] in long format, or pivoted with features as index.
Example
>>> df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
>>> compute_rf_importance(df, "y", ["x1", "x2"], task="classification")
- twiga.core.stats.association.rank_correlation(data, target_col, variable_cols, method='spearman')#
Compute correlation between target and each variable.
- Parameters:
- Return type:
- Returns:
DataFrame with columns [‘target’, ‘feature’, ‘correlation’] sorted by absolute correlation descending.
- twiga.core.stats.association.compute_anova_f(data, target, features, task='regression', normalize=False, pivot=False)#
Compute ANOVA F-scores and p-values for features.
- Parameters:
- Return type:
- Returns:
Tuple of (f_scores_df, p_values_df).
- twiga.core.stats.association.compute_chi2(data, target, features, pivot=False)#
Compute Chi-square statistic for categorical features.
Requires non-negative feature values (e.g., counts, binary flags).
- Return type:
- twiga.core.stats.xicorr.compute_xicorr(data, target_col, variable_cols, ties='auto')#
Compute Xi correlation between target and multiple features.
- Parameters:
data (DataFrame) – Input DataFrame.
target_col (str) – Name of the dependent variable (y).
variable_cols (list[str]) – List of independent variables (x) to test against the target.
ties (bool | str) – Whether to account for ties in the target variable y. ‘auto’ is recommended (detects ties automatically).
- Return type:
- Returns:
DataFrame with columns [‘col1’, ‘col2’, ‘xicor’, ‘p_value’] sorted by xicor descending.
- twiga.core.stats.mutual_information.compute_mutual_information(data, target, features, task='regression', threshold=None, random_state=200, pivot=True)#
Compute mutual information between features and target.
- twiga.core.stats.selection.compute_mutual_information(data, target, features, task='regression', threshold=None, random_state=200, pivot=True, n_jobs=-1)#
Compute mutual information in parallel per feature.
- Parameters:
- Return type:
- Returns:
DataFrame with [“target”, “feature”, “score”] or pivoted.
Example
>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [1, 2, 3]})
>>> compute_mutual_information(df, "y", "x", task="regression")
- twiga.core.stats.ppscore.compute_ppscore(df, x, y, sample=5000, cross_validation=4, random_seed=123, catch_errors=True)#
Calculates the Predictive Power Score for a single (x, y) pair.
- Parameters:
df (
DataFrame) – The input pandas DataFrame.x (
str) – The name of the predictor (feature) column.y (
str) – The name of the target column to be predicted.sample (
int|None) – Max rows to sample. If None, uses all rows.cross_validation (
int) – Number of folds for cross-validation.random_seed (
int) – Seed for reproducibility.catch_errors (
bool) – If True, returns error dict instead of raising Exception.
- Return type:
- Returns:
A dictionary containing the PPS and underlying metrics.
Lag & embedding selection#
- twiga.core.stats.ami.get_ami_profile(series, n_neighbors=8, max_horizons=None, random_state=42)#
Compute the horizon-specific Auto-Mutual Information (AMI) profile.
AMI(h) = I(Δx_t ; Δx_{t+h}) is estimated via the k-nearest-neighbour (kNN) estimator of Kraskov et al. (2004) as implemented in sklearn.feature_selection.mutual_info_regression(). The kNN estimator is invariant to monotone rescaling, so no standardisation of the differenced series is applied.
The profile is computed on the first-differenced series to remove level persistence and trend. The raw (level) series is not modified by this function.
- Parameters:
series (ndarray | Series) – 1-D target series as a numpy array or pandas Series. Must be free of NaN values and have length >= n_neighbors + 2 after differencing.
n_neighbors (int) – Number of nearest neighbours for the kNN MI estimator. Higher values reduce variance but increase bias. Defaults to 8, following the convention of Catt (2024).
max_horizons (int | None) – Maximum horizon h to include in the profile. When None, the maximum feasible horizon n_diff - (n_neighbors + 1) is used. Provide an explicit value to cap the profile for long series and reduce runtime. Defaults to None.
random_state (int) – Random seed for the kNN MI estimator. Defaults to 42.
- Return type:
- Returns:
Tuple (horizons, ami_values), where horizons is a 1-D integer array [1, 2, ..., max_h] and ami_values is a 1-D float array of non-negative AMI values (in nats) at each horizon, clipped to [0, inf).
Both arrays are empty when the differenced series is too short to compute at least one horizon.
- Raises:
ValueError – If series is not 1-D after squeezing.
Examples
>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> signal = np.sin(2 * np.pi * np.arange(500) / 48) + 0.1 * rng.standard_normal(500)
>>> horizons, ami = get_ami_profile(signal, max_horizons=96)
>>> print(f"AMI(h=1)={ami[0]:.3f} peak_h={horizons[ami.argmax()]}")
- twiga.core.stats.utils.get_optimal_lags(series, max_lags=None, min_data_points=30, period=None, lb_test=True)#
Select optimal autoregressive lags using multiple information criteria.
The selection process evaluates fitted autoregressive models and compares:
AIC (Akaike Information Criterion)
BIC (Bayesian Information Criterion)
HQIC (Hannan-Quinn Information Criterion)
Optionally, Ljung-Box residual diagnostics are also computed.
- Parameters:
- Return type:
- Returns:
Dictionary containing optimal lag by criterion and diagnostics.
- Raises:
ValueError – If the series is too short or no model can be fitted.
- twiga.core.stats.utils.get_optimal_embedding(period)#
Determine an embedding dimension from a pandas-style frequency period.
- twiga.core.stats.utils.get_entropy_parameters(series, method, period=None, data_driven_delay=True)#
Recommend entropy embedding/tolerance parameters from theory-informed rules.
Rules implemented:
- Sample/Approximate entropy: m in {2, 3}, r in [0.1*sigma, 0.25*sigma].
- Permutation entropy: m in [3, 7] with sample-size constraint n >> m!.
- Optional data-driven delay from autocorrelation first crossing.
- Parameters:
- Return type:
- Returns:
Parameter dictionary suitable for entropy functions.
- Raises:
ValueError – If method is unsupported or series is invalid.
- twiga.core.stats.utils.get_sample_per_day(frequency)#
Calculate the number of samples per day from a sampling frequency.
Supports pandas frequency strings and unit-only abbreviations like h, min, and s.
- Parameters:
frequency (str | Timedelta) – Sampling frequency as a string or pandas Timedelta.
- Return type:
- Returns:
Number of samples per day for the given frequency.
- Raises:
ValueError – If the frequency cannot be parsed or is invalid.
- twiga.core.stats.utils.normalise_series(series)#
Normalize a series by centering and scaling to unit variance.
- Parameters:
- Return type:
- Returns:
Normalized series values.
- Raises:
ValueError – If the series is empty or has zero/invalid variance.
Stationarity & seasonality#
- twiga.core.stats.stationarity.adf_test(values, maxlag=None, regression='c', autolag='BIC')#
Run the Augmented Dickey-Fuller (ADF) test.
- Parameters:
- Return type:
- Returns:
Statsmodels ADF result tuple.
- Raises:
ValueError – If input validation fails or test execution fails.
- twiga.core.stats.stationarity.kpss_test(values, regression='c', maxlag=None, suppress_interpolation_warning=True)#
Run the KPSS stationarity test.
- Parameters:
- Return type:
- Returns:
Statsmodels KPSS result tuple.
- Raises:
ValueError – If input validation fails or test execution fails.
- twiga.core.stats.seasonality.get_acf_values(series, maxlag=None, alpha=0.05, seasonal_length=None, fft=True, adjusted=False, missing='none')#
Compute ACF values and confidence intervals excluding lag 0.
- Parameters:
maxlag (int | None) – Maximum lag to compute. If None, inferred from series length.
alpha (float) – Significance level for confidence intervals.
seasonal_length (int | None) – Optional seasonal period used to ensure enough lag coverage.
fft (bool) – Whether to compute ACF using FFT.
adjusted (bool) – Whether to use n-k denominator in autocovariance.
missing (str) – Missing-value handling mode for statsmodels ACF.
- Return type:
- Returns:
Tuple of (acf_values, acf_confint), both excluding lag 0.
- Raises:
ValueError – If ACF calculation fails.
- twiga.core.stats.seasonality.get_trend_seasonality(series, periods, component_names=None, iterate=3, seasonal_deg=0, inner_iter=2, outer_iter=0)#
Decompose a time series into trend, seasonal components, and residuals using MSTL.
MSTL (Multiple Seasonal-Trend decomposition using Loess) extends STL to handle multiple seasonality periods simultaneously — e.g. daily (48) and weekly (48 * 7) patterns in a 30-minute series.
- Parameters:
series (Series) – Input time series as a pandas Series with a DatetimeIndex. Must not contain NaN values.
periods (list[int]) – List of seasonal period lengths in number of observations. For example, [48] for a single daily cycle at 30-minute resolution, or [48, 336] for daily + weekly.
component_names (list[str] | None) – Names for the seasonal components, one per period. If None, defaults to ["Seasonal_<p>" for p in periods].
iterate (int) – Number of outer MSTL iterations. Defaults to 3.
seasonal_deg (int) – Degree of the seasonal LOESS smoother (0 = constant, 1 = linear). Defaults to 0.
inner_iter (int) – Number of inner loop iterations in STL. Defaults to 2.
outer_iter (int) – Number of outer (robustness) iterations in STL. Defaults to 0.
- Return type:
- Returns:
DataFrame with the same index as series containing one column per seasonal component (named from component_names), plus "Trend" and "Residual" columns.
- Raises:
ValueError – If len(periods) != len(component_names) when component_names is provided.
ValueError – If series contains NaN values.
ValueError – If any period is less than 2.
Examples
>>> import pandas as pd, numpy as np
>>> idx = pd.date_range("2024-01-01", periods=1000, freq="30min")
>>> ts = pd.Series(np.sin(np.linspace(0, 40 * np.pi, 1000)), index=idx)
>>> result = get_trend_seasonality(ts, periods=[48], component_names=["Daily"])
>>> result.columns.tolist()
['Daily', 'Trend', 'Residual']
- twiga.core.stats.autocorr.get_pacf_values(series, maxlag, alpha=0.05, method='ywm')#
Calculate PACF values and confidence intervals.
Computes the Partial Autocorrelation Function using the Yule-Walker method and removes lag 0 (which is always 1.0) for practical use.
- Parameters:
series – Time series data as numpy array or pandas Series.
maxlag – Maximum lag order to calculate.
alpha – Significance level for confidence intervals. Default is 0.05 for 95% confidence.
method – PACF estimation method. Default is ‘ywm’ (Yule-Walker). Other options include ‘ols’, ‘ywunbiased’, ‘ld’.
- Returns:
Tuple of (pacf_values, pacf_confint):
pacf_values: PACF coefficients excluding lag 0 (numpy array)
pacf_confint: Confidence intervals for each lag (numpy array of tuples)
Examples
>>> import numpy as np
>>> series = np.random.randn(100)
>>> pacf_vals, conf_int = get_pacf_values(series, maxlag=10)
>>> print(f"PACF at lag 1: {pacf_vals[0]:.4f}")
- twiga.core.stats.autocorr.estimate_ar_order(series, maxlag, alpha=0.05, method='ywm')#
Estimate optimal AR model order using PACF analysis.
Implements the Box-Jenkins methodology for AR order selection by analyzing the Partial Autocorrelation Function cutoff pattern. An AR(p) process should have significant PACF values up to lag p, then cut off to zero.
- Parameters:
- Returns:
Tuple of (significant_lags, suggested_ar_order) –
significant_lags: List of integer lag values where PACF is statistically significant (outside confidence interval)
suggested_ar_order: Integer AR order where PACF cuts off. Represents the last consecutive significant lag.
Examples
>>> import numpy as np
>>> series = np.random.randn(200)  # in practice, use a real or simulated AR process
>>> sig_lags, ar_order = estimate_ar_order(series, maxlag=10)
>>> print(f"Suggested AR order: {ar_order}")
>>> print(f"All significant lags: {sig_lags}")
Note
If PACF shows non-consecutive significant lags (e.g., lags 1, 2, 5 are significant but 3, 4 are not), this suggests a more complex process (ARMA or seasonal) rather than pure AR.
Complexity & entropy#
- twiga.core.stats.entropy.get_permutation_entropy(series, order=None, delay=1, normalize=True)#
Calculate permutation entropy for a time series.
Permutation entropy (PE) measures the complexity of a time series by analyzing the order relations between values. It captures the diversity of patterns in the data, with higher values indicating more complex and less predictable behavior.
- Parameters:
series (ndarray | Series) – One-dimensional time series data.
order (int | None) – Embedding dimension (pattern length). Determines the length of sequences to analyze. Default is 3.
delay (int) – Time delay between elements in the sequence. Default is 1.
normalize (bool) – Whether to normalize the entropy value to [0, 1]. Default is True.
- Return type:
- Returns:
Permutation entropy value. Returns np.nan if calculation fails.
Examples
>>> import numpy as np
>>> regular_series = np.array([1, 2, 3, 1, 2, 3])
>>> get_permutation_entropy(regular_series)
0.123  # Low entropy - regular pattern
>>> random_series = np.random.rand(100)
>>> get_permutation_entropy(random_series)
0.876  # High entropy - random pattern
Notes
The choice of order and delay can significantly affect the PE value. Common practice is to use order=3 and delay=1 for many applications.
- twiga.core.stats.entropy.get_hurst_exponent(series)#
Calculate Hurst exponent for a time series.
The Hurst exponent (H) measures the long-term memory of a time series. It quantifies the tendency of a time series to either regress strongly to the mean (H < 0.5), behave like a random walk (H ≈ 0.5), or exhibit long-term positive autocorrelation (H > 0.5).
- Parameters:
series (ndarray | Series) – One-dimensional time series data.
- Return type:
- Returns:
Hurst exponent value. Returns np.nan if calculation fails.
Examples
>>> import numpy as np
>>> random_series = np.random.rand(100)
>>> get_hurst_exponent(random_series)
0.5  # Random walk - no long-term memory
Residual diagnostics#
- twiga.core.stats.residual.cooks_distance(standardized_residuals, leverage, n_features)#
Compute Cook’s distance, the influence of each observation.
- Parameters:
- Return type:
- Returns:
Cook’s distance \(D_i\) for each observation.
Notes
Cook’s distance measures how much the regression coefficients would change if the \(i\)-th point were deleted:
\[D_i = \frac{\tilde{r}_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}}\]
where \(p\) is the number of predictors.
Rule of thumb:
- \(D_i > 1\): highly influential
- \(D_i > \frac{4}{n}\): potentially problematic
Examples
>>> std_res = np.array([2.0, 0.5, 3.0])
>>> lev = np.array([0.8, 0.3, 0.9])
>>> cooks_distance(std_res, lev, n_features=2)
array([8.0, 0.107..., 27.0])
Next: Data Pipeline | Metrics | Visualization