Forecastability Analysis

What you’ll build

A Forecastability Profile for the MLVS-PT net-load signal - a concise set of metrics (entropy, seasonality, stationarity, feature correlation) that tells you how predictable the series is and directly maps to every parameter in DataPipelineConfig.

Prerequisites

NB01 - Getting Started
Basic Python (lists, dicts, imports)

Learning objectives

By the end of this notebook you will be able to:

Explain what entropy metrics measure and interpret Permutation Entropy, Hurst Exponent, and Sample Entropy for a time series
Identify dominant seasonal periods using the Autocorrelation Function (ACF) and translate them into lags and lookback_window_size
Estimate the autoregressive order of a series from the Partial Autocorrelation Function (PACF)
Determine whether a series is stationary using ADF and KPSS tests and select an appropriate scaler
Rank exogenous features by their predictive power using Xi correlation and decide which to include in exogenous_features

The five-step workflow

flowchart LR
    A["📈 Time Series<br>(raw)"]
    B["📉 Entropy<br>(structure)"]
    C["🔁 Seasonality<br>(ACF/PACF)"]
    D["📏 Stationarity<br>(ADF/KPSS)"]
    E["🔗 Feature Corr<br>(xicor)"]
    F["⚙ Profile<br>(config)"]

    A --> B --> C --> D --> E --> F

Every analysis step in this notebook produces a concrete recommendation for DataPipelineConfig - the profile at the end assembles them all.

Setup

We import five groups of libraries:

warnings: suppress all warnings (mostly noisy deprecation messages) so output stays readable.
great_tables / lets_plot: rendering styled tables and interactive plots inside the notebook.
numpy / pandas: array maths and tabular data manipulation.
Twiga utilities: configure() sets up logging; get_logger() gives us a labelled logger for clean output.
Twiga plot helpers: plot_acf, plot_metrics_bar, plot_timeseries are thin wrappers around LetsPlot tuned for time series work.

import warnings

warnings.filterwarnings("ignore")

import json

from great_tables import GT, md
from lets_plot import LetsPlot, aes, geom_line, geom_point, gggrid, ggplot, ggsize, ggtitle, labs
import numpy as np
import pandas as pd

LetsPlot.setup_html()

from twiga.core.plot import TWIGA_PALETTE, plot_acf, plot_metrics_bar, plot_timeseries, twiga_theme
from twiga.core.plot.gt import twiga_gt
from twiga.core.utils import configure, get_logger

configure()
log = get_logger("tutorials")

# Load dataset
data = pd.read_parquet("../data/MLVS-PT.parquet", columns=["timestamp", "NetLoad(kW)", "Ghi", "Temperature", "Season"])
data["timestamp"] = pd.to_datetime(data["timestamp"])
# Restrict to 2019-2020 to keep tutorial execution fast.
# The upper bound is exclusive so that all of 31 December 2020 is kept.
data = data[(data["timestamp"] >= "2019-01-01") & (data["timestamp"] < "2021-01-01")].reset_index(drop=True)

log.info(f"Shape : {data.shape}")
log.info(f"Period: {data['timestamp'].min()} → {data['timestamp'].max()}")
twiga_gt(GT(data.head(3)))

# Extract target as numpy array — used throughout
series = data["NetLoad(kW)"].values
log.info(f"Series length : {len(series):,} samples")
log.info(f"Min / Max     : {series.min():.1f} / {series.max():.1f} kW")
log.info(f"Mean ± Std    : {series.mean():.1f} ± {series.std():.1f} kW")

# Visual overview — full series so we can see structure, cycles, and any anomalies
p = plot_timeseries(
    data,
    date_col="timestamp",
    y_cols="NetLoad(kW)",
    title="Net Load — MLVS-PT (2019 – 2020)",
    y_label="Net Load (kW)",
    x_label="Date",
    fig_size=(920, 280),
)
p

Key concept - Entropy

In information theory, entropy measures how unpredictable a sequence is. A completely regular signal (e.g., a perfect sine wave) has near-zero entropy - you can predict each value perfectly from the pattern. A completely random sequence (white noise) has maximum entropy - there is no pattern to exploit.

For time series forecasting, lower entropy means the signal has more exploitable structure, so a model can learn better. The metrics below each capture a different aspect of this regularity:

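Permutation Entropy (PE) - the Shannon entropy of the distribution of ordinal (rank-order) patterns found in short windows of the series, normalised to the range 0 to 1. Near 0 = highly regular; near 1 = noise-like, with all patterns roughly equally likely.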

Sample Entropy (SampEn) - measures the probability that similar subsequences remain similar when extended by one step. Low values mean the series is self-similar (predictable).

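Hurst Exponent (H) - tests for long-range memory. H > 0.5 means the series is persistent: an upward move tends to be followed by further upward moves, and a downward move by further downward moves. H ≈ 0.5 indicates no long-range memory. H < 0.5 means the series is anti-persistent (mean-reverting): moves tend to reverse, which this notebook treats as less useful for direct forecasting.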

DFA Exponent (α) - Detrended Fluctuation Analysis; similar interpretation to the Hurst exponent but more robust to non-stationarity.
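To make the ordinal-pattern idea behind Permutation Entropy concrete, here is a minimal pure-Python sketch. It is not Twiga's `get_permutation_entropy`; the function name `permutation_entropy` and its `order` and `delay` parameters are illustrative assumptions.

```python
import math
from collections import Counter


def permutation_entropy(values, order=3, delay=1):
    """Normalised permutation entropy in [0, 1] (illustrative sketch only)."""
    n_windows = len(values) - (order - 1) * delay
    if n_windows < 1:
        raise ValueError("series too short for the chosen order and delay")

    patterns = Counter()
    for start in range(n_windows):
        window = [values[start + k * delay] for k in range(order)]
        # Ordinal pattern: the index order that would sort the window.
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        patterns[pattern] += 1

    probs = [count / n_windows for count in patterns.values()]
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    # Divide by the maximum entropy, log(order!), to scale into [0, 1].
    return entropy / math.log(math.factorial(order))


ramp = list(range(200))                  # strictly increasing: a single pattern
sawtooth = [i % 4 for i in range(200)]   # short repeating cycle: a few patterns
pe_ramp = permutation_entropy(ramp)
pe_sawtooth = permutation_entropy(sawtooth)
print(f"ramp: {pe_ramp:.3f}, sawtooth: {pe_sawtooth:.3f}")
```

The ramp produces only one ordinal pattern, so its entropy is zero. The sawtooth cycles through a handful of patterns and lands between the two extremes.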

1. Entropy: How much structure exists?

Entropy metrics quantify the balance between regularity and randomness in the signal. A series with low entropy has strong, repeating patterns and is generally easier to forecast.

Metric	Scale	Forecastable signal
Permutation Entropy (PE)	0 → 1	near 0 (regular)
Approximate Entropy (ApEn)	≥ 0	low
Sample Entropy (SampEn)	≥ 0	< 0.5
Hurst Exponent (H)	0 → 1	> 0.5 (persistent)
DFA Exponent (α)	0 → 2	> 0.5 (persistent)

entropy_ref = pd.DataFrame(
    {
        "Metric": [
            "Permutation Entropy (PE)",
            "Approximate Entropy (ApEn)",
            "Sample Entropy (SampEn)",
            "Hurst Exponent (H)",
            "DFA Exponent (α)",
        ],
        "Scale": ["0 → 1", "≥ 0", "≥ 0", "0 → 1", "0 → 2"],
        "Forecastable signal": ["near 0 (regular)", "low", "< 0.5", "> 0.5 (persistent)", "> 0.5 (persistent)"],
        "Interpretation": [
            "Near 0 = strong repeating patterns; near 1 = chaotic",
            "Low = self-similar structure; high = complex",
            "< 0.5 = highly regular; > 1.0 = high complexity",
            "> 0.5 = persistent memory (past trends continue)",
            "> 0.5 = long-range correlations in the signal",
        ],
    }
)

twiga_gt(
    GT(entropy_ref)
    .tab_header(
        title=md("**Entropy Metrics — Reference Guide**"), subtitle="Lower entropy / higher Hurst = more forecastable"
    )
    .cols_label(
        Metric=md("**Metric**"),
        Scale=md("**Scale**"),
        **{"Forecastable signal": md("**Forecastable signal**")},
        Interpretation=md("**Interpretation**"),
    )
    .tab_source_note("Twiga Forecast · NB02 — Forecastability Analysis"),
    n_rows=len(entropy_ref),
)

from twiga.core.stats.entropy import (
    get_approx_entropy,
    get_dfa_exponent,
    get_hurst_exponent,
    get_permutation_entropy,
    get_sample_entropy,
)

log.info("Computing entropy metrics (this may take ~30 s) ...")

pe = get_permutation_entropy(series)
apen = get_approx_entropy(series)
se = get_sample_entropy(series)
h = get_hurst_exponent(series)
dfa = get_dfa_exponent(series)

entropy_summary = [
    {
        "Metric": "Permutation Entropy (PE)",
        "Value": pe,
        "Interpretation": "regular / predictable" if pe < 0.5 else "complex / chaotic",
    },
    {
        "Metric": "Approximate Entropy (ApEn)",
        "Value": apen,
        "Interpretation": "low = self-similar structure",
    },
    {
        "Metric": "Sample Entropy (SampEn)",
        "Value": se,
        "Interpretation": "highly regular" if se < 0.5 else "moderate complexity" if se < 1.0 else "high complexity",
    },
    {
        "Metric": "Hurst Exponent (H)",
        "Value": h,
        "Interpretation": "persistent memory (trending)" if h > 0.5 else "mean-reverting or random walk",
    },
    {
        "Metric": "DFA Exponent (α)",
        "Value": dfa,
        "Interpretation": "long-range correlations" if dfa > 0.5 else "white-noise-like",
    },
]

log.info("\nEntropy Summary")
table = pd.DataFrame(entropy_summary)

twiga_gt(
    GT(table.round({"Value": 3}))
    .tab_header(title=md("**Entropy Metrics — MLVS-PT Net Load**"))
    .cols_label(
        Metric=md("**Metric**"),
        Value=md("**Value**"),
        Interpretation=md("**Interpretation**"),
    )
    .tab_source_note("Twiga Forecast · NB02 — Forecastability Analysis"),
    n_rows=len(table),
)

Takeaway: If the Hurst exponent comes out above 0.5, the net-load series has persistent long-range memory, so past values are informative for future ones. A low PE likewise indicates that repeating daily/weekly patterns dominate the signal, which makes it well-suited to model-based forecasting.

2. Seasonality: What cycles can we exploit?

The Autocorrelation Function (ACF) reveals periodic structure. Peaks at multiples of a seasonal period (expressed in samples) indicate exploitable seasonality.

For 30-minute data:

Lag 48 = 24 h daily cycle
Lag 336 = 7-day weekly cycle
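The lag arithmetic above can be checked with a short pure-Python sketch. The helpers `steps_per_period` and `acf_at_lag` are hypothetical names introduced here for illustration, and the signal is synthetic rather than the MLVS-PT data.

```python
import math


def steps_per_period(period_minutes, sample_minutes=30):
    """Number of samples in one seasonal period."""
    return period_minutes // sample_minutes


def acf_at_lag(values, lag):
    """Sample autocorrelation at a single positive lag."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    denom = sum(c * c for c in centred)
    numer = sum(centred[t] * centred[t + lag] for t in range(n - lag))
    return numer / denom


daily_lag = steps_per_period(24 * 60)        # 48
weekly_lag = steps_per_period(7 * 24 * 60)   # 336

# Synthetic signal with a pure daily cycle, two weeks long.
signal = [math.sin(2 * math.pi * t / daily_lag) for t in range(14 * daily_lag)]
acf_half_day = acf_at_lag(signal, daily_lag // 2)
acf_full_day = acf_at_lag(signal, daily_lag)
print(daily_lag, weekly_lag, round(acf_half_day, 2), round(acf_full_day, 2))
```

For a purely daily cycle the ACF is strongly negative half a day out and strongly positive one full day out, which is the peak pattern to look for at lag 48 in the real data.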

from twiga.core.stats.seasonality import get_acf_values

# Compute ACF — maxlag=336 covers one full week of 30-min data
acf_values, acf_confint = get_acf_values(series, maxlag=336)

p = plot_acf(
    pd.Series(series, name="NetLoad(kW)"),
    max_lag=336,
    title="Autocorrelation Function — NetLoad",
    x_label="Lag (30-min intervals)",
)
p

# Lags with |ACF| > 0.5 (first ten only). The indexing here assumes acf_values[0] is lag 1, hence lag + 1.
dominant_lags = [lag + 1 for lag, v in enumerate(acf_values) if abs(v) > 0.5][:10]
log.info(f"Dominant lags (|ACF| > 0.5): {dominant_lags}")
log.info(f"ACF at lag 48  (daily) : {acf_values[47]:.4f}")
log.info(f"ACF at lag 336 (weekly): {acf_values[335]:.4f}")

Takeaway: Strong ACF peaks at lag 48 and lag 336 confirm daily and weekly seasonality. These directly inform DataPipelineConfig.lookback_window_size and lags.

3. Autoregressive Order: Which lags are informative?

The Partial Autocorrelation Function (PACF) isolates the direct effect of each lag after removing the influence of shorter lags. The lag at which PACF cuts off estimates the AR order p.
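As a rough illustration of the cutoff behaviour, the sketch below computes the PACF with the Durbin-Levinson recursion on a simulated AR(1) series. It is a from-scratch example, not the method used by Twiga's `get_pacf_values` or `estimate_ar_order`; the function names, the coefficient 0.7, and the 1.96/√n significance band are assumptions chosen for the demonstration.

```python
import random


def sample_acf(values, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    denom = sum(c * c for c in centred)
    return [
        sum(centred[t] * centred[t + lag] for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


def pacf_durbin_levinson(values, max_lag):
    """PACF for lags 1..max_lag via the Durbin-Levinson recursion."""
    rho = sample_acf(values, max_lag)
    pacf = []
    phi_prev = []  # AR coefficients of the order-(k-1) fit
    for k in range(1, max_lag + 1):
        numer = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        denom = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = numer / denom
        # Update the lower-order coefficients for the order-k fit.
        phi_curr = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_curr.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi_curr
    return pacf


# Simulate an AR(1) process: x[t] = 0.7 * x[t-1] + noise.
rng = random.Random(0)
x = [0.0]
for _ in range(4999):
    x.append(0.7 * x[-1] + rng.gauss(0.0, 1.0))

pacf = pacf_durbin_levinson(x, max_lag=5)
threshold = 1.96 / len(x) ** 0.5  # approximate 95% band under white noise
significant = [lag for lag, v in enumerate(pacf, start=1) if abs(v) > threshold]
print([round(v, 3) for v in pacf], significant)
```

For an AR(1) process the PACF at lag 1 is close to the true coefficient and the later lags hover near zero, which is the "cutoff" the text refers to.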

from twiga.core.stats.autocorr import estimate_ar_order, get_pacf_values

# Compute PACF up to lag 100
pacf_values, pacf_confint = get_pacf_values(series, maxlag=100)

p = plot_acf(
    pd.Series(series, name="NetLoad(kW)"),
    max_lag=100,
    partial=True,
    title="Partial Autocorrelation Function — NetLoad",
    x_label="Lag",
)
p

# Estimate AR order from PACF cutoff
significant_lags, ar_order = estimate_ar_order(series, maxlag=100)

ar_summary = pd.DataFrame(
    [
        {
            "Metric": "AR order (PACF cutoff)",
            "Value": ar_order,
            "Interpretation": "The smallest autoregressive order that captures direct lag dependence before PACF values become insignificant.",
        },
        {
            "Metric": "Recommended lag set",
            "Value": str([ar_order, 48]),
            "Interpretation": "Use the PACF cutoff and the dominant daily cycle to capture the most informative structure at 30-min resolution.",
        },
        {
            "Metric": "Recommended lookback_window_size",
            "Value": max(ar_order, 48),
            "Interpretation": "The lookback window should span the longest informative lag, so the model sees enough history.",
        },
    ]
)

log.info(f"Significant PACF lags : {significant_lags[:15]} ...")
log.info(f"Suggested AR order    : {ar_order}")
log.info("")

# Display the same interpretation as a styled table for clarity
twiga_gt(
    GT(ar_summary)
    .tab_header(
        title=md("**AR Order — PACF Interpretation**"),
        subtitle="Translate PACF cutoff and lag recommendations into DataPipelineConfig guidance",
    )
    .cols_label(
        Metric=md("**Metric**"),
        Value=md("**Value**"),
        Interpretation=md("**Interpretation**"),
    )
    .tab_source_note("Twiga Forecast · NB02 — Forecastability Analysis"),
    n_rows=len(ar_summary),
)

Takeaway: The PACF cutoff lag gives the minimal AR order that captures direct dependencies. Short lags beyond the cutoff are largely explained by earlier lags and tend to add noise rather than signal; seasonal lags such as 48 are added separately, as in the recommended lag set.

Key concept - Stationarity

A time series is stationary if its statistical properties (mean, variance, autocorrelation) do not change over time. Most forecasting models implicitly assume stationarity - if the distribution shifts, a model trained on historical data may perform poorly on future data.

ADF test (Augmented Dickey-Fuller): tests the null hypothesis that the series has a unit root (non-stationary). A small p-value (< 0.05) means we can reject the unit root - the series is likely stationary.

KPSS test (Kwiatkowski-Phillips-Schmidt-Shin): tests the null hypothesis that the series is stationary. A large p-value (> 0.05) means we cannot reject stationarity - consistent with a stationary series.

Using both tests together gives stronger evidence. If they agree (ADF rejects the unit root AND KPSS cannot reject stationarity), StandardScaler is appropriate. If they disagree, consider differencing or a RobustScaler.

4. Stationarity: Is the distribution stable?

A stationary series has constant mean and variance over time. Whether the series is stationary determines the appropriate scaler and whether differencing is needed.

ADF p-value	KPSS p-value	Conclusion
< 0.05	> 0.05	Stationary - standard scaling sufficient
> 0.05	< 0.05	Non-stationary - consider differencing or robust scaling
< 0.05	< 0.05	Trend-stationary - trend present, detrend before fitting
> 0.05	> 0.05	Inconclusive - inspect further
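For the inconclusive case, one simple way to "inspect further" is to compare the mean and variance across consecutive windows. The sketch below does this in pure Python on synthetic data; `window_stats` is a hypothetical helper, and the weekly window and the trend slope of 0.01 per step are arbitrary choices for illustration.

```python
import math
from statistics import fmean, pvariance


def window_stats(values, window):
    """Mean and variance of consecutive non-overlapping windows."""
    stats = []
    for start in range(0, len(values) - window + 1, window):
        chunk = values[start:start + window]
        stats.append((fmean(chunk), pvariance(chunk)))
    return stats


daily = 48    # samples per day at 30-min resolution
n_days = 28
cycle = [math.sin(2 * math.pi * t / daily) for t in range(n_days * daily)]
drifting = [c + 0.01 * t for t, c in enumerate(cycle)]  # same cycle plus a linear trend

stable_means = [m for m, _ in window_stats(cycle, window=7 * daily)]
drift_means = [m for m, _ in window_stats(drifting, window=7 * daily)]
print("spread of weekly means, stable  :", max(stable_means) - min(stable_means))
print("spread of weekly means, drifting:", max(drift_means) - min(drift_means))
```

Using a window that spans whole seasonal cycles keeps the daily pattern from showing up as a spurious shift in the mean. Weekly means that stay flat are consistent with stationarity, while a steady drift points to a trend.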

stationarity_ref = pd.DataFrame(
    {
        "ADF p-value": ["< 0.05", "> 0.05", "< 0.05", "> 0.05"],
        "KPSS p-value": ["> 0.05", "< 0.05", "< 0.05", "> 0.05"],
        "Conclusion": ["Stationary", "Non-stationary", "Trend-stationary", "Inconclusive"],
        "Recommended scaler": [
            "StandardScaler()",
            "RobustScaler() or difference first",
            "Detrend, then StandardScaler()",
            "Inspect rolling statistics",
        ],
    }
)

twiga_gt(
    GT(stationarity_ref)
    .tab_header(title=md("**Stationarity Decision Table**"), subtitle="ADF + KPSS together give stronger evidence")
    .cols_label(
        **{
            "ADF p-value": md("**ADF p-value**"),
            "KPSS p-value": md("**KPSS p-value**"),
            "Conclusion": md("**Conclusion**"),
            "Recommended scaler": md("**Recommended scaler**"),
        },
    )
    .tab_source_note("Twiga Forecast · NB02 — Forecastability Analysis"),
    n_rows=len(stationarity_ref),
)

from twiga.core.stats.stationarity import adf_test, kpss_test

adf_result = adf_test(series)
kpss_result = kpss_test(series)

adf_pval = adf_result[1]
kpss_pval = kpss_result[1]

if adf_pval < 0.05 and kpss_pval > 0.05:
    verdict = "STATIONARY — StandardScaler is appropriate."
elif adf_pval > 0.05 and kpss_pval < 0.05:
    verdict = "NON-STATIONARY — consider differencing or RobustScaler."
elif adf_pval < 0.05 and kpss_pval < 0.05:
    verdict = "TREND-STATIONARY — detrend before fitting."
else:
    verdict = "INCONCLUSIVE — inspect rolling statistics."

stationarity_summary = pd.DataFrame(
    [
        {
            "Metric": "ADF statistic",
            "Value": f"{adf_result[0]:.4f}",
            "Interpretation": "More negative values give stronger evidence against the unit-root null.",
        },
        {
            "Metric": "ADF p-value",
            "Value": f"{adf_pval:.6f}",
            "Interpretation": "✓ reject unit root (stationary)" if adf_pval < 0.05 else "✗ cannot reject unit root",
        },
        {
            "Metric": "KPSS statistic",
            "Value": f"{kpss_result[0]:.4f}",
            "Interpretation": "Larger values indicate stronger evidence against stationarity.",
        },
        {
            "Metric": "KPSS p-value",
            "Value": f"{kpss_pval:.6f}",
            "Interpretation": "✓ cannot reject stationarity" if kpss_pval > 0.05 else "✗ reject stationarity",
        },
        {
            "Metric": "Verdict",
            "Value": verdict,
            "Interpretation": "Combined ADF + KPSS result used to choose scaling or differencing.",
        },
    ]
)

# Display the stationarity test results and the combined decision in a styled table

twiga_gt(
    GT(stationarity_summary)
    .tab_header(
        title=md("**Stationarity Tests — NetLoad**"),
        subtitle="ADF and KPSS together determine whether the series is stationary",
    )
    .cols_label(
        Metric=md("**Metric**"),
        Value=md("**Value**"),
        Interpretation=md("**Interpretation**"),
    )
    .tab_source_note("Twiga Forecast · NB02 — Forecastability Analysis"),
    n_rows=len(stationarity_summary),
)

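5. AMI Profile: How far does exploitable information reach?

The AMI profile gives, for each forecast horizon h, the mutual information between the differenced series and itself h steps later. ami_h1, the AMI at h = 1, anchors the signal strength at the shortest horizon. rel_auc = mean_AMI / AMI(h=1), where mean_AMI is the average over the computed horizons, measures persistence: a high rel_auc means the signal decays slowly across horizons; a low rel_auc means it collapses immediately.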

effective_horizon is the first horizon where AMI drops below ami_noise_floor × AMI(h=1) - the practical boundary of the useful forecast window.
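The two summary quantities can be computed from any AMI profile with a few lines of Python. The sketch below follows the definitions in the text; the function `summarise_ami`, the default `noise_floor=0.1`, and the two example profiles are made-up illustrations, not Twiga outputs or defaults.

```python
def summarise_ami(ami_values, noise_floor=0.1):
    """Summarise an AMI profile given as [AMI(h=1), AMI(h=2), ...].

    noise_floor plays the role of ami_noise_floor; 0.1 is an assumed value.
    """
    ami_h1 = ami_values[0]
    rel_auc = (sum(ami_values) / len(ami_values)) / ami_h1
    effective_horizon = None  # None means AMI never drops below the floor
    for horizon, ami in enumerate(ami_values, start=1):
        if ami < noise_floor * ami_h1:
            effective_horizon = horizon
            break
    return {"ami_h1": ami_h1, "rel_auc": rel_auc, "effective_horizon": effective_horizon}


slow_decay = [0.80, 0.70, 0.60, 0.50, 0.40, 0.30]   # information persists
fast_decay = [0.80, 0.20, 0.05, 0.02, 0.01, 0.01]   # information collapses quickly
slow_summary = summarise_ami(slow_decay)
fast_summary = summarise_ami(fast_decay)
print(slow_summary)
print(fast_summary)
```

Both profiles start from the same AMI(h=1), yet the slowly decaying one has a much higher rel_auc and never reaches the noise floor within the computed horizons.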

from twiga.core.stats.ami import get_ami_profile

ami_horizons, ami_values = get_ami_profile(
    data["NetLoad(kW)"].to_numpy(dtype=float),
    n_neighbors=8,
    max_horizons=336,
)

ami_df = pd.DataFrame(
    {
        "horizon": ami_horizons.tolist(),
        "ami": ami_values.tolist(),
    }
)
p = (
    ggplot(ami_df, aes(x="horizon", y="ami"))
    + geom_line(color=TWIGA_PALETTE[0], size=1.0)
    + geom_point(color=TWIGA_PALETTE[0], size=2.0)
    + labs(x="Forecast horizon (steps)", y="AMI (nats)")
    + ggsize(800, 300)
    + twiga_theme()
)
p

Key concept - AMI vs. ACF

ACF measures linear autocorrelation at lag h. AMI captures any dependence (linear or nonlinear) between past increments and future increments. A series can have low ACF but high AMI, indicating nonlinear structure that linear models will miss.

AMI is computed on first-differenced values to remove level persistence. A series with strong trend will show high level-based AMI even if it contains no exploitable structure - differencing removes this deceptive baseline.
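The gap between linear correlation and mutual information can be seen on a toy pair of variables. The sketch below uses a simple histogram estimator rather than the nearest-neighbour estimator that `n_neighbors=8` suggests Twiga uses, and it skips the differencing step; the names `pearson` and `binned_mutual_information` and the choice of 8 bins are assumptions for illustration.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def binned_mutual_information(xs, ys, bins=8):
    """Histogram estimate of mutual information in nats."""
    def to_bin(value, lo, hi):
        # Clamp so the maximum value lands in the last bin.
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    pairs = [(to_bin(x, x_lo, x_hi), to_bin(y, y_lo, y_hi)) for x, y in zip(xs, ys)]
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(bx for bx, _ in pairs)
    py = Counter(by for _, by in pairs)
    return sum(
        (c / n) * math.log(c * n / (px[bx] * py[by]))
        for (bx, by), c in joint.items()
    )


past = [-1 + 2 * i / 2000 for i in range(2001)]   # symmetric grid on [-1, 1]
future = [x * x for x in past]                    # purely nonlinear dependence
r = pearson(past, future)
mi = binned_mutual_information(past, future)
print(f"Pearson r = {r:.3f}, mutual information = {mi:.3f} nats")
```

Here the "future" value is fully determined by the "past" value, but the relationship is symmetric, so the linear correlation is essentially zero while the mutual information is clearly positive.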

6. Forecastability Profile

This step assembles all findings into a single profile that maps directly to modelling decisions.

profile = {
    "permutation_entropy": round(pe, 4),
    "sample_entropy": round(se, 4),
    "hurst_exponent": round(h, 4),
    "dfa_exponent": round(dfa, 4),
    "is_stationary_adf": bool(adf_pval < 0.05),
    "is_stationary_kpss": bool(kpss_pval > 0.05),
    "dominant_seasonal_lags": dominant_lags,
}

log.info("Forecastability profile:\n%s", json.dumps(profile, indent=2, default=str))

From profile to model choices

The table below maps each finding from the forecastability profile to a concrete DataPipelineConfig parameter.

profile_decisions = pd.DataFrame(
    {
        "Profile finding": [
            "Strong ACF peak at lag 48",
            "ACF peak at lag 336",
            "H > 0.5 (persistent memory)",
            "Low PE (regular patterns)",
            "Stationary (ADF + KPSS agree)",
            "High xicor for Ghi, hour",
        ],
        "Model / config recommendation": [
            "`lookback_window_size >= 48`, `lags=[1, 48]`",
            "Consider weekly lags: `lags=[1, 48, 336]`",
            "Larger lookback helps; NN models benefit from attention/embedding layers",
            "Even simple ML baselines (LightGBM) should perform well",
            "`StandardScaler()` is sufficient for `input_scaler`",
            "Include in `exogenous_features` and `calendar_features`",
        ],
    }
)

twiga_gt(
    GT(profile_decisions)
    .tab_header(
        title=md("**From Profile to Model Choices**"),
        subtitle="Each finding maps directly to a DataPipelineConfig parameter",
    )
    .cols_label(
        **{
            "Profile finding": md("**Profile finding**"),
            "Model / config recommendation": md("**Model / config recommendation**"),
        }
    )
    .tab_source_note("Twiga Forecast · NB02 — Forecastability Analysis"),
    n_rows=len(profile_decisions),
)

Concretely, this profile suggests the following starting DataPipelineConfig:

from sklearn.preprocessing import StandardScaler
from twiga.core.config import DataPipelineConfig

data_config = DataPipelineConfig(
    target_feature="NetLoad(kW)",
    forecast_horizon=48,
    lookback_window_size=96,          # >= lag 48, covers 2 days
    lags=[1, 48, 336],                # dominant seasonal lags
    calendar_features=["hour", "day_night"],
    exogenous_features=["Ghi", "Temperature"],
    input_scaler=StandardScaler(),    # stationary series
    period="30min",
)

7. Signal Characterisation: One call to rule them all

The individual steps above (entropy, ACF/PACF, stationarity, AMI) are wrapped into a single SignalCharacteriser that runs all four diagnostic dimensions in one call and returns an immutable CharacterisationResult.

The result’s to_pipeline_hints() method translates the diagnostics directly into DataPipeline constructor kwargs, closing the loop between analysis and modelling.

from twiga.core.data.characterisation import CharacterisationConfig, SignalCharacteriser

# Cap the AMI profile at 336 horizons (one week of 30-min steps).
# Lower ami_max_horizons (e.g. 48 for one day) to shorten runtime, or remove or increase it for a full profile on your own data.
cfg = CharacterisationConfig(
    n_samples_per_day=48,  # 30-min data → 48 steps per day
    ami_max_horizons=336,  # 336 steps = 7 days × 48 steps per day
)

characteriser = SignalCharacteriser(cfg)
result = characteriser.analyse(data, target_column="NetLoad(kW)")

Tabular summary

summary_df = SignalCharacteriser.summary(result)
summary_df

Key outputs

print("Stationarity verdict  :", result.stationarity.verdict)
print("Integration order (d) :", result.stationarity.integration_order)
print("AR order (p̂)          :", result.temporal.ar_order)
print("Seasonal periods      :", result.temporal.seasonal_periods)
print("Dominant period       :", result.temporal.dominant_period)
print("Suggested lags        :", result.temporal.suggested_lags)
print()
print("Predictability label  :", result.predictability.label)
print("AMI(h=1)              :", f"{result.predictability.ami_h1:.4f}")
print("rel_auc               :", f"{result.predictability.rel_auc:.4f}")
print("Effective horizon     :", result.predictability.effective_horizon)

Pipeline hints: from characterisation to DataPipeline

hints = result.to_pipeline_hints()
print("DataPipeline hints:")
for k, v in hints.items():
    print(f"  {k:25s}: {v}")

Wrapping up

What you did

Computed Permutation Entropy, Sample Entropy, Hurst Exponent, and DFA Exponent to quantify signal complexity
Identified dominant seasonal periods (daily lag 48, weekly lag 336) from the ACF
Estimated the AR order (p = 7) from the PACF cutoff
Ran ADF and KPSS stationarity tests and selected an appropriate scaler
Computed the AMI profile to see how far ahead exploitable information reaches
Ranked exogenous features by Xi correlation to decide which to include in the config
Assembled a complete Forecastability Profile and mapped it to DataPipelineConfig parameters

Key takeaways for beginners

Entropy tells you how hard forecasting will be - a high Hurst exponent (H > 0.5) means the series has memory and patterns that a model can exploit.
ACF peaks are your lags - a strong peak at lag 48 means “yesterday at this time is a good predictor”. Set lags and lookback_window_size accordingly.
PACF gives the minimum AR order - short lags beyond the PACF cutoff are largely redundant and tend to add noise rather than signal; seasonal lags such as 48 are added separately.
Always test stationarity before choosing a scaler - StandardScaler works for stationary series; non-stationary series may need differencing or RobustScaler.
Xi correlation ranks features fairly - it captures non-linear dependencies that Pearson correlation can miss, which makes it a more reliable screen for engineering signals.
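For readers who want to see what Xi correlation measures, here is a minimal pure-Python sketch of Chatterjee's xi coefficient for data without ties. It is an independent illustration, not Twiga's feature-ranking code; the function name `xi_correlation` and the synthetic inputs are assumptions.

```python
import random


def xi_correlation(xs, ys):
    """Chatterjee's xi, assuming no ties in ys: 1 - 3 * sum|r[i+1] - r[i]| / (n^2 - 1)."""
    n = len(xs)
    # Rank of each y value (1 = smallest).
    order_by_y = sorted(range(n), key=lambda i: ys[i])
    rank = [0] * n
    for r, i in enumerate(order_by_y, start=1):
        rank[i] = r
    # Visit the points in increasing x and add up how far the y-ranks jump.
    order_by_x = sorted(range(n), key=lambda i: xs[i])
    ranks_along_x = [rank[i] for i in order_by_x]
    jumps = sum(abs(b - a) for a, b in zip(ranks_along_x, ranks_along_x[1:]))
    return 1.0 - 3.0 * jumps / (n * n - 1)


rng = random.Random(0)
feature = [-1 + 2.5 * i / 499 for i in range(500)]
target_quadratic = [x * x for x in feature]               # non-monotonic but deterministic
target_noise = [rng.gauss(0.0, 1.0) for _ in feature]     # unrelated to the feature

xi_quadratic = xi_correlation(feature, target_quadratic)
xi_noise = xi_correlation(feature, target_noise)
print(f"xi (quadratic) = {xi_quadratic:.3f}, xi (noise) = {xi_noise:.3f}")
```

Xi is close to 1 whenever the target is a function of the feature, monotonic or not, and close to 0 when the two are independent. Note that xi is asymmetric: it measures how well the target can be predicted from the feature, not the reverse.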

What’s next?

NB03 - Feature Engineering covers:

Constructing DataPipelineConfig from the profile above
Lag and rolling-window feature generation
Temporal and exogenous feature encoding
Train / validation / test splitting with TimeBasedCV

The forecastability profile built here will be used directly to justify every parameter choice in NB03.
