Feature Engineering#

What you’ll build

A fully-engineered feature matrix from the MLVS-PT net-load signal - including lag features, rolling-window statistics, cyclic calendar encodings, and Fourier terms - wired into a DataPipelineConfig that any Twiga forecaster can consume directly.

Prerequisites

Basic Python (lists, dicts, imports)
NB01 - Getting Started
NB02 - Forecastability Analysis (recommended - provides the parameter rationale)

Learning objectives

By the end of this notebook you will be able to:

Add cyclic calendar signals (hour, weekday, month, day/night) using TemporalFeatureTransformer
Build lag features and rolling-window statistics using AutoregressTransformer
Rank and select the most informative features using select_top_features
Encode cyclic variables with Fourier (sin/cos) terms to avoid discontinuities
Wire all feature choices into a DataPipelineConfig and pass it to DataPipeline

The five-step workflow

Raw data  →  Temporal features  →  AR features  →  Feature selection  →  DataPipelineConfig
(parquet)    (calendar signals)    (lags, windows)   (rank & filter)       (pipeline-ready)

Each section in this notebook maps to one step in the pipeline - by the end you will have a complete, reproducible feature-engineering configuration.

1. Setup#

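The setup cell draws on several groups of libraries:

warnings: suppress warning messages (including noisy deprecation notices) so output stays readable.
great_tables / lets_plot: rendering styled tables and interactive plots inside the notebook.
numpy / pandas: array maths and tabular data manipulation.
scikit-learn scalers: RobustScaler and StandardScaler; RobustScaler is used for input and target scaling in Section 6.
Twiga utilities: configure() sets up logging; get_logger() gives us a labelled logger for clean output.
Twiga feature transformers: TemporalFeatureTransformer adds calendar features; AutoregressTransformer builds lag and rolling-window columns. Both are imported in Sections 2 and 3, where they are first used.
Twiga plot helpers: thin wrappers around LetsPlot tuned for time series work. This cell imports plot_timeseries, plot_metrics_bar, and plot_acf; line_plot, scatter_plot, and scatter_matrix are imported in Section 9.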

import warnings

warnings.filterwarnings("ignore")

from great_tables import GT
from lets_plot import LetsPlot, gggrid, ggsize
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

LetsPlot.setup_html()

from twiga.core.plot import (
    plot_acf,
    plot_metrics_bar,
    plot_timeseries,
)
from twiga.core.plot.gt import twiga_gt
from twiga.core.utils import configure, get_logger

configure()
log = get_logger("tutorials")

# Load dataset — only the columns we need
COLUMNS = ["timestamp", "NetLoad(kW)", "Ghi", "Temperature"]

raw = pd.read_parquet("../data/MLVS-PT.parquet", columns=COLUMNS)
raw["timestamp"] = pd.to_datetime(raw["timestamp"])

# Filter to the study period
data = raw[(raw["timestamp"] >= "2019-01-01") & (raw["timestamp"] <= "2020-12-31")].copy()
data = data.reset_index(drop=True)

log.info(f"Shape : {data.shape}")
log.info(f"Period: {data['timestamp'].min()} -> {data['timestamp'].max()}")
twiga_gt(GT(data.head()))

2. Temporal Features#

Key concept - Calendar features

A time series model sees numbers, not timestamps. Calendar features translate the human notion of time (“it’s 8 am on a Monday”) into numbers a model can use as inputs. The trick is to encode cyclic variables (hour 23 is close to hour 0, not far from it) using sin/cos pairs so the model sees the circular structure rather than a linear scale.

TemporalFeatureTransformer adds calendar signals to a DataFrame. It supports the following calendar features:

Feature	Type	Description
`hour`	trigonometric (sin/cos)	Hour of day (0–23)
`wday`	trigonometric (sin/cos)	Day of week (0–6)
`month`	trigonometric (sin/cos)	Month of year (1–12)
`day_night`	binary	1 = daytime, 0 = night

day_night and solar geometry: Twiga uses the astral library to compute accurate sunrise/sunset times per day for the specified location.
Latitude and longitude are therefore required whenever day_night is listed in calendar_features. The Madeira Island coordinates below correspond to the MLVS-PT measurement site.

from great_tables import GT, md
import pandas as pd

from twiga.core.plot.gt import twiga_gt

temporal_ref = pd.DataFrame(
    {
        "Feature": ["`hour`", "`wday`", "`month`", "`day_night`"],
        "Type": ["trigonometric (sin/cos)", "trigonometric (sin/cos)", "trigonometric (sin/cos)", "binary"],
        "Description": ["Hour of day (0–23)", "Day of week (0–6)", "Month of year (1–12)", "1 = daytime, 0 = night"],
        "Why it matters": [
            "Captures the daily demand cycle — morning peak, midday lull, evening peak",
            "Captures weekend vs. weekday demand differences",
            "Captures seasonal heating/cooling patterns across the year",
            "Direct proxy for solar generation (GHI ≈ 0 at night)",
        ],
    }
)

twiga_gt(
    GT(temporal_ref)
    .tab_header(
        title=md("**Calendar Features — Reference Guide**"),
        subtitle="All cyclic features are encoded as sin/cos pairs to preserve circular structure",
    )
    .cols_label(
        Feature=md("**Feature**"),
        Type=md("**Type**"),
        Description=md("**Description**"),
        **{"Why it matters": md("**Why it matters**")},
    )
    .tab_source_note("Twiga Forecast · NB03 — Feature Engineering"),
    n_rows=len(temporal_ref),
)

Key concept - Temporal features

Raw timestamps carry rich information, but a model cannot interpret a datetime object directly. Temporal features extract the useful dimensions:

Calendar features (hour, weekday, month) - tell the model where in the daily or weekly cycle a sample falls. Net load is typically higher on weekday mornings than Sunday nights; a model with no hour feature cannot learn this.

Day/night flag - Madeira Island receives meaningful solar irradiance only during daylight. A binary day/night column lets even a simple tree model split on light vs. dark without needing to reason about solar angles.

Solar-angle features - if you pass latitude and longitude, TemporalFeatureTransformer computes the exact solar elevation angle for each timestamp. This is more precise than a fixed day/night cutoff and captures seasonal variation in sunrise/sunset times.

All of these are computed once during fit_transform() and then reproduced identically at prediction time - ensuring train/test consistency.
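
As a rough illustration of what "extracting the useful dimensions" means, the sketch below pulls the integer hour, weekday, and month out of a single timestamp using only the standard library. The helper name calendar_fields is hypothetical, and this is not a description of how TemporalFeatureTransformer is implemented; it only shows the raw values that a cyclic encoding would start from.

```python
from datetime import datetime


def calendar_fields(ts: datetime) -> dict[str, int]:
    """Return the integer calendar fields of one timestamp (hypothetical helper)."""
    return {
        "hour": ts.hour,       # 0-23
        "wday": ts.weekday(),  # 0 = Monday ... 6 = Sunday
        "month": ts.month,     # 1-12
    }


# 15 June 2020 was a Monday
fields = calendar_fields(datetime(2020, 6, 15, 8, 30))
print(fields)
```

Note that the half-hour is lost when only the integer hour is kept; for 30-minute data a fractional hour (8.5 here) would preserve it, at the cost of a slightly different encoding input.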

from twiga.core.data.temporal import TemporalFeatureTransformer

temporal = TemporalFeatureTransformer(
    latitude=32.371666,
    longitude=-16.274998,
    calendar_features=["hour", "day_night", "wday", "month"],
)

data_with_temporal = temporal.fit_transform(data.copy())

# Which columns were added?
original_cols = set(data.columns)
new_cols = [c for c in data_with_temporal.columns if c not in original_cols]
log.info("Original columns : %s", list(original_cols))
log.info("New columns added: %s", new_cols)

# Inspect the first few rows of the new temporal columns
data_with_temporal[["timestamp", "NetLoad(kW)"] + new_cols].head(8)

# Visualise day_night assignment over a single week
one_week = data_with_temporal[
    (data_with_temporal["timestamp"] >= "2020-06-15") & (data_with_temporal["timestamp"] < "2020-06-22")
].copy()

p = plot_timeseries(
    one_week,
    y_cols=["NetLoad(kW)"],
    date_col="timestamp",
    band_col="day_night",
    band_labels={0: "Night", 1: "Day"},
    title="NetLoad(kW) with day/night context — one week (Jun 2020)",
    y_label="NetLoad (kW)",
    x_label="Date",
)
p

3. Autoregressive Features#

Key concept - Lag features

A lag feature is simply the value of the target variable from some number of steps in the past. If electricity demand follows a daily pattern, then “what was demand exactly 24 hours ago?” is an extremely useful input to predict “what will demand be now?”. Lag features let a machine-learning model exploit this temporal memory without you having to design a recurrence mechanism from scratch.

Key concept - Rolling windows

A rolling-window feature summarises recent history into a single number - the mean, standard deviation, or another aggregate over a sliding window. Where a lag feature answers “what was the value at time t−k?”, a rolling mean answers “what was the typical value over the last k steps?”. Rolling statistics smooth out noise and capture trend shifts that individual lags might miss.

AutoregressTransformer creates two families of features:

Lag features: value of the target n_samples * lag timesteps ago
Rolling features: aggregate (mean, std, …) over a window of n_samples * window timesteps

Understanding the `n_samples` multiplier#

With 30-minute data there are 48 samples per day (n_samples=48).
Specifying lags=[1, 2, 7] therefore produces lags at:

`lag` value	Actual shift	Meaning
1	1 × 48 = 48 steps	1 day ago
2	2 × 48 = 96 steps	2 days ago
7	7 × 48 = 336 steps	7 days ago (same weekday)

Similarly, windows=[1, 2, 7] produces rolling windows of 48, 96, and 336 steps.
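
The arithmetic in the table above can be checked with a few lines of standard-library Python. This is only a sketch of the multiplication, assuming the sampling interval is known; the variable names are illustrative and not part of the Twiga API.

```python
from datetime import timedelta

period = timedelta(minutes=30)           # sampling interval of the data
n_samples = timedelta(days=1) // period  # samples per day

lags_in_days = [1, 2, 7]
shifts = {lag: lag * n_samples for lag in lags_in_days}
print(n_samples, shifts)  # 48 {1: 48, 2: 96, 7: 336}
```

For hourly data the same calculation would give n_samples = 24, so the same lags=[1, 2, 7] would shift by 24, 48, and 168 steps.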

from great_tables import GT, md
import pandas as pd

from twiga.core.plot.gt import twiga_gt

lag_ref = pd.DataFrame(
    {
        "lag value": ["1", "2", "7"],
        "Actual shift (30-min data)": ["1 × 48 = 48 steps", "2 × 48 = 96 steps", "7 × 48 = 336 steps"],
        "Meaning": ["1 day ago", "2 days ago", "7 days ago (same weekday)"],
        "Why useful": [
            "Yesterday's pattern is the strongest single predictor for today",
            "Short-term trend — captures whether demand is rising or falling",
            "Same time last week — accounts for weekly seasonality",
        ],
    }
)

twiga_gt(
    GT(lag_ref)
    .tab_header(
        title=md("**Lag Multiplier Reference — 30-min data, `n_samples=48`**"),
        subtitle="Each lag value is multiplied by n_samples to get the actual timestep shift",
    )
    .cols_label(
        **{
            "lag value": md("**`lag` value**"),
            "Actual shift (30-min data)": md("**Actual shift**"),
            "Meaning": md("**Meaning**"),
            "Why useful": md("**Why useful**"),
        }
    )
    .tab_source_note("Twiga Forecast · NB03 — Feature Engineering"),
    n_rows=len(lag_ref),
)

Key concept - Lag features and rolling windows

Tree-based models (LightGBM, XGBoost, CatBoost) cannot process sequences - they see one flat feature vector per sample. To give them access to the past, we engineer explicit historical features:

Lag features - copy the target value from n_samples × lag steps ago. For 30-min data, lag=1 means 1 day ago (48 steps), lag=7 means 1 week ago (336 steps). The forecastability analysis in NB02 showed strong ACF peaks at lags 48 and 336 - those are the lags worth including.

Rolling-window statistics - summarise a window of recent values into a single number (mean, std, min, max). A rolling mean smooths short-term noise; a rolling std captures volatility. Window size should be at least as large as the dominant seasonal period.

Both features are aligned so that at time t the model only sees values from before time t - there is no look-ahead leakage. AutoregressTransformer handles this alignment automatically.
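
To make the alignment concrete, here is a minimal pandas sketch of one lag and one rolling mean on synthetic 30-minute data. The column names (y, lag_48, rolling_48_mean) are chosen for illustration and may not match what AutoregressTransformer produces; the point is only that shifting before rolling keeps the current value out of its own window.

```python
import numpy as np
import pandas as pd

n_samples = 48  # 30-min data: 48 steps per day
idx = pd.date_range("2019-01-01", periods=10 * n_samples, freq="30min")
df = pd.DataFrame({"y": np.arange(len(idx), dtype=float)}, index=idx)

# Lag feature: the value exactly one day earlier
df["lag_48"] = df["y"].shift(n_samples)

# Rolling mean over the previous day; shift(1) first so that the window
# ends at t-1 and never includes the current value
df["rolling_48_mean"] = df["y"].shift(1).rolling(n_samples).mean()

# Drop the warm-up rows that do not yet have a full history
features = df.dropna()
print(len(df), len(features))
```

Without the shift(1), the rolling window would end at time t and include the very value the model is asked to predict.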

from twiga.core.data import AutoregressTransformer

auto_res = AutoregressTransformer(
    n_samples=48,
    lags=[1, 2, 7],  # 1 day, 2 days, 7 days ago
    windows=[1, 2, 7],  # rolling over 1, 2, 7 days
    window_funcs=["mean", "std"],
    value_column="NetLoad(kW)",
)

data_with_ar = auto_res.fit_transform(data.copy())

log.info(f"Rows before: {len(data):,}  |  Rows after: {len(data_with_ar):,}")
log.info(f"(dropped {len(data) - len(data_with_ar):,} rows to remove NaN warm-up period)")

# Inspect the lag columns generated
lag_cols = [c for c in data_with_ar.columns if "lag" in c]
rolling_cols = [c for c in data_with_ar.columns if "rolling" in c]

log.info("Lag columns    : %s", lag_cols)
log.info("Rolling columns: %s", rolling_cols)

# Show values for a few rows to verify alignment
data_with_ar[["timestamp", "NetLoad(kW)"] + lag_cols + rolling_cols[:4]].head(6)

from twiga.core.data.relevance import AssociationAnalyzer

analyzer = AssociationAnalyzer()

# Candidate features: exogenous variables plus the engineered lag and rolling columns.
# "hour" and "day_night" are kept only if they exist in data_with_ar (see the filter below).
feature_cols = ["Ghi", "Temperature", "hour", "day_night"] + lag_cols + rolling_cols

# Keep only columns that actually exist after transformation
feature_cols = [c for c in feature_cols if c in data_with_ar.columns]

# One bar chart per association measure, each scored against NetLoad(kW)
plots = []
for method in ["pearson", "spearman", "kendall", "xicor", "pps", "mi", "anova"]:
    assoc_df = analyzer.compute(
        data=data_with_ar,
        variable_cols=feature_cols,
        target_col="NetLoad(kW)",
        method=method,
    )

    p = plot_metrics_bar(
        assoc_df,
        metric_col="score",  # the 'score' column returned by AssociationAnalyzer
        model_col="feature",  # the 'feature' column (lag / rolling / exogenous names)
        lower_is_better=False,
        title=f"{method.upper()}",
        x_label="Score",
        horizontal=True,
        font_size=10,
    )
    plots.append(p)

gggrid(plots, ncol=2) + ggsize(1220, 1000)

# Quick correlation bar chart — lag and rolling features vs target
corr_data = data_with_ar[["NetLoad(kW)"] + lag_cols + rolling_cols].dropna()
corr_vals = corr_data.corr()["NetLoad(kW)"].drop("NetLoad(kW)").reset_index()
corr_vals.columns = ["Model", "Correlation"]

p = plot_metrics_bar(
    corr_vals,
    metric_col="Correlation",
    model_col="Model",
    lower_is_better=False,
    title="Pearson correlation of AR features with NetLoad(kW)",
    x_label="Correlation",
    horizontal=True,
)
p

4. Feature Selection#

Key concept - Feature matrix

After applying temporal and autoregressive transformers, we have a feature matrix - a table where each row is one timestep and each column is one input signal the model will see. Feature matrices can easily have 50–200 columns, many of which are redundant or noisy. Feature selection trims this down to the most informative subset, reducing overfitting risk and training time.

select_top_features ranks candidate features using a multi-metric ensemble:

Pearson correlation (absolute value)
ANOVA F-score (linear separability)
Mutual information (non-linear dependency)
Random Forest importance (tree-based, captures interactions)

Individual ranks are aggregated via Borda count and the top-k features are returned.
This multi-metric approach is more robust than any single score, especially for non-linear time series patterns.
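
The Borda-count idea can be sketched in plain Python. The function borda_rank and the scores below are made up for illustration; the exact scoring and tie-breaking inside select_top_features may differ.

```python
def borda_rank(scores_by_metric: dict[str, dict[str, float]], top_k: int) -> list[str]:
    """Aggregate per-metric scores with a Borda count (higher score = better)."""
    points: dict[str, int] = {}
    for scores in scores_by_metric.values():
        # Within each metric the worst feature gets 0 points, the best gets n - 1
        ordered = sorted(scores, key=scores.get)
        for pts, feature in enumerate(ordered):
            points[feature] = points.get(feature, 0) + pts
    # Sort by total points, breaking ties alphabetically for determinism
    return sorted(points, key=lambda f: (-points[f], f))[:top_k]


# Made-up scores, for illustration only
scores = {
    "pearson": {"lag_48": 0.80, "lag_96": 0.60, "lag_336": 0.75},
    "mutual_info": {"lag_48": 0.50, "lag_96": 0.20, "lag_336": 0.55},
    "rf_importance": {"lag_48": 0.45, "lag_96": 0.15, "lag_336": 0.40},
}
top = borda_rank(scores, top_k=2)
print(top)  # ['lag_48', 'lag_336']
```

Because only ranks are summed, metrics on very different scales (a correlation in [0, 1] versus an unbounded F-score) contribute equally, which is the main reason to aggregate ranks rather than raw scores.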

from twiga.core.data import select_top_features

# Drop rows with NaN before selection
ar_clean = data_with_ar.dropna().copy()

lag_cols = [c for c in ar_clean.columns if "lag" in c]
rolling_cols = [c for c in ar_clean.columns if "rolling" in c]

top_lags = select_top_features(
    data=ar_clean,
    features=lag_cols,
    target="NetLoad(kW)",
    top_k=2,
)

top_rolling = select_top_features(
    data=ar_clean,
    features=rolling_cols,
    target="NetLoad(kW)",
    top_k=2,
)

log.info("Top lag features    : %s", top_lags)
log.info("Top rolling features: %s", top_rolling)

# Visualise the selected lag features side-by-side with the target
sample = ar_clean[(ar_clean["timestamp"] >= "2019-03-01") & (ar_clean["timestamp"] < "2019-03-08")].copy()

p = plot_timeseries(
    sample,
    y_cols=["NetLoad(kW)"] + top_lags,
    date_col="timestamp",
    title="Top selected lag features vs target (one week)",
    y_label="kW",
    x_label="Date",
    series_line_size=0.8,
    fig_size=(960, 300),
)
p

5. Fourier Features#

Calendar variables like hour are cyclic: hour 23 is closer to hour 0 than to hour 12, but a plain integer does not express this.
Fourier encoding projects each value onto a unit circle:

\[ \text{hour\_sin} = \sin\!\left(\frac{2\pi \cdot h}{24}\right), \quad \text{hour\_cos} = \cos\!\left(\frac{2\pi \cdot h}{24}\right) \]

A third convenience column hour_cosin = hour_sin + hour_cos is also added.

TemporalFeatureTransformer applies this automatically for trigonometric features (hour, wday, month).
add_fourier_features lets you do it manually for any column.

Key concept - Fourier encoding for cyclic variables

Calendar variables like hour (0–23) and month (1–12) are cyclic: hour 23 is just one step away from hour 0, but the integer 23 is far from 0 in Euclidean space. A tree model or neural network that receives raw integers will not naturally understand this wrap-around.

Fourier encoding projects each value onto a unit circle using sine and cosine:
sin_value = sin(2π × value / period)
cos_value = cos(2π × value / period)
For hour with period 24, hours 23 and 0 land on neighbouring points of the circle, exactly as far apart as hours 0 and 1. The model receives two continuous numbers (sin, cos) per cyclic feature instead of a raw integer - the circular distance is now correctly represented.

Use Fourier encoding whenever a calendar variable has natural wrap-around: hour (period 24), weekday (period 7), month (period 12), day-of-year (period 365).
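
The effect on distances can be verified with the standard library. The helper names fourier_encode and circle_distance are hypothetical and independent of add_fourier_features; the sketch only applies the sin/cos formulas above.

```python
import math


def fourier_encode(value: float, period: float) -> tuple[float, float]:
    """Map a cyclic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a: float, b: float, period: float) -> float:
    """Euclidean distance between two encoded values."""
    return math.dist(fourier_encode(a, period), fourier_encode(b, period))


d_23_0 = circle_distance(23, 0, period=24)  # across the midnight wrap-around
d_1_0 = circle_distance(1, 0, period=24)    # ordinary one-hour step
d_12_0 = circle_distance(12, 0, period=24)  # opposite sides of the day
print(round(d_23_0, 3), round(d_1_0, 3), round(d_12_0, 3))
```

The first two distances are equal, and the third is the largest possible (the diameter of the unit circle), which is the behaviour a raw integer hour cannot provide.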

from twiga.core.data import add_fourier_features

# We start from data_with_temporal which already has the 'hour' integer column
data_fourier = add_fourier_features(
    data_with_temporal.copy(),
    calendar_variables=["hour"],
    periods=[24],
)

fourier_cols = ["hour", "hour_sin", "hour_cos", "hour_cosin"]
log.info("Fourier columns added:")
data_fourier[fourier_cols].drop_duplicates(subset="hour").sort_values("hour").head(6)

# Visualise the sin/cos encoding of hour over one full day
one_day = data_fourier[fourier_cols].drop_duplicates(subset="hour").sort_values("hour").copy()
one_day["hour_step"] = one_day["hour"].astype(int)

p = plot_timeseries(
    one_day,
    y_cols=["hour_sin", "hour_cos", "hour_cosin"],
    date_col="hour_step",
    title="Fourier encoding of 'hour' (period = 24)",
    y_label="Fourier value",
    x_label="Hour of day",
    fig_size=(300, 250),
    series_line_size=0.8,
)
p

6. Connecting to `DataPipelineConfig`#

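DataPipelineConfig is a Pydantic model that acts as the single source of truth for everything the pipeline needs to know.
Pre-computed columns, such as the features selected in Section 4, can be listed in historical_features (Option B in the code comments below). The example here instead lets the pipeline compute lags and rolling windows itself (Option A).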

Key concept - The feature matrix

DataPipeline transforms a raw DataFrame into a pair of 3-D NumPy arrays:

X - shape (n_samples, lookback_window_size, n_features) - the input tensor fed to the model. Each sample is a window of lookback_window_size timesteps, each with n_features values (target history + calendar + exogenous + lag + rolling columns).

y - shape (n_samples, forecast_horizon, n_targets) - the corresponding future target values the model must predict.

This 3-D format is the universal interface between feature engineering and all Twiga models (ML and NN alike). DataPipeline.fit() learns the scalers from training data; DataPipeline.transform() applies them to any split without refitting - preventing data leakage.
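
The fit-on-train, transform-everywhere rule can be shown with NumPy alone. The sketch below mimics median/IQR scaling, which is what scikit-learn's RobustScaler does with its default settings; the data is synthetic and the code is not taken from DataPipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=20.0, size=1000)
test = rng.normal(loc=120.0, scale=20.0, size=200)

# "fit": learn the statistics from the training split only
median = np.median(train)
q25, q75 = np.percentile(train, [25, 75])
iqr = q75 - q25

# "transform": reuse the same statistics on every split
train_scaled = (train - median) / iqr
test_scaled = (test - median) / iqr

print(round(float(np.median(train_scaled)), 6), float(np.median(test_scaled)) > 0)
```

The scaled test median is not zero, and that is intended: the shift in the test distribution stays visible instead of being absorbed by statistics computed on data the model should never have seen.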

from twiga.core.config import DataPipelineConfig

# Two ways to specify autoregressive features in DataPipelineConfig:
#
#   Option A — let the pipeline compute them (standard path with TwigaForecaster):
#     lags=[1,7,48], windows=[1,7], window_funcs=["mean"]
#     → AutoregressTransformer runs internally; pass raw DataFrames to fit/predict
#
#   Option B — pre-compute externally and name the columns:
#     historical_features=["lag_48", "lag_336", ...]
#     → you must pass the enriched DataFrame (with those columns) everywhere
#
# For TwigaForecaster (the standard path) always use Option A.
data_config = DataPipelineConfig(
    target_feature="NetLoad(kW)",
    period="30min",
    latitude=32.371666,
    longitude=-16.274998,
    calendar_features=["hour", "day_night"],
    exogenous_features=["Ghi"],
    lags=[1, 7, 48],  # pipeline computes lag_48, lag_336, lag_2304
    windows=[1, 7],  # rolling windows of 48 and 336 steps
    window_funcs=["mean"],
    forecast_horizon=48,
    lookback_window_size=96,
    input_scaler=RobustScaler(),
    target_scaler=RobustScaler(),
)

log.info("\n%s", data_config.model_dump(exclude={"input_scaler", "target_scaler"}))

# Summarise what the config holds
log.info("=== DataPipelineConfig summary ===")
log.info(f"  Target        : {data_config.target_feature}")
log.info(f"  Period        : {data_config.period}")
log.info(f"  Horizon       : {data_config.forecast_horizon} steps")
log.info(f"  Lookback      : {data_config.lookback_window_size} steps")
log.info(f"  Calendar feats: {data_config.calendar_features}")
log.info(f"  Exogenous     : {data_config.exogenous_features}")
log.info(f"  Historical    : {data_config.historical_features}")
log.info(f"  Input scaler  : {type(data_config.input_scaler).__name__}")
log.info(f"  Target scaler : {type(data_config.target_scaler).__name__}")

Validation at construction time#

Because DataPipelineConfig is a Pydantic model, invalid values are rejected immediately. Uncomment the cell below to see the validation error:

# Uncomment to see Pydantic validation in action
# from twiga.core.config import DataPipelineConfig
# bad_config = DataPipelineConfig(
#     target_feature="NetLoad(kW)",
#     period="not-a-period",   # <-- invalid
#     forecast_horizon=48,
#     lookback_window_size=96,
# )

7. How `DataPipeline` transforms internally#

DataPipeline is a scikit-learn TransformerMixin that builds a processing chain from a DataPipelineConfig.
Internally it:

Runs TemporalFeatureTransformer (if calendar_features are set)
Runs AutoregressTransformer (if lags or windows are set)
Applies input_scaler to numerical features and target_scaler to the target
Slices the scaled data into overlapping (lookback, horizon) sequence windows

The output X has shape (n_samples, lookback_window_size, n_features) and
y has shape (n_samples, forecast_horizon, n_targets) - ready to feed directly into a neural network.
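
The windowing step can be sketched with NumPy's sliding_window_view. The function make_windows is hypothetical and assumes a stride of one step between consecutive windows; DataPipeline may use a different stride or feature ordering.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(features: np.ndarray, target: np.ndarray, lookback: int, horizon: int):
    """Slice (T, n_features) and (T,) arrays into overlapping X / y windows."""
    n_windows = len(target) - lookback - horizon + 1
    # sliding_window_view puts the window axis last: (T - lookback + 1, n_features, lookback)
    X = sliding_window_view(features, lookback, axis=0)[:n_windows]
    X = X.transpose(0, 2, 1)  # -> (n_windows, lookback, n_features)
    # Each y window starts right after its lookback window ends
    y = sliding_window_view(target[lookback:], horizon)[:n_windows]
    return X, y[..., np.newaxis]  # -> (n_windows, horizon, 1)


T, n_features = 500, 4
features = np.arange(T * n_features, dtype=float).reshape(T, n_features)
target = features[:, 0]
X, y = make_windows(features, target, lookback=96, horizon=48)
print(X.shape, y.shape)  # (357, 96, 4) (357, 48, 1)
```

With 500 timesteps, a lookback of 96, and a horizon of 48, there are 500 - 96 - 48 + 1 = 357 complete windows; the last lookback value of each window is immediately followed by the first horizon value.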

from twiga.core.data import DataPipeline

# DataPipeline takes the same individual parameters as DataPipelineConfig fields.
# Using lags/windows/window_funcs means the pipeline creates AR features internally
# — pass the raw DataFrame (no pre-computation needed).
pipeline = DataPipeline(
    target_feature=data_config.target_feature,
    period=data_config.period,
    lookback_window_size=data_config.lookback_window_size,
    forecast_horizon=data_config.forecast_horizon,
    latitude=data_config.latitude,
    longitude=data_config.longitude,
    calendar_features=data_config.calendar_features,
    exogenous_features=data_config.exogenous_features,
    lags=data_config.lags,
    windows=data_config.windows,
    window_funcs=data_config.window_funcs,
    input_scaler=data_config.input_scaler,
    target_scaler=data_config.target_scaler,
)

# Pass raw train_df — pipeline handles all feature computation internally
train_df = data[data["timestamp"] < "2020-01-01"].copy()
pipeline.fit(train_df)
log.info("Pipeline fitted successfully.")

X, y = pipeline.transform(train_df)

log.info("X shape: %s", X.shape)  # (n_samples, lookback_window_size, n_features)
log.info("y shape: %s", y.shape)  # (n_samples, forecast_horizon, n_targets)
log.info("")
log.info(f"Each sample provides {X.shape[1]} lookback steps and {y.shape[1]} forecast steps.")
log.info(f"The model sees {X.shape[2]} input features per timestep.")

# Visualise a single training window
SAMPLE_IDX = 100  # arbitrary sample

lookback_vals = X[SAMPLE_IDX, :, 0].tolist()
horizon_vals = y[SAMPLE_IDX, :, 0].tolist()

n_lookback = X.shape[1]
n_horizon = y.shape[1]

window_df = pd.DataFrame(
    {
        "step": list(range(n_lookback)) + list(range(n_lookback, n_lookback + n_horizon)),
        "lookback": lookback_vals + [None] * n_horizon,
        "horizon": [None] * n_lookback + horizon_vals,
    }
)

p = plot_timeseries(
    window_df,
    y_cols=["lookback", "horizon"],
    date_col="step",
    title=f"Training window #{SAMPLE_IDX}: lookback ({n_lookback}) + horizon ({n_horizon})",
    y_label="Scaled value",
    x_label="Timestep",
    fig_size=(900, 300),
)
p

8. Summary#

In this notebook you learned how to:

Step	Tool	Output
Temporal features	`TemporalFeatureTransformer`	hour_sin/cos, wday_sin/cos, month_sin/cos, day_night
Autoregressive features	`AutoregressTransformer`	lag_NNN, rolling_NNN_mean/std
Feature selection	`select_top_features`	top-k ranked feature names
Fourier encoding	`add_fourier_features`	sin/cos columns for any cyclic variable
Pipeline wiring	`DataPipelineConfig` + `DataPipeline`	3-D arrays (n_samples, lookback, features) ready for a model

from great_tables import GT, md

from twiga.core.plot.gt import twiga_gt

summary_df = pd.DataFrame(
    {
        "Step": [
            "Temporal features",
            "Autoregressive features",
            "Feature selection",
            "Fourier encoding",
            "Pipeline wiring",
        ],
        "Tool": [
            "`TemporalFeatureTransformer`",
            "`AutoregressTransformer`",
            "`select_top_features`",
            "`add_fourier_features`",
            "`DataPipelineConfig` + `DataPipeline`",
        ],
        "Output": [
            "hour_sin/cos, wday_sin/cos, month_sin/cos, day_night",
            "lag_NNN, rolling_NNN_mean/std",
            "top-k ranked feature names",
            "sin/cos columns for any cyclic variable",
            "3-D arrays (n_samples, lookback, features) ready for a model",
        ],
        "Config parameter": [
            "`calendar_features`, `latitude`, `longitude`",
            "`lags`, `windows`, `window_funcs`",
            "`historical_features`",
            "Apply before passing to pipeline",
            "`DataPipelineConfig` fields",
        ],
    }
)

twiga_gt(
    GT(summary_df)
    .tab_header(
        title=md("**NB03 — Feature Engineering Summary**"),
        subtitle="Five steps from raw DataFrame to model-ready arrays",
    )
    .cols_label(
        Step=md("**Step**"),
        Tool=md("**Tool**"),
        Output=md("**Output**"),
        **{"Config parameter": md("**Config parameter**")},
    )
    .tab_source_note("Twiga Forecast · NB03 — Feature Engineering"),
    n_rows=len(summary_df),
)

9. EDA: Visual Feature Inspection#

Before passing engineered features to a model it is worth spending a few minutes on exploratory visualisation. The three helpers below give fast answers to common pre-modelling questions:

Question	Tool
Does the raw signal look as expected?	`line_plot` - any 1-D window, no DataFrame needed
Do my covariates correlate with the target?	`scatter_plot` - colour-encoded scatter + LOESS trend
Are lag features jointly informative?	`scatter_matrix` - pair-plot grid across features and target

All three accept the same Twiga theme parameters (font_size, grid, legend_pos, …) as every other plotting utility in the library.

from twiga.core.plot import line_plot, scatter_matrix, scatter_plot

`line_plot`: one-week load profile#

line_plot accepts a plain Python list or 1-D NumPy array - no DataFrame required. It is the fastest way to sanity-check a raw signal, inspect seasonality, or verify that a transformation (differencing, scaling) behaved as expected.

When to use it: quick visual check on any univariate sequence before or after feature engineering.

# line_plot takes a plain 1-D sequence — no DataFrame needed
sample_week = data_with_temporal[
    (data_with_temporal["timestamp"] >= "2019-06-01") & (data_with_temporal["timestamp"] < "2019-06-08")
]["NetLoad(kW)"].values

p = line_plot(
    x=None,
    y=sample_week,
    title="Net Load — one week (Jun 2019)",
    y_label="Net Load (kW)",
    x_label="30-min step",
    fig_size=(900, 300),
)
p

`scatter_plot`: GHI vs Net Load by day/night#

scatter_plot renders a 2-D scatter with optional colour grouping and a LOESS smoothing curve. It is ideal for assessing whether a covariate has a meaningful (possibly non-linear) relationship with the target before committing to feature selection.

The colour-grouping column (passed as color_col in the call below) should be a string column. Numeric flags like day_night should be mapped first - see the .assign(group=…) call below.

When to use it: check covariate-target correlations and detect regime differences (e.g. day vs night, weekday vs weekend) that a model should learn.

scatter_df = (
    data_with_temporal[["Ghi", "NetLoad(kW)", "day_night"]]
    .dropna()
    .rename(columns={"Ghi": "cycle", "NetLoad(kW)": "value", "day_night": "group"})
    .assign(group=lambda df: df["group"].map({0: "Night", 1: "Day"}))
)

p = scatter_plot(
    scatter_df,
    x_col="cycle",
    y_col="value",
    color_col="group",
    title="GHI vs Net Load — coloured by day / night",
)
p
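
A scatter plot shows regime differences visually; a per-group correlation puts a number on them. The sketch below uses plain pandas on synthetic data to compute the Pearson correlation between irradiance and load separately for day and night. The column names, coefficients, and noise levels are invented for illustration and are not estimates from the MLVS-PT data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000
day_night = rng.integers(0, 2, n)  # 0 = night, 1 = day (synthetic flag)

# Synthetic irradiance: large by day, near zero at night.
ghi = np.where(day_night == 1, rng.uniform(50, 900, n), rng.uniform(0, 5, n))
# Synthetic net load: falls with irradiance by day, unrelated to it at night.
net_load = np.where(day_night == 1, 40 - 0.03 * ghi, 40.0) + rng.normal(0, 3, n)

df = pd.DataFrame({"ghi": ghi, "net_load": net_load, "day_night": day_night})
# Map the numeric flag to string labels, as in the scatter_plot cell above.
df["group"] = df["day_night"].map({0: "Night", 1: "Day"})

# Pearson correlation within each regime, then over the whole sample.
per_regime = {
    name: float(g["ghi"].corr(g["net_load"])) for name, g in df.groupby("group")
}
overall = float(df["ghi"].corr(df["net_load"]))

print("per regime:", per_regime)
print("overall:", overall)
```

A strong correlation in one regime and a weak one in the other suggests the covariate is only useful together with the regime flag.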

`scatter_matrix`: lag features vs target#

scatter_matrix renders a pair-plot grid across the selected feature columns and the target. Each off-diagonal cell is a scatter; the diagonal shows the marginal distribution. Colour groups (e.g. day / night) make regime-specific patterns immediately visible.

Use this after AutoregressTransformer to confirm that lag features carry signal and to spot multicollinearity between lags before training.

When to use it: validate that autoregressive features are jointly informative and identify any redundant lags you might want to drop.

matrix_df = (
    data_with_ar.join(data_with_temporal[["day_night"]], how="left")
    .dropna()
    .assign(group=lambda df: df["day_night"].map({0: "Night", 1: "Day"}))
)

lag_cols = [c for c in matrix_df.columns if "_lag_" in c][:3]

p = scatter_matrix(
    matrix_df,
    variables=lag_cols,
    targets=["NetLoad(kW)"],
    hue_col="group",
    n_sample=1000,
    title="Lag features vs Net Load",
    sort_by_variance=True,
    diag="density",
    fig_size=(500, 500),
)
p
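
To back up the visual check for multicollinearity with numbers, one option is a plain correlation matrix over the lag columns. The sketch below is a minimal pandas-only version on a synthetic signal. The lag columns are built with Series.shift rather than AutoregressTransformer, and the column names and the 0.95 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic 30-min signal with a daily cycle (30 days x 48 steps); illustrative only.
steps_per_day = 48
t = np.arange(30 * steps_per_day)
rng = np.random.default_rng(1)
load = pd.Series(
    50 + 10 * np.sin(2 * np.pi * t / steps_per_day) + rng.normal(0, 1, t.size),
    name="target",
)

# Hypothetical lag columns: 1 step, 2 steps, and 12 steps (6 hours) back.
lags = {f"target_lag_{k}": load.shift(k) for k in (1, 2, 12)}
frame = pd.concat([load, pd.DataFrame(lags)], axis=1).dropna()

corr = frame.corr()
lag_cols = [c for c in frame.columns if "_lag_" in c]

# Flag pairs of lag columns whose absolute correlation exceeds the threshold.
threshold = 0.95
redundant = [
    (a, b)
    for i, a in enumerate(lag_cols)
    for b in lag_cols[i + 1 :]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print("redundant pairs:", redundant)
```

Pairs flagged as redundant carry nearly the same information, so dropping one of each pair is a reasonable first simplification to try.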

Wrapping up#

What you did

Added cyclic calendar signals (hour, weekday, month, day/night) using TemporalFeatureTransformer
Built lag features and rolling-window statistics using AutoregressTransformer
Ranked and selected the most informative features using select_top_features
Encoded cyclic variables with Fourier (sin/cos) terms to eliminate boundary discontinuities
Wired all feature choices into a DataPipelineConfig and verified the pipeline output shape

Key takeaways for beginners

Calendar features must be encoded cyclically - plain integer hours (0-23) tell the model that hour 23 and hour 0 are far apart, which is wrong. Sin/cos encoding places them next to each other on a circle.
Lag = yesterday at this time - on 30-min data, a one-day lag is 48 steps back (24 hours), not 1 step. Always convert your “intuitive” lag to steps by multiplying by n_samples (here 48, the number of 30-min steps in a day).
Rolling means smooth out noise - individual lag values can be noisy; rolling means give the model a stable picture of recent average demand.
Feature selection prevents overfitting - adding more features does not always help. select_top_features uses four complementary metrics to keep only columns that genuinely predict the target.
DataPipelineConfig is the single source of truth - all feature choices live in one Pydantic config object, making experiments reproducible and easy to share.
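
The first two takeaways can be checked with a few lines of NumPy. The sketch below encodes the 24 hours as sin/cos pairs and compares distances on the resulting circle, then converts a one-day lag into steps. It does not use TemporalFeatureTransformer, and the helper circle_distance and the variable names are hypothetical, introduced only for this illustration.

```python
import numpy as np

# Cyclic encoding: map each hour to a point on the unit circle.
hours = np.arange(24)
angle = 2 * np.pi * hours / 24
hour_sin, hour_cos = np.sin(angle), np.cos(angle)


def circle_distance(h1: int, h2: int) -> float:
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.hypot(hour_sin[h1] - hour_sin[h2], hour_cos[h1] - hour_cos[h2]))


d_23_0 = circle_distance(23, 0)   # adjacent across midnight
d_0_1 = circle_distance(0, 1)     # adjacent within the day
d_0_12 = circle_distance(0, 12)   # opposite sides of the day

print("23 -> 0:", d_23_0, "| 0 -> 1:", d_0_1, "| 0 -> 12:", d_0_12)

# Lag conversion: an "intuitive" lag in days becomes a lag in steps.
steps_per_day = 48  # 30-min data
lag_days = 1
lag_steps = lag_days * steps_per_day
print("one-day lag in steps:", lag_steps)
```

With integer hours the gap between 23 and 0 would be 23; in the encoded space it matches the gap between any other pair of neighbouring hours.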

What’s next?#

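NB04 - Time Series Differencing covers stationarity, first-order and seasonal differencing, and inversion. NB05 - ML Point Forecasting then shows how to pass DataPipelineConfig into TwigaForecaster, select an ML model (LightGBM, XGBoost, CatBoost, or Linear Regression), and evaluate point predictions using the built-in metrics module.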

# ruff: noqa: E501, E701, E702
from IPython.display import HTML

_TEAL = "#107591"
_TEAL_MID = "#069fac"
_TEAL_LIGHT = "#e8f5f8"
_TEAL_BEST = "#d0ecf1"
_TEXT_DARK = "#2d3748"
_TEXT_MUTED = "#718096"
_WHITE = "#ffffff"

steps = [
    {
        "num": "01",
        "title": "Getting Started",
        "desc": "Load data · configure pipeline · train LightGBM · evaluate",
        "tags": ["data", "config", "train"],
        "active": False,
    },
    {
        "num": "02",
        "title": "Forecastability Analysis",
        "desc": "Entropy · ACF · stationarity tests",
        "tags": ["entropy", "ACF", "stationarity"],
        "active": False,
    },
    {
        "num": "03",
        "title": "Feature Engineering",
        "desc": "Lag, rolling-window, and calendar features; feature matrix inspection",
        "tags": ["features", "lags", "windows", "calendar"],
        "active": True,
    },
    {
        "num": "04",
        "title": "Time Series Differencing",
        "desc": "Stationarity · first-order and seasonal differencing · inversion",
        "tags": ["differencing", "stationarity"],
        "active": False,
    },
    {
        "num": "05",
        "title": "ML Point Forecasting",
        "desc": "CatBoost · XGBoost · LightGBM · model comparison",
        "tags": ["catboost", "xgboost", "lightgbm"],
        "active": False,
    },
]
track_name = "Beginner Track"
footer = 'Next: handle non-stationarity in <span style="color:#107591;font-weight:600;">04 — Time Series Differencing</span>, then build your first multi-model comparison in <span style="color:#107591;font-weight:600;">05 — ML Point Forecasting</span>.'


def _b(t, bg, fg):
    return f'<span style="display:inline-block;background:{bg};color:{fg};font-size:10px;font-weight:600;padding:2px 7px;border-radius:10px;margin:2px 2px 0 0;">{t}</span>'


ch = ""
for i, s in enumerate(steps):
    a = s["active"]
    cb = _TEAL if a else _WHITE
    cbo = _TEAL if a else "#d1ecf1"
    nb = _TEAL_MID if a else _TEAL_LIGHT
    nf = _WHITE if a else _TEAL
    tf = _WHITE if a else _TEXT_DARK
    df = "#cce8ef" if a else _TEXT_MUTED
    bb = "#0d5f75" if a else _TEAL_BEST
    bf = "#b8e4ed" if a else _TEAL
    yh = (
        f'<span style="float:right;background:{_TEAL_MID};color:{_WHITE};font-size:10px;font-weight:700;padding:2px 10px;border-radius:12px;">★ you are here</span>'
        if a
        else ""
    )
    bdg = "".join(_b(t, bb, bf) for t in s["tags"])
    ch += f'<div style="background:{cb};border:2px solid {cbo};border-radius:12px;padding:16px 20px;display:flex;align-items:flex-start;gap:16px;box-shadow:{"0 4px 14px rgba(16,117,145,.25)" if a else "0 1px 4px rgba(0,0,0,.06)"};"><div style="min-width:44px;height:44px;background:{nb};color:{nf};border-radius:50%;display:flex;align-items:center;justify-content:center;font-size:15px;font-weight:800;flex-shrink:0;">{s["num"]}</div><div style="flex:1;"><div style="font-size:15px;font-weight:700;color:{tf};margin-bottom:4px;">{s["title"]}{yh}</div><div style="font-size:12.5px;color:{df};margin-bottom:8px;line-height:1.5;">{s["desc"]}</div><div>{bdg}</div></div></div>'
    if i < len(steps) - 1:
        ch += f'<div style="display:flex;justify-content:center;height:32px;"><svg width="24" height="32" viewBox="0 0 24 32" fill="none"><line x1="12" y1="0" x2="12" y2="24" stroke="{_TEAL_MID}" stroke-width="2" stroke-dasharray="4 3"/><polygon points="6,20 18,20 12,30" fill="{_TEAL_MID}"/></svg></div>'

HTML(
    f'<div style="font-family:Inter,\'Segoe UI\',sans-serif;max-width:640px;margin:8px 0;"><div style="background:linear-gradient(135deg,{_TEAL} 0%,{_TEAL_MID} 100%);border-radius:12px 12px 0 0;padding:14px 20px;display:flex;align-items:center;gap:10px;"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="{_WHITE}" stroke-width="2"><path d="M12 2L2 7l10 5 10-5-10-5z"/><path d="M2 17l10 5 10-5"/><path d="M2 12l10 5 10-5"/></svg><span style="color:{_WHITE};font-size:14px;font-weight:700;">Twiga Learning Path — {track_name}</span></div><div style="border:2px solid {_TEAL_LIGHT};border-top:none;border-radius:0 0 12px 12px;padding:20px 20px 16px;background:#f9fdfe;display:flex;flex-direction:column;">{ch}<div style="margin-top:16px;font-size:11.5px;color:{_TEXT_MUTED};text-align:center;border-top:1px solid {_TEAL_LIGHT};padding-top:12px;">{footer}</div></div></div>'
)