Feature Engineering

Author: Anthony Faustine, sambaiga@gmail.com


What you’ll build

A fully engineered feature matrix from the MLVS-PT net-load signal - including lag features, rolling-window statistics, cyclic calendar encodings, and Fourier terms - wired into a DataPipelineConfig that any Twiga forecaster can consume directly.

Prerequisites

  • Basic Python (lists, dicts, imports)

  • NB01 - Getting Started

  • NB02 - Forecastability Analysis (recommended - provides the parameter rationale)

Learning objectives

By the end of this notebook you will be able to:

  1. Add cyclic calendar signals (hour, weekday, month, day/night) using TemporalFeatureTransformer

  2. Build lag features and rolling-window statistics using AutoregressTransformer

  3. Rank and select the most informative features using select_top_features

  4. Encode cyclic variables with Fourier (sin/cos) terms to avoid discontinuities

  5. Wire all feature choices into a DataPipelineConfig and pass it to DataPipeline

The five-step workflow

Raw data  →  Temporal features  →  AR features  →  Feature selection  →  DataPipelineConfig
(parquet)    (calendar signals)    (lags, windows)   (rank & filter)       (pipeline-ready)

Each section in this notebook maps to one step in the pipeline - by the end you will have a complete, reproducible feature-engineering configuration.

1. Setup

We import the following groups of libraries:

  • warnings: suppress noisy deprecation messages so output stays readable.

  • great_tables / lets_plot: render styled tables and interactive plots inside the notebook.

  • numpy / pandas: array maths and tabular data manipulation.

  • Twiga utilities: configure() sets up logging; get_logger() gives us a labelled logger for clean output.

  • Twiga feature transformers (imported later, in the sections that use them): TemporalFeatureTransformer adds calendar features; AutoregressTransformer builds lag and rolling-window columns.

  • Twiga plot helpers: plot_acf, plot_metrics_bar, and plot_timeseries are thin wrappers around LetsPlot tuned for time series work.

import warnings

warnings.filterwarnings("ignore")

from great_tables import GT
from lets_plot import LetsPlot, gggrid, ggsize
import numpy as np
import pandas as pd

LetsPlot.setup_html()

from twiga.core.plot import (
    plot_acf,
    plot_metrics_bar,
    plot_timeseries,
)
from twiga.core.plot.gt import twiga_gt
from twiga.core.utils import configure, get_logger

configure()
log = get_logger("tutorials")
# Load dataset  -  only the columns we need
COLUMNS = ["timestamp", "NetLoad(kW)", "Ghi", "Temperature"]

raw = pd.read_parquet("../data/MLVS-PT.parquet", columns=COLUMNS)
raw["timestamp"] = pd.to_datetime(raw["timestamp"])

# Filter to the study period
data = raw[(raw["timestamp"] >= "2019-01-01") & (raw["timestamp"] <= "2020-12-31")].copy()
data = data.reset_index(drop=True)

log.info(f"Shape : {data.shape}")
log.info(f"Period: {data['timestamp'].min()} -> {data['timestamp'].max()}")
twiga_gt(GT(data.head()))
2026-06-14 21:12:30 | INFO     | twiga.tutorials | Shape : (33553, 4)
2026-06-14 21:12:30 | INFO     | twiga.tutorials | Period: 2019-02-01 00:00:00+00:00 -> 2020-12-31 00:00:00+00:00
timestamp NetLoad(kW) Ghi Temperature
2019-02-01 00:00:00+00:00 38.37412615882276 0.0 17.2
2019-02-01 00:30:00+00:00 39.058097469929656 0.0 17.2
2019-02-01 01:00:00+00:00 38.37412615882276 0.0 17.2
2019-02-01 01:30:00+00:00 36.593270477039624 0.0 17.4
2019-02-01 02:00:00+00:00 36.23178104474715 0.0 17.5

2. Temporal Features

Key concept - Calendar features

A time series model sees numbers, not timestamps. Calendar features translate the human notion of time (“it’s 8 am on a Monday”) into numbers a model can use as inputs. The trick is to encode cyclic variables (hour 23 is close to hour 0, not far from it) using sin/cos pairs so the model sees the circular structure rather than a linear scale.

TemporalFeatureTransformer adds calendar signals to a DataFrame. This notebook uses four of them:

Feature      Type                       Description
hour         trigonometric (sin/cos)    Hour of day (0 - 23)
wday         trigonometric (sin/cos)    Day of week (0 - 6)
month        trigonometric (sin/cos)    Month of year (1 - 12)
day_night    binary                     1 = daytime, 0 = night

day_night and solar geometry: Twiga uses the astral library to compute accurate sunrise/sunset times per day for the specified location.
Latitude and longitude are therefore required whenever day_night is listed in calendar_features. The Madeira Island coordinates below correspond to the MLVS-PT measurement site.

from great_tables import GT, md
import pandas as pd

from twiga.core.plot.gt import twiga_gt

temporal_ref = pd.DataFrame(
    {
        "Feature": ["`hour`", "`wday`", "`month`", "`day_night`"],
        "Type": ["trigonometric (sin/cos)", "trigonometric (sin/cos)", "trigonometric (sin/cos)", "binary"],
        "Description": ["Hour of day (0–23)", "Day of week (0–6)", "Month of year (1–12)", "1 = daytime, 0 = night"],
        "Why it matters": [
            "Captures the daily demand cycle  -  morning peak, midday lull, evening peak",
            "Captures weekend vs. weekday demand differences",
            "Captures seasonal heating/cooling patterns across the year",
            "Direct proxy for solar generation (GHI ≈ 0 at night)",
        ],
    }
)

twiga_gt(
    GT(temporal_ref)
    .tab_header(
        title=md("**Calendar Features  -  Reference Guide**"),
        subtitle="All cyclic features are encoded as sin/cos pairs to preserve circular structure",
    )
    .cols_label(
        Feature=md("**Feature**"),
        Type=md("**Type**"),
        Description=md("**Description**"),
        **{"Why it matters": md("**Why it matters**")},
    )
    .tab_source_note("Twiga Forecast · NB03  -  Feature Engineering"),
    n_rows=len(temporal_ref),
)
Calendar Features - Reference Guide
All cyclic features are encoded as sin/cos pairs to preserve circular structure
Feature Type Description Why it matters
`hour` trigonometric (sin/cos) Hour of day (0–23) Captures the daily demand cycle - morning peak, midday lull, evening peak
`wday` trigonometric (sin/cos) Day of week (0–6) Captures weekend vs. weekday demand differences
`month` trigonometric (sin/cos) Month of year (1–12) Captures seasonal heating/cooling patterns across the year
`day_night` binary 1 = daytime, 0 = night Direct proxy for solar generation (GHI ≈ 0 at night)
Twiga Forecast · NB03 - Feature Engineering

Key concept - Temporal features

Raw timestamps carry rich information, but a model cannot interpret a datetime object directly. Temporal features extract the useful dimensions:

  • Calendar features (hour, weekday, month) - tell the model where in the daily or weekly cycle a sample falls. Net load is typically higher on weekday mornings than on Sunday nights; a model with no hour feature cannot learn this.

  • Day/night flag - Madeira Island receives meaningful solar irradiance only during daylight. A binary day/night column lets even a simple tree model split on light vs. dark without needing to reason about solar angles.

  • Solar-angle features - if you pass latitude and longitude, TemporalFeatureTransformer computes the exact solar elevation angle for each timestamp. This is more precise than a fixed day/night cutoff and captures seasonal variation in sunrise/sunset times.

All of these are computed once during fit_transform() and then reproduced identically at prediction time - ensuring train/test consistency.
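To make the idea of calendar features concrete, here is a minimal sketch that extracts the same kind of integer fields with plain pandas. It is not Twiga's implementation: the timestamps are hypothetical and the column names simply mirror some of those that appear in the transformer output.

```python
import pandas as pd

# Hypothetical timestamps (Fri, Sat, Sun, Mon at 08:30); not taken from MLVS-PT.
ts = pd.DataFrame(
    {"timestamp": pd.date_range("2020-06-19 08:30", periods=4, freq=pd.Timedelta(days=1))}
)

# Extract integer calendar fields with the pandas .dt accessor.
ts["hour"] = ts["timestamp"].dt.hour            # 0-23
ts["wday"] = ts["timestamp"].dt.dayofweek       # 0 = Monday ... 6 = Sunday
ts["month"] = ts["timestamp"].dt.month          # 1-12
ts["weekend"] = (ts["wday"] >= 5).astype(int)   # 1 on Saturday and Sunday

print(ts)
```

Each new column is a plain number, so a model can split or weight on it directly.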

from twiga.core.data.temporal import TemporalFeatureTransformer

temporal = TemporalFeatureTransformer(
    latitude=32.371666,
    longitude=-16.274998,
    calendar_features=["hour", "day_night", "wday", "month"],
)

data_with_temporal = temporal.fit_transform(data.copy())

# Which columns were added?
original_cols = set(data.columns)
new_cols = [c for c in data_with_temporal.columns if c not in original_cols]
log.info("Original columns : %s", list(original_cols))
log.info("New columns added: %s", new_cols)
2026-06-14 21:12:30 | INFO     | twiga.tutorials | Original columns : ['NetLoad(kW)', 'timestamp', 'Temperature', 'Ghi']
2026-06-14 21:12:30 | INFO     | twiga.tutorials | New columns added: ['index_num', 'year', 'year_iso', 'yearstart', 'yearend', 'leapyear', 'half', 'quarter', 'quarteryear', 'quarterstart', 'quarterend', 'month', 'month_lbl', 'monthstart', 'monthend', 'yweek', 'mweek', 'wday', 'wday_lbl', 'mday', 'qday', 'yday', 'weekend', 'hour', 'minute', 'second', 'msecond', 'nsecond', 'am_pm', 'day_night']
# Inspect the first few rows of the new temporal columns
data_with_temporal[["timestamp", "NetLoad(kW)"] + new_cols].head(8)
timestamp NetLoad(kW) index_num year year_iso yearstart yearend leapyear half quarter ... qday yday weekend hour minute second msecond nsecond am_pm day_night
0 2019-02-01 00:00:00 38.374126 1548979200 2019 2019 0 0 0 1 1 ... 32 32 0 0 0 0 0 0 am 0
1 2019-02-01 00:30:00 39.058097 1548981000 2019 2019 0 0 0 1 1 ... 32 32 0 0 30 0 0 0 am 0
2 2019-02-01 01:00:00 38.374126 1548982800 2019 2019 0 0 0 1 1 ... 32 32 0 1 0 0 0 0 am 0
3 2019-02-01 01:30:00 36.593270 1548984600 2019 2019 0 0 0 1 1 ... 32 32 0 1 30 0 0 0 am 0
4 2019-02-01 02:00:00 36.231781 1548986400 2019 2019 0 0 0 1 1 ... 32 32 0 2 0 0 0 0 am 0
5 2019-02-01 02:30:00 34.680502 1548988200 2019 2019 0 0 0 1 1 ... 32 32 0 2 30 0 0 0 am 0
6 2019-02-01 03:00:00 34.561256 1548990000 2019 2019 0 0 0 1 1 ... 32 32 0 3 0 0 0 0 am 0
7 2019-02-01 03:30:00 33.131887 1548991800 2019 2019 0 0 0 1 1 ... 32 32 0 3 30 0 0 0 am 0

8 rows × 32 columns

# Visualise day_night assignment over a single week
one_week = data_with_temporal[
    (data_with_temporal["timestamp"] >= "2020-06-15") & (data_with_temporal["timestamp"] < "2020-06-22")
].copy()

p = plot_timeseries(
    one_week,
    y_cols=["NetLoad(kW)"],
    date_col="timestamp",
    band_col="day_night",
    band_labels={0: "Night", 1: "Day"},
    title="NetLoad(kW) with day/night context  -  one week (Jun 2020)",
    y_label="NetLoad (kW)",
    x_label="Date",
)
p

3. Autoregressive Features

Key concept - Lag features

A lag feature is simply the value of the target variable from some number of steps in the past. If electricity demand follows a daily pattern, then “what was demand exactly 24 hours ago?” is an extremely useful input to predict “what will demand be now?”. Lag features let a machine-learning model exploit this temporal memory without you having to design a recurrence mechanism from scratch.

Key concept - Rolling windows

A rolling-window feature summarises recent history into a single number - the mean, standard deviation, or another aggregate over a sliding window. Where a lag feature answers “what was the value at time t−k?”, a rolling mean answers “what was the typical value over the last k steps?”. Rolling statistics smooth out noise and capture trend shifts that individual lags might miss.

AutoregressTransformer creates two families of features:

  • Lag features: value of the target n_samples * lag timesteps ago

  • Rolling features: aggregate (mean, std, …) over a window of n_samples * window timesteps

Understanding the n_samples multiplier

With 30-minute data there are 48 samples per day (n_samples=48).
Specifying lags=[1, 2, 7] therefore produces lags at:

lag value    Actual shift          Meaning
1            1 × 48 = 48 steps     1 day ago
2            2 × 48 = 96 steps     2 days ago
7            7 × 48 = 336 steps    7 days ago (same weekday)

Similarly, windows=[1, 2, 7] produces rolling windows of 48, 96, and 336 steps.
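The multiplier arithmetic can be reproduced with a plain pandas shift. This is a sketch on a synthetic series, not the AutoregressTransformer code; the column name `load` and the constant names are made up for illustration.

```python
import numpy as np
import pandas as pd

N_SAMPLES = 48            # 30-minute data: 48 samples per day
LAGS = [1, 2, 7]          # expressed in multiples of N_SAMPLES

# Synthetic series: 10 days of half-hourly values 0, 1, 2, ...
demo = pd.DataFrame({"load": np.arange(10 * N_SAMPLES, dtype=float)})

for lag in LAGS:
    steps = lag * N_SAMPLES                       # 48, 96, 336
    demo[f"load_lag_{steps}"] = demo["load"].shift(steps)

# Rows where at least one lag is still undefined (the warm-up period).
warm_up = demo.isna().any(axis=1).sum()
print(warm_up)  # 336
```

The warm-up length equals the largest shift, which is why the longest lag determines how many initial rows have to be discarded.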

from great_tables import GT, md
import pandas as pd

from twiga.core.plot.gt import twiga_gt

lag_ref = pd.DataFrame(
    {
        "lag value": ["1", "2", "7"],
        "Actual shift (30-min data)": ["1 × 48 = 48 steps", "2 × 48 = 96 steps", "7 × 48 = 336 steps"],
        "Meaning": ["1 day ago", "2 days ago", "7 days ago (same weekday)"],
        "Why useful": [
            "Yesterday's pattern is the strongest single predictor for today",
            "Short-term trend  -  captures whether demand is rising or falling",
            "Same time last week  -  accounts for weekly seasonality",
        ],
    }
)

twiga_gt(
    GT(lag_ref)
    .tab_header(
        title=md("**Lag Multiplier Reference  -  30-min data, `n_samples=48`**"),
        subtitle="Each lag value is multiplied by n_samples to get the actual timestep shift",
    )
    .cols_label(
        **{
            "lag value": md("**`lag` value**"),
            "Actual shift (30-min data)": md("**Actual shift**"),
            "Meaning": md("**Meaning**"),
            "Why useful": md("**Why useful**"),
        }
    )
    .tab_source_note("Twiga Forecast · NB03  -  Feature Engineering"),
    n_rows=len(lag_ref),
)
Lag Multiplier Reference - 30-min data, n_samples=48
Each lag value is multiplied by n_samples to get the actual timestep shift
lag value Actual shift Meaning Why useful
1 1 × 48 = 48 steps 1 day ago Yesterday's pattern is the strongest single predictor for today
2 2 × 48 = 96 steps 2 days ago Short-term trend - captures whether demand is rising or falling
7 7 × 48 = 336 steps 7 days ago (same weekday) Same time last week - accounts for weekly seasonality
Twiga Forecast · NB03 - Feature Engineering

Key concept - Lag features and rolling windows

Tree-based models (LightGBM, XGBoost, CatBoost) cannot process sequences - they see one flat feature vector per sample. To give them access to the past, we engineer explicit historical features:

  • Lag features - copy the target value from n_samples × lag steps ago. For 30-min data, lag=1 means 1 day ago (48 steps), lag=7 means 1 week ago (336 steps). The forecastability analysis in NB02 showed strong ACF peaks at lags 48 and 336 - those are the lags worth including.

  • Rolling-window statistics - summarise a window of recent values into a single number (mean, std, min, max). A rolling mean smooths short-term noise; a rolling std captures volatility. Window size should be at least as large as the dominant seasonal period.

Both features are aligned so that at time t the model only sees values from before time t - there is no look-ahead leakage. AutoregressTransformer handles this alignment automatically.
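To see what look-ahead leakage means for a rolling statistic, compare a window that includes the current value with one that is shifted by one step first. This is a generic pandas sketch on toy numbers; how AutoregressTransformer aligns its windows internally is not shown in this notebook, so treat the shift-then-roll pattern as one common approach rather than a description of Twiga.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="load")
WINDOW = 2

# Includes the value at time t in the window: leaks the target into its own feature.
leaky = s.rolling(WINDOW).mean()

# Shift by one step first, so the window only covers values strictly before t.
safe = s.shift(1).rolling(WINDOW).mean()

print(pd.DataFrame({"load": s, "leaky": leaky, "safe": safe}))
```

At index 2 the leaky mean is 2.5 (it averages 2 and 3, including the current value), while the safe mean is 1.5 (it averages 1 and 2 only).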

from twiga.core.data import AutoregressTransformer

auto_res = AutoregressTransformer(
    n_samples=48,
    lags=[1, 2, 7],  # 1 day, 2 days, 7 days ago
    windows=[1, 2, 7],  # rolling over 1, 2, 7 days
    window_funcs=["mean", "std"],
    value_column="NetLoad(kW)",
)

data_with_ar = auto_res.fit_transform(data.copy())

log.info(f"Rows before: {len(data):,}  |  Rows after: {len(data_with_ar):,}")
log.info(f"(dropped {len(data) - len(data_with_ar):,} rows to remove NaN warm-up period)")
2026-06-14 21:12:30 | INFO     | twiga.tutorials | Rows before: 33,553  |  Rows after: 33,217
2026-06-14 21:12:30 | INFO     | twiga.tutorials | (dropped 336 rows to remove NaN warm-up period)
# Inspect the lag columns generated
lag_cols = [c for c in data_with_ar.columns if "lag" in c]
rolling_cols = [c for c in data_with_ar.columns if "rolling" in c]

log.info("Lag columns    : %s", lag_cols)
log.info("Rolling columns: %s", rolling_cols)
2026-06-14 21:12:30 | INFO     | twiga.tutorials | Lag columns    : ['NetLoad(kW)_lag_48', 'NetLoad(kW)_lag_96', 'NetLoad(kW)_lag_336']
2026-06-14 21:12:30 | INFO     | twiga.tutorials | Rolling columns: ['NetLoad(kW)_rolling_mean_win_48', 'NetLoad(kW)_rolling_std_win_48', 'NetLoad(kW)_rolling_mean_win_96', 'NetLoad(kW)_rolling_std_win_96', 'NetLoad(kW)_rolling_mean_win_336', 'NetLoad(kW)_rolling_std_win_336']
# Show values for a few rows to verify alignment
data_with_ar[["timestamp", "NetLoad(kW)"] + lag_cols + rolling_cols[:4]].head(6)
timestamp NetLoad(kW) NetLoad(kW)_lag_48 NetLoad(kW)_lag_96 NetLoad(kW)_lag_336 NetLoad(kW)_rolling_mean_win_48 NetLoad(kW)_rolling_std_win_48 NetLoad(kW)_rolling_mean_win_96 NetLoad(kW)_rolling_std_win_96
336 2019-02-08 00:00:00+00:00 47.428033 38.933447 41.185390 38.374126 39.821914 10.118282 36.325537 14.014583
337 2019-02-08 00:30:00+00:00 46.096433 38.452917 40.333970 39.058097 39.981154 10.156353 36.385563 14.044241
338 2019-02-08 01:00:00+00:00 42.196757 38.452917 40.333970 38.374126 40.059151 10.158743 36.404967 14.051039
339 2019-02-08 01:30:00+00:00 40.917940 38.452917 40.109637 36.593270 40.110506 10.156680 36.413387 14.053524
340 2019-02-08 02:00:00+00:00 40.896120 37.013040 39.624477 36.231781 40.191403 10.146944 36.426633 14.057182
341 2019-02-08 02:30:00+00:00 39.381770 36.901040 39.555433 34.680502 40.243085 10.136140 36.424824 14.056786
from twiga.core.data.relevance import AssociationAnalyzer

analyzer = AssociationAnalyzer()

# Association scores: candidate features vs. NetLoad, one bar chart per method
feature_cols = ["Ghi", "Temperature", "hour", "day_night"] + lag_cols + rolling_cols

# Keep only columns that actually exist after transformation
feature_cols = [c for c in feature_cols if c in data_with_ar.columns]
plots = []
for method in ["pearson", "spearman", "kendall", "xicor", "pps", "mi", "anova"]:
    assoc_df = analyzer.compute(
        data=data_with_ar,
        variable_cols=feature_cols,
        target_col="NetLoad(kW)",
        method=method,
    )

    p = plot_metrics_bar(
        assoc_df,
        metric_col="score",  # This matches the 'score' column from AssociationAnalyzer
        model_col="feature",  # This matches the 'feature' column (your lags/rolling names)
        lower_is_better=False,
        title=f"{method.upper()}",
        x_label=" Score",
        horizontal=True,
        font_size=10,
    )
    plots.append(p)

gggrid(plots, ncol=2) + ggsize(1220, 1000)
# Quick correlation bar chart  -  lag and rolling features vs target
corr_data = data_with_ar[["NetLoad(kW)"] + lag_cols + rolling_cols].dropna()
corr_vals = corr_data.corr()["NetLoad(kW)"].drop("NetLoad(kW)").reset_index()
corr_vals.columns = ["Model", "Correlation"]

p = plot_metrics_bar(
    corr_vals,
    metric_col="Correlation",
    model_col="Model",
    lower_is_better=False,
    title="Pearson correlation of AR features with NetLoad(kW)",
    x_label="Correlation",
    horizontal=True,
)
p

4. Feature Selection

Key concept - Feature matrix

After applying temporal and autoregressive transformers, we have a feature matrix - a table where each row is one timestep and each column is one input signal the model will see. Feature matrices can easily have 50 - 200 columns, many of which are redundant or noisy. Feature selection trims this down to the most informative subset, reducing overfitting risk and training time.

select_top_features ranks candidate features using a multi-metric ensemble:

  1. Pearson correlation (absolute value)

  2. ANOVA F-score (linear separability)

  3. Mutual information (non-linear dependency)

  4. Random Forest importance (tree-based, captures interactions)

Individual ranks are aggregated via Borda count and the top-k features are returned.
This multi-metric approach is more robust than any single score, especially for non-linear time series patterns.
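Borda-count aggregation itself is simple to sketch. The example below uses made-up scores for three features and three metrics; it illustrates the ranking idea only and is not the select_top_features implementation, whose tie handling and point scheme may differ.

```python
import pandas as pd

# Hypothetical scores for three features under three metrics (higher = better).
scores = pd.DataFrame(
    {
        "pearson": [0.80, 0.60, 0.30],
        "mutual_info": [0.50, 0.70, 0.20],
        "rf_importance": [0.45, 0.40, 0.15],
    },
    index=["lag_48", "lag_96", "lag_336"],
)

# Rank features within each metric: the best feature gets rank 1.
ranks = scores.rank(ascending=False, method="average")

# Borda points: with n features, rank 1 earns n - 1 points and rank n earns 0.
borda = (len(scores) - ranks).sum(axis=1).sort_values(ascending=False)

top_k = borda.head(2).index.tolist()
print(top_k)  # ['lag_48', 'lag_96']
```

Because only ranks are summed, a metric with a very large numeric range (such as an F-score) cannot dominate the others.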

from twiga.core.data import select_top_features

# Drop rows with NaN before selection
ar_clean = data_with_ar.dropna().copy()

lag_cols = [c for c in ar_clean.columns if "lag" in c]
rolling_cols = [c for c in ar_clean.columns if "rolling" in c]

top_lags = select_top_features(
    data=ar_clean,
    features=lag_cols,
    target="NetLoad(kW)",
    top_k=2,
)

top_rolling = select_top_features(
    data=ar_clean,
    features=rolling_cols,
    target="NetLoad(kW)",
    top_k=2,
)

log.info("Top lag features    : %s", top_lags)
log.info("Top rolling features: %s", top_rolling)
2026-06-14 21:12:33 | INFO     | twiga.core.data.selection | Selected top 2 features using borda_count
2026-06-14 21:12:35 | INFO     | twiga.core.data.selection | Selected top 2 features using borda_count
2026-06-14 21:12:35 | INFO     | twiga.tutorials | Top lag features    : ['NetLoad(kW)_lag_48', 'NetLoad(kW)_lag_96']
2026-06-14 21:12:35 | INFO     | twiga.tutorials | Top rolling features: ['NetLoad(kW)_rolling_mean_win_48', 'NetLoad(kW)_rolling_mean_win_336']
# Visualise the selected lag features side-by-side with the target
sample = ar_clean[(ar_clean["timestamp"] >= "2019-03-01") & (ar_clean["timestamp"] < "2019-03-08")].copy()

p = plot_timeseries(
    sample,
    y_cols=["NetLoad(kW)"] + top_lags,
    date_col="timestamp",
    title="Top selected lag features vs target (one week)",
    y_label="kW",
    x_label="Date",
    series_line_size=0.8,
    fig_size=(960, 300),
)
p

5. Fourier Features

Calendar variables like hour are cyclic: hour 23 is closer to hour 0 than to hour 12, but a plain integer does not express this.
Fourier encoding projects each value onto a unit circle:

\[ \text{hour\_sin} = \sin\!\left(\frac{2\pi \cdot h}{24}\right), \quad \text{hour\_cos} = \cos\!\left(\frac{2\pi \cdot h}{24}\right) \]

A third convenience column hour_cosin = hour_sin + hour_cos is also added.

TemporalFeatureTransformer applies this automatically for trigonometric features (hour, wday, month).
add_fourier_features lets you do it manually for any column.

Key concept - Fourier encoding for cyclic variables

Calendar variables like hour (0 - 23) and month (1 - 12) are cyclic: hour 23 is just one step away from hour 0, but the integer 23 is far from 0 in Euclidean space. A tree model or neural network that receives raw integers will not naturally understand this wrap-around.

Fourier encoding projects each value onto a unit circle using sine and cosine:

sin_value = sin(2π × value / period)
cos_value = cos(2π × value / period)

For hour with period 24, hours 0 and 23 land at nearly the same point on the circle. The model receives two continuous numbers (sin, cos) per cyclic feature instead of a raw integer - the circular distance is now correctly represented.

Use Fourier encoding whenever a calendar variable has natural wrap-around: hour (period 24), weekday (period 7), month (period 12), day-of-year (period 365).
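The claim that hours 0 and 23 end up close together can be checked numerically. The helper below is a hypothetical NumPy function written for this illustration (it is not add_fourier_features); it applies the sin/cos formulas above and compares distances in the encoded space.

```python
import numpy as np


def fourier_encode(value, period):
    """Return the (sin, cos) coordinates of `value` on a circle of length `period`."""
    angle = 2 * np.pi * np.asarray(value) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


h0, h12, h23 = fourier_encode([0, 12, 23], period=24)

# Euclidean distance in the encoded space reflects circular distance.
d_23_0 = np.linalg.norm(h23 - h0)   # one hour apart, across midnight
d_12_0 = np.linalg.norm(h12 - h0)   # half a day apart

print(round(d_23_0, 3), round(d_12_0, 3))  # 0.261 2.0
```

As raw integers, 23 and 0 differ by 23 while 12 and 0 differ by 12; after encoding, the ordering is reversed, which matches the clock.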

from twiga.core.data import add_fourier_features

# We start from data_with_temporal which already has the 'hour' integer column
data_fourier = add_fourier_features(
    data_with_temporal.copy(),
    calendar_variables=["hour"],
    periods=[24],
)

fourier_cols = ["hour", "hour_sin", "hour_cos", "hour_cosin"]
log.info("Fourier columns added:")
data_fourier[fourier_cols].drop_duplicates(subset="hour").sort_values("hour").head(6)
2026-06-14 21:12:35 | INFO     | twiga.tutorials | Fourier columns added:
hour hour_sin hour_cos hour_cosin
0 0 0.000000 1.000000 1.000000
2 1 0.258819 0.965926 1.224745
4 2 0.500000 0.866025 1.366025
6 3 0.707107 0.707107 1.414214
8 4 0.866025 0.500000 1.366025
10 5 0.965926 0.258819 1.224745
# Visualise the sin/cos encoding of hour over one full day
one_day = data_fourier[fourier_cols].drop_duplicates(subset="hour").sort_values("hour").copy()
one_day["hour_step"] = one_day["hour"].astype(int)

p = plot_timeseries(
    one_day,
    y_cols=["hour_sin", "hour_cos", "hour_cosin"],
    date_col="hour_step",
    title="Fourier encoding of 'hour' (period = 24)",
    y_label="Fourier value",
    x_label="Hour of day",
    fig_size=(300, 250),
    series_line_size=0.8,
)
p

6. Connecting to DataPipelineConfig

DataPipelineConfig is a Pydantic model that acts as the single source of truth for everything the pipeline needs to know.
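The selected features from Section 4 can be passed explicitly through past_features (Option B in the code comments below); the configuration in this section instead lets the pipeline compute the autoregressive features from lags and windows (Option A).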

Key concept - The feature matrix

DataPipeline transforms a raw DataFrame into a pair of 3-D NumPy arrays:

  • X - shape (n_samples, lookback_window_size, n_features) - the input tensor fed to the model. Each sample is a window of lookback_window_size timesteps, each with n_features values (target history + calendar + exogenous + lag + rolling columns).

  • y - shape (n_samples, forecast_horizon, n_targets) - the corresponding future target values the model must predict.

This 3-D format is the universal interface between feature engineering and all Twiga models (ML and NN alike). DataPipeline.fit() learns the scalers from training data; DataPipeline.transform() applies them to any split without refitting - preventing data leakage.
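The fit-on-train, transform-everywhere rule can be illustrated with a hand-written robust scaler. This sketch assumes that "robust" scaling means centring on the median and dividing by the interquartile range, as scikit-learn's RobustScaler does by default; the two helper functions are hypothetical and are not part of Twiga.

```python
import numpy as np


def fit_robust(train):
    """Learn the median and interquartile range from training data only."""
    q25, median, q75 = np.percentile(train, [25, 50, 75])
    return median, q75 - q25


def apply_robust(values, median, iqr):
    """Scale any split with statistics learned from the training split."""
    return (np.asarray(values) - median) / iqr


train = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
test = np.array([30.0, 70.0])

median, iqr = fit_robust(train)                  # median = 30.0, IQR = 40.0 - 20.0
test_scaled = apply_robust(test, median, iqr)

print(test_scaled)  # [0. 2.]
```

The test value 70 lies outside the training range, yet it is scaled with the training statistics; refitting on the test split would let information from the evaluation period leak into preprocessing.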

from twiga.core.config import DataPipelineConfig

# Two ways to specify autoregressive features in DataPipelineConfig:
#
#   Option A  -  let the pipeline compute them (standard path with TwigaForecaster):
#     lags=[1,7,48], windows=[1,7], window_funcs=["mean"]
#     → AutoregressTransformer runs internally; pass raw DataFrames to fit/predict
#
#   Option B  -  pre-compute externally and name the columns:
#     past_features=["lag_48", "lag_336", ...]
#     → you must pass the enriched DataFrame (with those columns) everywhere
#
# For TwigaForecaster (the standard path) always use Option A.
data_config = DataPipelineConfig(
    target_feature="NetLoad(kW)",
    period="30min",
    latitude=32.371666,
    longitude=-16.274998,
    calendar_features=["hour", "day_night"],
    known_future_features=["Ghi"],
    lags=[1, 7, 48],  # multiples of one day (48 steps): pipeline computes lag_48, lag_336, lag_2304 (48 days back)
    windows=[1, 7],  # rolling windows of 48 and 336 steps
    window_funcs=["mean"],
    forecast_horizon=48,
    window_stride=48,
    lookback_window_size=96,
    input_scaler="robust",
    target_scaler="robust",
)

log.info("\n%s", data_config.model_dump(exclude={"input_scaler", "target_scaler"}))
2026-06-14 21:12:35 | INFO     | twiga.tutorials | 
{'target_feature': 'NetLoad(kW)', 'period': '30min', 'lookback_window_size': 96, 'forecast_horizon': 48, 'latitude': 32.371666, 'longitude': -16.274998, 'past_features': None, 'calendar_features': ['hour', 'day_night'], 'known_future_features': ['Ghi'], 'forecast_period_features': None, 'lags': [1, 7, 48], 'windows': [1, 7], 'window_funcs': ['mean'], 'date_column': 'timestamp', 'n_jobs': 1, 'window_stride': 48}
# Summarise what the config holds
log.info("=== DataPipelineConfig summary ===")
log.info(f"  Target        : {data_config.target_feature}")
log.info(f"  Period        : {data_config.period}")
log.info(f"  Horizon       : {data_config.forecast_horizon} steps")
log.info(f"  Lookback      : {data_config.lookback_window_size} steps")
log.info(f"  Calendar feats: {data_config.calendar_features}")
log.info(f"  Exogenous     : {data_config.known_future_features}")
log.info(f"  Historical    : {data_config.past_features}")
log.info(f"  Input scaler  : {data_config.input_scaler}")
log.info(f"  Target scaler : {data_config.target_scaler}")
2026-06-14 21:12:35 | INFO     | twiga.tutorials | === DataPipelineConfig summary ===
2026-06-14 21:12:35 | INFO     | twiga.tutorials |   Target        : NetLoad(kW)
2026-06-14 21:12:35 | INFO     | twiga.tutorials |   Period        : 30min
2026-06-14 21:12:35 | INFO     | twiga.tutorials |   Horizon       : 48 steps
2026-06-14 21:12:35 | INFO     | twiga.tutorials |   Lookback      : 96 steps
2026-06-14 21:12:35 | INFO     | twiga.tutorials |   Calendar feats: ['hour', 'day_night']
2026-06-14 21:12:35 | INFO     | twiga.tutorials |   Exogenous     : ['Ghi']
2026-06-14 21:12:35 | INFO     | twiga.tutorials |   Historical    : None
2026-06-14 21:12:35 | INFO     | twiga.tutorials |   Input scaler  : robust
2026-06-14 21:12:35 | INFO     | twiga.tutorials |   Target scaler : robust

Validation at construction time

Because DataPipelineConfig is a Pydantic model, invalid values are rejected immediately. Try uncommenting the cell below to see:

# Uncomment to see Pydantic validation in action
# from twiga.core.config import DataPipelineConfig
# bad_config = DataPipelineConfig(
#     target_feature="NetLoad(kW)",
#     period="not-a-period",   # <-- invalid
#     forecast_horizon=48,
#     lookback_window_size=96,
# )
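If you want to see the fail-fast behaviour without depending on Twiga's actual validation rules (which this notebook does not list), a small stand-in model shows the mechanism. DemoConfig and its constraints are hypothetical; only the Pydantic calls (BaseModel, Field, ValidationError) are real.

```python
from pydantic import BaseModel, Field, ValidationError


class DemoConfig(BaseModel):
    """Hypothetical config: field names mirror the tutorial, the rules are illustrative."""

    target_feature: str
    forecast_horizon: int = Field(gt=0)
    lookback_window_size: int = Field(gt=0)


try:
    # A horizon of 0 violates the gt=0 constraint, so construction fails.
    DemoConfig(target_feature="NetLoad(kW)", forecast_horizon=0, lookback_window_size=96)
    rejected = False
except ValidationError as err:
    rejected = True
    print(err)  # the message names the offending field, forecast_horizon
```

The error is raised when the object is built, long before any data is loaded or a model is trained.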

7. How DataPipeline transforms internally

DataPipeline is a scikit-learn TransformerMixin that builds a processing chain from a DataPipelineConfig.
Internally it:

  1. Runs TemporalFeatureTransformer (if calendar_features are set)

  2. Runs AutoregressTransformer (if lags or windows are set)

  3. Applies input_scaler to numerical features and target_scaler to the target

  4. Slices the scaled data into overlapping (lookback, horizon) sequence windows

The output X has shape (n_samples, lookback_window_size, n_features) and
y has shape (n_samples, forecast_horizon, n_targets) - ready to feed directly into a neural network.
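The windowing in step 4 can be sketched with NumPy slicing. This is a simplified illustration on a tiny synthetic array: make_windows is a hypothetical helper, the target is assumed to sit in column 0, and Twiga's own windowing may lay out the lookback and horizon differently.

```python
import numpy as np


def make_windows(values, lookback, horizon, stride=1):
    """Slice a (n_steps, n_features) array into (X, y) sequence windows.

    The target is assumed to be column 0 of `values`.
    """
    total = lookback + horizon
    starts = range(0, len(values) - total + 1, stride)
    X = np.stack([values[s : s + lookback] for s in starts])
    y = np.stack([values[s + lookback : s + total, :1] for s in starts])
    return X, y


# 10 timesteps, 2 features; feature 0 plays the role of the target.
values = np.arange(20, dtype=float).reshape(10, 2)
X, y = make_windows(values, lookback=4, horizon=2)

print(X.shape, y.shape)  # (5, 4, 2) (5, 2, 1)
```

With a stride of 1, a series of n steps yields n - (lookback + horizon) + 1 windows, here 10 - 6 + 1 = 5; a larger stride produces fewer, less overlapping samples.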

from twiga.core.data import DataPipeline

# DataPipeline takes the same individual parameters as DataPipelineConfig fields.
# Using lags/windows/window_funcs means the pipeline creates AR features internally
#  -  pass the raw DataFrame (no pre-computation needed).
pipeline = DataPipeline(
    target_feature=data_config.target_feature,
    period=data_config.period,
    lookback_window_size=data_config.lookback_window_size,
    forecast_horizon=data_config.forecast_horizon,
    latitude=data_config.latitude,
    longitude=data_config.longitude,
    calendar_features=data_config.calendar_features,
    known_future_features=data_config.known_future_features,
    lags=data_config.lags,
    windows=data_config.windows,
    window_funcs=data_config.window_funcs,
    input_scaler=data_config.input_scaler,
    target_scaler=data_config.target_scaler,
)

# Pass raw train_df  -  pipeline handles all feature computation internally
train_df = data[data["timestamp"] < "2020-01-01"].copy()
pipeline.fit(train_df)
log.info("Pipeline fitted successfully.")
2026-06-14 21:12:36 | INFO     | twiga.tutorials | Pipeline fitted successfully.
X, y = pipeline.transform(train_df)

log.info("X shape: %s", X.shape)  # (n_samples, lookback_window_size, n_features)
log.info("y shape: %s", y.shape)  # (n_samples, forecast_horizon, n_targets)
log.info("")
log.info(f"Each sample provides {X.shape[1]} lookback steps and {y.shape[1]} forecast steps.")
log.info(f"The model sees {X.shape[2]} input features per timestep.")
2026-06-14 21:12:36 | INFO     | twiga.tutorials | X shape: (13585, 144, 9)
2026-06-14 21:12:36 | INFO     | twiga.tutorials | y shape: (13585, 48, 1)
2026-06-14 21:12:36 | INFO     | twiga.tutorials | 
2026-06-14 21:12:36 | INFO     | twiga.tutorials | Each sample provides 144 lookback steps and 48 forecast steps.
2026-06-14 21:12:36 | INFO     | twiga.tutorials | The model sees 9 input features per timestep.
# Visualise a single training window
SAMPLE_IDX = 100  # arbitrary sample

lookback_vals = X[SAMPLE_IDX, :, 0].tolist()
horizon_vals = y[SAMPLE_IDX, :, 0].tolist()

n_lookback = X.shape[1]
n_horizon = y.shape[1]

window_df = pd.DataFrame(
    {
        "step": list(range(n_lookback)) + list(range(n_lookback, n_lookback + n_horizon)),
        "lookback": lookback_vals + [None] * n_horizon,
        "horizon": [None] * n_lookback + horizon_vals,
    }
)

p = plot_timeseries(
    window_df,
    y_cols=["lookback", "horizon"],
    date_col="step",
    title=f"Training window #{SAMPLE_IDX}: lookback ({n_lookback}) + horizon ({n_horizon})",
    y_label="Scaled value",
    x_label="Timestep",
    fig_size=(900, 300),
)
p

8. Summary

In this notebook you learned how to:

Step                       Tool                                 Output
Temporal features          TemporalFeatureTransformer           hour_sin/cos, wday_sin/cos, month_sin/cos, day_night
Autoregressive features    AutoregressTransformer               lag_NNN, rolling_NNN_mean/std
Feature selection          select_top_features                  top-k ranked feature names
Fourier encoding           add_fourier_features                 sin/cos columns for any cyclic variable
Pipeline wiring            DataPipelineConfig + DataPipeline    3-D arrays (n_samples, lookback, features) ready for a model


from great_tables import GT, md
import pandas as pd

from twiga.core.plot.gt import twiga_gt

summary_ref = pd.DataFrame(
    {
        "Step": [
            "Temporal features",
            "Autoregressive features",
            "Feature selection",
            "Fourier encoding",
            "Pipeline wiring",
        ],
        "Tool": [
            "`TemporalFeatureTransformer`",
            "`AutoregressTransformer`",
            "`select_top_features`",
            "`add_fourier_features`",
            "`DataPipelineConfig` + `DataPipeline`",
        ],
        "Output": [
            "hour_sin/cos, wday_sin/cos, month_sin/cos, day_night",
            "lag_NNN, rolling_NNN_mean/std",
            "top-k ranked feature names",
            "sin/cos columns for any cyclic variable",
            "3-D arrays (n_samples, lookback, features) ready for a model",
        ],
    }
)

twiga_gt(
    GT(summary_ref)
    .tab_header(
        title=md("**NB03  -  Feature Engineering Summary**"),
        subtitle="Every step produces a concrete artifact consumed by the next step",
    )
    .cols_label(
        Step=md("**Step**"),
        Tool=md("**Tool**"),
        Output=md("**Output**"),
    )
    .tab_source_note("Twiga Forecast · NB03  -  Feature Engineering"),
    n_rows=len(summary_ref),
)
NB03 - Feature Engineering Summary
Every step produces a concrete artifact consumed by the next step

| Step | Tool | Output |
|---|---|---|
| Temporal features | `TemporalFeatureTransformer` | hour_sin/cos, wday_sin/cos, month_sin/cos, day_night |
| Autoregressive features | `AutoregressTransformer` | lag_NNN, rolling_NNN_mean/std |
| Feature selection | `select_top_features` | top-k ranked feature names |
| Fourier encoding | `add_fourier_features` | sin/cos columns for any cyclic variable |
| Pipeline wiring | `DataPipelineConfig` + `DataPipeline` | 3-D arrays (n_samples, lookback, features) ready for a model |

Twiga Forecast · NB03 - Feature Engineering

from great_tables import GT, md

from twiga.core.plot.gt import twiga_gt

summary_df = pd.DataFrame(
    {
        "Step": [
            "Temporal features",
            "Autoregressive features",
            "Feature selection",
            "Fourier encoding",
            "Pipeline wiring",
        ],
        "Tool": [
            "`TemporalFeatureTransformer`",
            "`AutoregressTransformer`",
            "`select_top_features`",
            "`add_fourier_features`",
            "`DataPipelineConfig` + `DataPipeline`",
        ],
        "Output": [
            "hour_sin/cos, wday_sin/cos, month_sin/cos, day_night",
            "lag_NNN, rolling_NNN_mean/std",
            "top-k ranked feature names",
            "sin/cos columns for any cyclic variable",
            "3-D arrays (n_samples, lookback, features) ready for a model",
        ],
        "Config parameter": [
            "`calendar_features`, `latitude`, `longitude`",
            "`lags`, `windows`, `window_funcs`",
            "`past_features`",
            "Apply before passing to pipeline",
            "`DataPipelineConfig` fields",
        ],
    }
)

twiga_gt(
    GT(summary_df)
    .tab_header(
        title=md("**NB03  -  Feature Engineering Summary**"),
        subtitle="Five steps from raw DataFrame to model-ready arrays",
    )
    .cols_label(
        Step=md("**Step**"),
        Tool=md("**Tool**"),
        Output=md("**Output**"),
        **{"Config parameter": md("**Config parameter**")},
    )
    .tab_source_note("Twiga Forecast · NB03  -  Feature Engineering"),
    n_rows=len(summary_df),
)
NB03 - Feature Engineering Summary
Five steps from raw DataFrame to model-ready arrays

| Step | Tool | Output | Config parameter |
|---|---|---|---|
| Temporal features | `TemporalFeatureTransformer` | hour_sin/cos, wday_sin/cos, month_sin/cos, day_night | `calendar_features`, `latitude`, `longitude` |
| Autoregressive features | `AutoregressTransformer` | lag_NNN, rolling_NNN_mean/std | `lags`, `windows`, `window_funcs` |
| Feature selection | `select_top_features` | top-k ranked feature names | `past_features` |
| Fourier encoding | `add_fourier_features` | sin/cos columns for any cyclic variable | Apply before passing to pipeline |
| Pipeline wiring | `DataPipelineConfig` + `DataPipeline` | 3-D arrays (n_samples, lookback, features) ready for a model | `DataPipelineConfig` fields |

Twiga Forecast · NB03 - Feature Engineering

9. EDA: Visual Feature Inspection#

Before passing engineered features to a model it is worth spending a few minutes on exploratory visualisation. The three helpers below give fast answers to common pre-modelling questions:

| Question | Tool |
|---|---|
| Does the raw signal look as expected? | line_plot - any 1-D window, no DataFrame needed |
| Do my covariates correlate with the target? | scatter_plot - colour-encoded scatter + LOESS trend |
| Are lag features jointly informative? | scatter_matrix - pair-plot grid across features and target |

All three accept the same Twiga theme parameters (font_size, grid, legend_pos, …) as every other plotting utility in the library.

from twiga.core.plot import line_plot, scatter_matrix, scatter_plot

line_plot: one-week load profile#

line_plot accepts a plain Python list or 1-D NumPy array - no DataFrame required. It is the fastest way to sanity-check a raw signal, inspect seasonality, or verify that a transformation (differencing, scaling) behaved as expected.

When to use it: quick visual check on any univariate sequence before or after feature engineering.

# line_plot takes a plain 1-D sequence  -  no DataFrame needed
sample_week = data_with_temporal[
    (data_with_temporal["timestamp"] >= "2019-06-01") & (data_with_temporal["timestamp"] < "2019-06-08")
]["NetLoad(kW)"].values

p = line_plot(
    x=None,
    y=sample_week,
    title="Net Load  -  one week (Jun 2019)",
    y_label="Net Load (kW)",
    x_label="30-min step",
    fig_size=(900, 300),
)
p

scatter_plot: GHI vs Net Load by day/night#

scatter_plot renders a 2-D scatter with optional colour grouping and a LOESS smoothing curve. It is ideal for assessing whether a covariate has a meaningful (possibly non-linear) relationship with the target before committing to feature selection.

The colour-grouping column (passed as color_col in the call below) should contain string labels. Numeric flags like day_night should be mapped to strings first - see the .assign(group=…) call below.

When to use it: check covariate-target correlations and detect regime differences (e.g. day vs night, weekday vs weekend) that a model should learn.

scatter_df = (
    data_with_temporal[["Ghi", "NetLoad(kW)", "day_night"]]
    .dropna()
    .rename(columns={"Ghi": "cycle", "NetLoad(kW)": "value", "day_night": "group"})
    .assign(group=lambda df: df["group"].map({0: "Night", 1: "Day"}))
)

p = scatter_plot(
    scatter_df,
    x_col="cycle",
    y_col="value",
    color_col="group",
    title="GHI vs Net Load  -  coloured by day / night",
)
p
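
A scatter plot is easier to act on when paired with a number. The sketch below computes the Pearson correlation overall and within the daytime regime using plain pandas. The data is synthetic and the linear GHI-to-load relationship is invented purely for illustration; `demo_df` and its column names are hypothetical and do not come from the MLVS-PT dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Synthetic stand-in: irradiance is zero at night, positive by day
is_day = rng.integers(0, 2, size=n)
ghi = is_day * rng.uniform(50, 900, size=n)
# Hypothetical relationship: solar generation lowers net load by day
net_load = 40 - 0.03 * ghi + rng.normal(0, 3, size=n)

demo_df = pd.DataFrame(
    {
        "ghi": ghi,
        "net_load": net_load,
        "group": np.where(is_day == 1, "Day", "Night"),
    }
)

# Correlation over all rows, then within the daytime regime only.
# Night-time GHI is constant (zero), so its correlation is undefined.
overall_corr = demo_df["ghi"].corr(demo_df["net_load"])
day_rows = demo_df[demo_df["group"] == "Day"]
day_corr = day_rows["ghi"].corr(day_rows["net_load"])

print(f"overall: {overall_corr:.2f}, day only: {day_corr:.2f}")
```

Comparing the per-regime figure with the overall one is a quick way to see whether a covariate matters only in part of the data, which is the kind of regime difference the colour grouping is meant to expose.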

scatter_matrix: lag features vs target#

scatter_matrix renders a pair-plot grid across the selected feature columns and the target. Each off-diagonal cell is a scatter; the diagonal shows the marginal distribution. Colour groups (e.g. day / night) make regime-specific patterns immediately visible.

Use this after AutoregressTransformer to confirm that lag features carry signal and to spot multicollinearity between lags before training.

When to use it: validate that autoregressive features are jointly informative and identify any redundant lags you might want to drop.

matrix_df = (
    data_with_ar.join(data_with_temporal[["day_night"]], how="left")
    .dropna()
    .assign(group=lambda df: df["day_night"].map({0: "Night", 1: "Day"}))
)

lag_cols = [c for c in matrix_df.columns if "_lag_" in c][:3]

p = scatter_matrix(
    matrix_df,
    variables=lag_cols,
    targets=["NetLoad(kW)"],
    hue_col="group",
    n_sample=1000,
    title="Lag features vs Net Load",
    sort_by_variance=True,
    diag="density",
    fig_size=(500, 500),
)
p
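
Multicollinearity between lags can also be checked numerically. The following sketch builds a few lag columns on a synthetic 30-min series with plain pandas and lists the lag pairs whose absolute correlation exceeds a threshold. The series, the `lag_NNN` column names, and the 0.85 cut-off are assumptions chosen for the example, not values taken from this notebook or from AutoregressTransformer.

```python
import numpy as np
import pandas as pd

# Synthetic 30-min series with a daily cycle (48 steps per day)
rng = np.random.default_rng(1)
t = np.arange(48 * 60)
load = 10 + 5 * np.sin(2 * np.pi * t / 48) + rng.normal(0, 1, size=t.size)

lags_df = pd.DataFrame({"target": load})
for lag in (48, 96, 336):  # 1 day, 2 days, 1 week back
    lags_df[f"lag_{lag}"] = lags_df["target"].shift(lag)
lags_df = lags_df.dropna()

corr = lags_df.corr().abs()

THRESHOLD = 0.85  # arbitrary cut-off for this sketch
demo_lag_cols = [c for c in corr.columns if c.startswith("lag_")]
# Visit each unordered lag pair once and keep the highly correlated ones
redundant_pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(demo_lag_cols)
    for b in demo_lag_cols[i + 1:]
    if corr.loc[a, b] > THRESHOLD
]

for a, b, r in redundant_pairs:
    print(f"{a} vs {b}: |r| = {r:.2f}")
```

On a strongly periodic signal like this one, lags at whole multiples of the period carry nearly the same information, so every pair is flagged. On real data the list is a starting point for deciding which lags to drop.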

Wrapping up#

What you did

  • Added cyclic calendar signals (hour, weekday, month, day/night) using TemporalFeatureTransformer

  • Built lag features and rolling-window statistics using AutoregressTransformer

  • Ranked and selected the most informative features using select_top_features

  • Encoded cyclic variables with Fourier (sin/cos) terms to eliminate boundary discontinuities

  • Wired all feature choices into a DataPipelineConfig and verified the pipeline output shape
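
To make the point about boundary discontinuities concrete, here is a minimal NumPy sketch of sin/cos encoding. It is a generic illustration rather than the code behind TemporalFeatureTransformer or add_fourier_features; `encode_cyclic` is a hypothetical helper written for this example.

```python
import numpy as np


def encode_cyclic(values, period):
    """Map values with the given period onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


hours = np.array([0, 12, 23])
encoded = encode_cyclic(hours, period=24)

# Euclidean distance between encoded hours
d_23_0 = float(np.linalg.norm(encoded[2] - encoded[0]))  # across midnight
d_12_0 = float(np.linalg.norm(encoded[1] - encoded[0]))  # opposite sides of the day

print(f"23h vs 0h: {d_23_0:.3f}")
print(f"12h vs 0h: {d_12_0:.3f}")
```

As raw integers, hours 23 and 0 are 23 units apart. After encoding, their distance is about 0.26, while hours 12 and 0 sit at the maximum distance of 2.0, which matches how the hours relate on a clock face.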

Key takeaways for beginners

  1. Calendar features must be encoded cyclically - plain integer hours (0-23) tell the model that hour 23 and hour 0 are far apart, which is wrong. Sin/cos encoding places them next to each other on a circle.

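  2. Lag = yesterday at this time - a one-day lag on 30-min data is 48 steps back (24 hours × 2 steps per hour), whereas a lag of 1 step reaches back only 30 minutes. Always multiply your “intuitive” lag in days by n_samples, the number of samples per day.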

  3. Rolling means smooth out noise - individual lag values can be noisy; rolling means give the model a stable picture of recent average demand.

  4. Feature selection prevents overfitting - more features is not always better. select_top_features uses four complementary metrics to keep only columns that genuinely predict the target.

  5. DataPipelineConfig is the single source of truth - all feature choices live in one Pydantic config object, making experiments reproducible and easy to share.
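
The lag and rolling-mean takeaways can be sketched with plain pandas. This is an illustration under stated assumptions, not what AutoregressTransformer does internally: `days_to_steps`, the column names, and the one-step shift before the rolling mean (used here so the current value does not leak into its own feature) are choices made for this example.

```python
import pandas as pd

FREQ = "30min"
# Number of samples per day at this sampling rate
steps_per_day = pd.Timedelta("1D") // pd.Timedelta(FREQ)


def days_to_steps(days: float) -> int:
    """Convert an 'intuitive' lag in days to a number of time steps."""
    return int(days * steps_per_day)


# Toy series: three days of 30-min data with values 0, 1, 2, ...
idx = pd.date_range("2019-06-01", periods=steps_per_day * 3, freq=FREQ)
s = pd.Series(range(len(idx)), index=idx, dtype=float, name="load")

features = pd.DataFrame(
    {
        "load": s,
        # Value observed at the same time yesterday
        "lag_1d": s.shift(days_to_steps(1)),
        # Mean of the previous 24 hours, excluding the current step
        "roll_mean_1d": s.shift(1).rolling(days_to_steps(1)).mean(),
    }
)

print(steps_per_day)
print(features.iloc[steps_per_day])
```

The first full day of both feature columns is NaN because there is no history to look back on, which is why the earlier cells drop missing rows before plotting or training.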


What’s next?#

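NB04 - Time Series Differencing covers stationarity, first-order and seasonal differencing, and how to invert them. NB05 - ML Point Forecasting then shows how to pass DataPipelineConfig into TwigaForecaster, select an ML model (LightGBM, XGBoost, CatBoost, or Linear Regression), and evaluate point predictions using the built-in metrics module.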

# ruff: noqa: E501, E701, E702
from IPython.display import HTML

_TEAL = "#107591"
_TEAL_MID = "#069fac"
_TEAL_LIGHT = "#e8f5f8"
_TEAL_BEST = "#d0ecf1"
_TEXT_DARK = "#2d3748"
_TEXT_MUTED = "#718096"
_WHITE = "#ffffff"

steps = [
    {
        "num": "01",
        "title": "Getting Started",
        "desc": "Load data · configure pipeline · train LightGBM · evaluate",
        "tags": ["data", "config", "train"],
        "active": False,
    },
    {
        "num": "02",
        "title": "Forecastability Analysis",
        "desc": "Entropy · ACF · stationarity tests",
        "tags": ["entropy", "ACF", "stationarity"],
        "active": False,
    },
    {
        "num": "03",
        "title": "Feature Engineering",
        "desc": "Lag, rolling-window, and calendar features; feature matrix inspection",
        "tags": ["features", "lags", "windows", "calendar"],
        "active": True,
    },
    {
        "num": "04",
        "title": "Time Series Differencing",
        "desc": "Stationarity · first-order and seasonal differencing · inversion",
        "tags": ["differencing", "stationarity"],
        "active": False,
    },
    {
        "num": "05",
        "title": "ML Point Forecasting",
        "desc": "CatBoost · XGBoost · LightGBM · model comparison",
        "tags": ["catboost", "xgboost", "lightgbm"],
        "active": False,
    },
]
track_name = "Beginner Track"
footer = 'Next: handle non-stationarity in <span style="color:#107591;font-weight:600;">04  -  Time Series Differencing</span>, then build your first multi-model comparison in <span style="color:#107591;font-weight:600;">05  -  ML Point Forecasting</span>.'


def _b(t, bg, fg):
    return f'<span style="display:inline-block;background:{bg};color:{fg};font-size:10px;font-weight:600;padding:2px 7px;border-radius:10px;margin:2px 2px 0 0;">{t}</span>'


ch = ""
for i, s in enumerate(steps):
    a = s["active"]
    cb = _TEAL if a else _WHITE
    cbo = _TEAL if a else "#d1ecf1"
    nb = _TEAL_MID if a else _TEAL_LIGHT
    nf = _WHITE if a else _TEAL
    tf = _WHITE if a else _TEXT_DARK
    df = "#cce8ef" if a else _TEXT_MUTED
    bb = "#0d5f75" if a else _TEAL_BEST
    bf = "#b8e4ed" if a else _TEAL
    yh = (
        f'<span style="float:right;background:{_TEAL_MID};color:{_WHITE};font-size:10px;font-weight:700;padding:2px 10px;border-radius:12px;">★ you are here</span>'
        if a
        else ""
    )
    bdg = "".join(_b(t, bb, bf) for t in s["tags"])
    ch += f'<div style="background:{cb};border:2px solid {cbo};border-radius:12px;padding:16px 20px;display:flex;align-items:flex-start;gap:16px;box-shadow:{"0 4px 14px rgba(16,117,145,.25)" if a else "0 1px 4px rgba(0,0,0,.06)"};"><div style="min-width:44px;height:44px;background:{nb};color:{nf};border-radius:50%;display:flex;align-items:center;justify-content:center;font-size:15px;font-weight:800;flex-shrink:0;">{s["num"]}</div><div style="flex:1;"><div style="font-size:15px;font-weight:700;color:{tf};margin-bottom:4px;">{s["title"]}{yh}</div><div style="font-size:12.5px;color:{df};margin-bottom:8px;line-height:1.5;">{s["desc"]}</div><div>{bdg}</div></div></div>'
    if i < len(steps) - 1:
        ch += f'<div style="display:flex;justify-content:center;height:32px;"><svg width="24" height="32" viewBox="0 0 24 32" fill="none"><line x1="12" y1="0" x2="12" y2="24" stroke="{_TEAL_MID}" stroke-width="2" stroke-dasharray="4 3"/><polygon points="6,20 18,20 12,30" fill="{_TEAL_MID}"/></svg></div>'

HTML(
    f'<div style="font-family:Inter,\'Segoe UI\',sans-serif;max-width:640px;margin:8px 0;"><div style="background:linear-gradient(135deg,{_TEAL} 0%,{_TEAL_MID} 100%);border-radius:12px 12px 0 0;padding:14px 20px;display:flex;align-items:center;gap:10px;"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="{_WHITE}" stroke-width="2"><path d="M12 2L2 7l10 5 10-5-10-5z"/><path d="M2 17l10 5 10-5"/><path d="M2 12l10 5 10-5"/></svg><span style="color:{_WHITE};font-size:14px;font-weight:700;">Twiga Learning Path  -  {track_name}</span></div><div style="border:2px solid {_TEAL_LIGHT};border-top:none;border-radius:0 0 12px 12px;padding:20px 20px 16px;background:#f9fdfe;display:flex;flex-direction:column;">{ch}<div style="margin-top:16px;font-size:11.5px;color:{_TEXT_MUTED};text-align:center;border-top:1px solid {_TEAL_LIGHT};padding-top:12px;">{footer}</div></div></div>'
)
Twiga Learning Path - Beginner Track

  01  Getting Started
      Load data · configure pipeline · train LightGBM · evaluate
      Tags: data, config, train

  02  Forecastability Analysis
      Entropy · ACF · stationarity tests
      Tags: entropy, ACF, stationarity

  03  Feature Engineering (★ you are here)
      Lag, rolling-window, and calendar features; feature matrix inspection
      Tags: features, lags, windows, calendar

  04  Time Series Differencing
      Stationarity · first-order and seasonal differencing · inversion
      Tags: differencing, stationarity

  05  ML Point Forecasting
      CatBoost · XGBoost · LightGBM · model comparison
      Tags: catboost, xgboost, lightgbm

Next: handle non-stationarity in 04 - Time Series Differencing, then build your first multi-model comparison in 05 - ML Point Forecasting.