Feature Engineering#
What you’ll build
A fully-engineered feature matrix from the MLVS-PT net-load signal - including lag features, rolling-window statistics, cyclic calendar encodings, and Fourier terms - wired into a DataPipelineConfig that any Twiga forecaster can consume directly.
Prerequisites
Basic Python (lists, dicts, imports)
NB01 - Getting Started
NB02 - Forecastability Analysis (recommended - provides the parameter rationale)
Learning objectives
By the end of this notebook you will be able to:
Add cyclic calendar signals (hour, weekday, month, day/night) using TemporalFeatureTransformer
Build lag features and rolling-window statistics using AutoregressTransformer
Rank and select the most informative features using select_top_features
Encode cyclic variables with Fourier (sin/cos) terms to avoid discontinuities
Wire all feature choices into a DataPipelineConfig and pass it to DataPipeline
The five-step workflow
Raw data → Temporal features → AR features → Feature selection → DataPipelineConfig
(parquet) (calendar signals) (lags, windows) (rank & filter) (pipeline-ready)
Each section in this notebook maps to one step in the pipeline - by the end you will have a complete, reproducible feature-engineering configuration.
1. Setup#
We import four groups of libraries:
warnings: suppresses noisy deprecation messages so output stays readable.
great_tables / lets_plot: render styled tables and interactive plots inside the notebook.
numpy / pandas / scikit-learn: array maths, tabular data manipulation, and the RobustScaler used later for feature scaling.
Twiga: configure() sets up logging and get_logger() gives us a labelled logger for clean output; the feature transformers TemporalFeatureTransformer and AutoregressTransformer add calendar features and lag/rolling-window columns; the plot helpers plot_timeseries and plot_metrics_bar are thin wrappers around LetsPlot tuned for time series work.
import warnings
warnings.filterwarnings("ignore")
from great_tables import GT
from lets_plot import LetsPlot, gggrid, ggsize
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
LetsPlot.setup_html()
from twiga.core.plot import (
    plot_metrics_bar,
    plot_timeseries,
)
from twiga.core.plot.gt import twiga_gt
from twiga.core.utils import configure, get_logger
configure()
log = get_logger("tutorials")
# Load dataset — only the columns we need
COLUMNS = ["timestamp", "NetLoad(kW)", "Ghi", "Temperature"]
raw = pd.read_parquet("../data/MLVS-PT.parquet", columns=COLUMNS)
raw["timestamp"] = pd.to_datetime(raw["timestamp"])
# Filter to the study period (exclusive upper bound keeps all of 31 Dec 2020)
data = raw[(raw["timestamp"] >= "2019-01-01") & (raw["timestamp"] < "2021-01-01")].copy()
data = data.reset_index(drop=True)
log.info(f"Shape : {data.shape}")
log.info(f"Period: {data['timestamp'].min()} -> {data['timestamp'].max()}")
twiga_gt(GT(data.head()))
2. Temporal Features#
Key concept - Calendar features
A time series model sees numbers, not timestamps. Calendar features translate the human notion of time (“it’s 8 am on a Monday”) into numbers a model can use as inputs. The trick is to encode cyclic variables (hour 23 is close to hour 0, not far from it) using sin/cos pairs so the model sees the circular structure rather than a linear scale.
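To make the wrap-around concrete, here is a quick standalone check (plain numpy, nothing Twiga-specific). On raw integers, hour 23 looks maximally far from hour 0; on the unit circle the two are near neighbours:
import numpy as np

def encode_hour(h: int, period: int = 24) -> np.ndarray:
    """Project an hour onto the unit circle as a (sin, cos) pair."""
    angle = 2 * np.pi * h / period
    return np.array([np.sin(angle), np.cos(angle)])

print(abs(23 - 0))                                       # 23 (raw integers: far apart)
print(np.linalg.norm(encode_hour(23) - encode_hour(0)))  # ~0.26 (circle: adjacent)
print(np.linalg.norm(encode_hour(12) - encode_hour(0)))  # 2.0 (circle: opposite side)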
TemporalFeatureTransformer adds calendar signals to a DataFrame.
It adds four calendar features:
| Feature | Type | Description |
|---|---|---|
| `hour` | trigonometric (sin/cos) | Hour of day (0 - 23) |
| `wday` | trigonometric (sin/cos) | Day of week (0 - 6) |
| `month` | trigonometric (sin/cos) | Month of year (1 - 12) |
| `day_night` | binary | 1 = daytime, 0 = night |
day_night and solar geometry: Twiga uses the astral library to compute accurate sunrise/sunset times per day for the specified location.
Latitude and longitude are therefore required whenever day_night is listed in calendar_features. The Madeira Island coordinates below correspond to the MLVS-PT measurement site.
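As an illustration of what astral computes (a standalone sketch, not Twiga's internal code), here are the sunrise and sunset times for the MLVS-PT coordinates on one summer day; samples falling between the two instants are the ones flagged day_night = 1:
from datetime import date
from astral import LocationInfo
from astral.sun import sun

site = LocationInfo(latitude=32.371666, longitude=-16.274998)
solar = sun(site.observer, date=date(2020, 6, 15))
print(solar["sunrise"], solar["sunset"])  # timezone-aware UTC datetimes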
from great_tables import GT, md
import pandas as pd
from twiga.core.plot.gt import twiga_gt
temporal_ref = pd.DataFrame(
{
"Feature": ["`hour`", "`wday`", "`month`", "`day_night`"],
"Type": ["trigonometric (sin/cos)", "trigonometric (sin/cos)", "trigonometric (sin/cos)", "binary"],
"Description": ["Hour of day (0–23)", "Day of week (0–6)", "Month of year (1–12)", "1 = daytime, 0 = night"],
"Why it matters": [
"Captures the daily demand cycle — morning peak, midday lull, evening peak",
"Captures weekend vs. weekday demand differences",
"Captures seasonal heating/cooling patterns across the year",
"Direct proxy for solar generation (GHI ≈ 0 at night)",
],
}
)
twiga_gt(
GT(temporal_ref)
.tab_header(
title=md("**Calendar Features — Reference Guide**"),
subtitle="All cyclic features are encoded as sin/cos pairs to preserve circular structure",
)
.cols_label(
Feature=md("**Feature**"),
Type=md("**Type**"),
Description=md("**Description**"),
**{"Why it matters": md("**Why it matters**")},
)
.tab_source_note("Twiga Forecast · NB03 — Feature Engineering"),
n_rows=len(temporal_ref),
)
Key concept - Temporal features
Raw timestamps carry rich information, but a model cannot interpret a datetime object directly. Temporal features extract the useful dimensions:
Calendar features (hour, weekday, month) - tell the model where in the daily or weekly cycle a sample falls. Net load is typically higher on weekday mornings than Sunday nights; a model with no hour feature cannot learn this.
Day/night flag - Madeira Island receives meaningful solar irradiance only during daylight. A binary day/night column lets even a simple tree model split on light vs. dark without needing to reason about solar angles.
Solar-angle features - if you pass latitude and longitude, TemporalFeatureTransformer computes the exact solar elevation angle for each timestamp. This is more precise than a fixed day/night cutoff and captures seasonal variation in sunrise/sunset times.
All of these are computed once during fit_transform() and then reproduced identically at prediction time - ensuring train/test consistency.
from twiga.core.data.temporal import TemporalFeatureTransformer
temporal = TemporalFeatureTransformer(
latitude=32.371666,
longitude=-16.274998,
calendar_features=["hour", "day_night", "wday", "month"],
)
data_with_temporal = temporal.fit_transform(data.copy())
# Which columns were added?
original_cols = set(data.columns)
new_cols = [c for c in data_with_temporal.columns if c not in original_cols]
log.info("Original columns : %s", list(original_cols))
log.info("New columns added: %s", new_cols)
# Inspect the first few rows of the new temporal columns
data_with_temporal[["timestamp", "NetLoad(kW)"] + new_cols].head(8)
# Visualise day_night assignment over a single week
one_week = data_with_temporal[
(data_with_temporal["timestamp"] >= "2020-06-15") & (data_with_temporal["timestamp"] < "2020-06-22")
].copy()
p = plot_timeseries(
one_week,
y_cols=["NetLoad(kW)"],
date_col="timestamp",
band_col="day_night",
band_labels={0: "Night", 1: "Day"},
title="NetLoad(kW) with day/night context — one week (Jun 2020)",
y_label="NetLoad (kW)",
x_label="Date",
)
p
3. Autoregressive Features#
Key concept - Lag features
A lag feature is simply the value of the target variable from some number of steps in the past. If electricity demand follows a daily pattern, then “what was demand exactly 24 hours ago?” is an extremely useful input to predict “what will demand be now?”. Lag features let a machine-learning model exploit this temporal memory without you having to design a recurrence mechanism from scratch.
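Stripped of the library machinery, a lag feature is a single pandas shift. A minimal sketch, assuming 30-min data (48 steps per day):
import pandas as pd

s = pd.Series(range(100), name="load")  # stand-in for the net-load series
lag_1day = s.shift(48)                  # value from 48 steps (24 h) earlier
print(s.iloc[50], lag_1day.iloc[50])    # 50 2.0: row 50 sees the value from row 2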
Key concept - Rolling windows
A rolling-window feature summarises recent history into a single number - the mean, standard deviation, or another aggregate over a sliding window. Where a lag feature answers “what was the value at time t−k?”, a rolling mean answers “what was the typical value over the last k steps?”. Rolling statistics smooth out noise and capture trend shifts that individual lags might miss.
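And a rolling-window feature is a shifted rolling aggregate; the shift(1) keeps the window strictly in the past, so the feature at time t summarises only earlier observations:
import pandas as pd

s = pd.Series(range(100), name="load")
roll_1day = s.rolling(window=48).mean().shift(1)  # mean of the 48 steps before t
print(roll_1day.iloc[50])                         # 25.5, the mean of rows 2..49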
AutoregressTransformer creates two families of features:
Lag features: the value of the target n_samples * lag timesteps ago
Rolling features: an aggregate (mean, std, …) over a window of n_samples * window timesteps
Understanding the n_samples multiplier#
With 30-minute data there are 48 samples per day (n_samples=48).
Specifying lags=[1, 2, 7] therefore produces lags at:
| `lag` value | Actual shift | Meaning |
|---|---|---|
| 1 | 1 × 48 = 48 steps | 1 day ago |
| 2 | 2 × 48 = 96 steps | 2 days ago |
| 7 | 7 × 48 = 336 steps | 7 days ago (same weekday) |
Similarly, windows=[1, 2, 7] produces rolling windows of 48, 96, and 336 steps.
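The multiplier rule in code form (N_SAMPLES is just an illustrative constant here):
N_SAMPLES = 48  # 30-min data: 48 samples per day
print([lag * N_SAMPLES for lag in [1, 2, 7]])  # [48, 96, 336] lag shifts in steps
print([w * N_SAMPLES for w in [1, 2, 7]])      # [48, 96, 336] rolling-window lengths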
from great_tables import GT, md
import pandas as pd
from twiga.core.plot.gt import twiga_gt
lag_ref = pd.DataFrame(
{
"lag value": ["1", "2", "7"],
"Actual shift (30-min data)": ["1 × 48 = 48 steps", "2 × 48 = 96 steps", "7 × 48 = 336 steps"],
"Meaning": ["1 day ago", "2 days ago", "7 days ago (same weekday)"],
"Why useful": [
"Yesterday's pattern is the strongest single predictor for today",
"Short-term trend — captures whether demand is rising or falling",
"Same time last week — accounts for weekly seasonality",
],
}
)
twiga_gt(
GT(lag_ref)
.tab_header(
title=md("**Lag Multiplier Reference — 30-min data, `n_samples=48`**"),
subtitle="Each lag value is multiplied by n_samples to get the actual timestep shift",
)
.cols_label(
**{
"lag value": md("**`lag` value**"),
"Actual shift (30-min data)": md("**Actual shift**"),
"Meaning": md("**Meaning**"),
"Why useful": md("**Why useful**"),
}
)
.tab_source_note("Twiga Forecast · NB03 — Feature Engineering"),
n_rows=len(lag_ref),
)
Key concept - Lag features and rolling windows
Tree-based models (LightGBM, XGBoost, CatBoost) cannot process sequences - they see one flat feature vector per sample. To give them access to the past, we engineer explicit historical features:
Lag features - copy the target value from n_samples × lag steps ago. For 30-min data, lag=1 means 1 day ago (48 steps), lag=7 means 1 week ago (336 steps). The forecastability analysis in NB02 showed strong ACF peaks at lags 48 and 336 - those are the lags worth including.
Rolling-window statistics - summarise a window of recent values into a single number (mean, std, min, max). A rolling mean smooths short-term noise; a rolling std captures volatility. Window size should be at least as large as the dominant seasonal period.
Both families are aligned so that at time t the model only sees values from before time t - there is no look-ahead leakage. AutoregressTransformer handles this alignment automatically.
from twiga.core.data import AutoregressTransformer
auto_res = AutoregressTransformer(
n_samples=48,
lags=[1, 2, 7], # 1 day, 2 days, 7 days ago
windows=[1, 2, 7], # rolling over 1, 2, 7 days
window_funcs=["mean", "std"],
value_column="NetLoad(kW)",
)
data_with_ar = auto_res.fit_transform(data.copy())
log.info(f"Rows before: {len(data):,} | Rows after: {len(data_with_ar):,}")
log.info(f"(dropped {len(data) - len(data_with_ar):,} rows to remove NaN warm-up period)")
# Inspect the lag columns generated
lag_cols = [c for c in data_with_ar.columns if "lag" in c]
rolling_cols = [c for c in data_with_ar.columns if "rolling" in c]
log.info("Lag columns : %s", lag_cols)
log.info("Rolling columns: %s", rolling_cols)
# Show values for a few rows to verify alignment
data_with_ar[["timestamp", "NetLoad(kW)"] + lag_cols + rolling_cols[:4]].head(6)
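You can also check the no-leakage alignment yourself by comparing a generated lag column against a manual shift of the raw target. The column chosen below is an assumption; substitute whichever name in lag_cols corresponds to the 48-step (1-day) lag in your output:
# Assumes lag_cols[0] is the 1-day lag (48 steps); adjust if your naming differs
manual = data.set_index("timestamp")["NetLoad(kW)"].shift(48)
generated = data_with_ar.set_index("timestamp")[lag_cols[0]]
match = manual.reindex(generated.index).eq(generated).all()
log.info("1-day lag matches a manual 48-step shift: %s", match)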
from twiga.core.data.relevance import AssociationAnalyzer
# Rank each candidate feature against the target with several association metrics
feature_cols = ["Ghi", "Temperature", "hour", "day_night"] + lag_cols + rolling_cols
# Keep only columns that actually exist after transformation
feature_cols = [c for c in feature_cols if c in data_with_ar.columns]
plots = []
for method in ["pearson", "spearman", "kendall", "xicor", "pps", "mi", "anova"]:
    assoc_df = AssociationAnalyzer.compute(
        data=data_with_ar,
        variable_cols=feature_cols,
        target_col="NetLoad(kW)",
        method=method,
    )
    p = plot_metrics_bar(
        assoc_df,
        metric_col="score",  # the 'score' column returned by AssociationAnalyzer
        model_col="feature",  # the 'feature' column holds the lag/rolling names
        lower_is_better=False,
        title=f"{method.upper()}",
        x_label="Score",
        horizontal=True,
        font_size=10,
    )
    plots.append(p)
gggrid(plots, ncol=2) + ggsize(1220, 1000)
# Quick correlation bar chart — lag and rolling features vs target
corr_data = data_with_ar[["NetLoad(kW)"] + lag_cols + rolling_cols].dropna()
corr_vals = corr_data.corr()["NetLoad(kW)"].drop("NetLoad(kW)").reset_index()
corr_vals.columns = ["Feature", "Correlation"]
p = plot_metrics_bar(
corr_vals,
metric_col="Correlation",
    model_col="Feature",
lower_is_better=False,
title="Pearson correlation of AR features with NetLoad(kW)",
x_label="Correlation",
horizontal=True,
)
p
4. Feature Selection#
Key concept - Feature matrix
After applying temporal and autoregressive transformers, we have a feature matrix - a table where each row is one timestep and each column is one input signal the model will see. Feature matrices can easily have 50 - 200 columns, many of which are redundant or noisy. Feature selection trims this down to the most informative subset, reducing overfitting risk and training time.
select_top_features ranks candidate features using a multi-metric ensemble:
Pearson correlation (absolute value)
ANOVA F-score (linear separability)
Mutual information (non-linear dependency)
Random Forest importance (tree-based, captures interactions)
Individual ranks are aggregated via Borda count and the top-k features are returned.
This multi-metric approach is more robust than any single score, especially for non-linear time series patterns.
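To make the rank aggregation concrete, here is a minimal standalone sketch of a Borda-style rank sum on toy scores (not Twiga's internal implementation):
import pandas as pd

scores = pd.DataFrame(
    {
        "pearson": {"lag_48": 0.91, "lag_336": 0.84, "rolling_48_mean": 0.65},
        "mutual_info": {"lag_48": 1.20, "lag_336": 1.10, "rolling_48_mean": 0.70},
    }
)
# Rank features within each metric (1 = best), then sum: lower total = better overall
borda = scores.rank(ascending=False).sum(axis=1).sort_values()
print(borda.index.tolist()[:2])  # ['lag_48', 'lag_336'] would be the top-2 here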
from twiga.core.data import select_top_features
# Drop rows with NaN before selection
ar_clean = data_with_ar.dropna().copy()
lag_cols = [c for c in ar_clean.columns if "lag" in c]
rolling_cols = [c for c in ar_clean.columns if "rolling" in c]
top_lags = select_top_features(
data=ar_clean,
features=lag_cols,
target="NetLoad(kW)",
top_k=2,
)
top_rolling = select_top_features(
data=ar_clean,
features=rolling_cols,
target="NetLoad(kW)",
top_k=2,
)
log.info("Top lag features : %s", top_lags)
log.info("Top rolling features: %s", top_rolling)
# Visualise the selected lag features side-by-side with the target
sample = ar_clean[(ar_clean["timestamp"] >= "2019-03-01") & (ar_clean["timestamp"] < "2019-03-08")].copy()
p = plot_timeseries(
sample,
y_cols=["NetLoad(kW)"] + top_lags,
date_col="timestamp",
title="Top selected lag features vs target (one week)",
y_label="kW",
x_label="Date",
series_line_size=0.8,
fig_size=(960, 300),
)
p
5. Fourier Features#
Calendar variables like hour are cyclic: hour 23 is closer to hour 0 than to hour 12, but a plain integer does not express this.
Fourier encoding projects each value onto a unit circle; for hour with period 24:
hour_sin = sin(2π × hour / 24)
hour_cos = cos(2π × hour / 24)
A third convenience column hour_cosin = hour_sin + hour_cos is also added.
TemporalFeatureTransformer applies this automatically for trigonometric features (hour, wday, month); add_fourier_features lets you do it manually for any column.
Key concept - Fourier encoding for cyclic variables
Calendar variables like hour (0 - 23) and month (1 - 12) are cyclic: hour 23 is just one step away from hour 0, but the integer 23 is far from 0 in Euclidean space. A tree model or neural network that receives raw integers will not naturally understand this wrap-around.
Fourier encoding projects each value onto a unit circle using sine and cosine:
sin_value = sin(2π × value / period)
cos_value = cos(2π × value / period)
For hour with period 24, hours 0 and 23 land at nearly the same point on the circle. The model receives two continuous numbers (sin, cos) per cyclic feature instead of a raw integer - the circular distance is now correctly represented.
Use Fourier encoding whenever a calendar variable has natural wrap-around: hour (period 24), weekday (period 7), month (period 12), day-of-year (period 365).
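Hand-rolled in numpy with the formulas above (a sketch of what the encoding computes, not the helper's actual source):
import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
print(hour_sin[:4].round(3))  # [0.    0.259 0.5   0.707]
print(hour_cos[:4].round(3))  # [1.    0.966 0.866 0.707]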
from twiga.core.data import add_fourier_features
# We start from data_with_temporal which already has the 'hour' integer column
data_fourier = add_fourier_features(
data_with_temporal.copy(),
calendar_variables=["hour"],
periods=[24],
)
fourier_cols = ["hour", "hour_sin", "hour_cos", "hour_cosin"]
log.info("Fourier columns added:")
data_fourier[fourier_cols].drop_duplicates(subset="hour").sort_values("hour").head(6)
# Visualise the sin/cos encoding of hour over one full day
one_day = data_fourier[fourier_cols].drop_duplicates(subset="hour").sort_values("hour").copy()
one_day["hour_step"] = one_day["hour"].astype(int)
p = plot_timeseries(
one_day,
y_cols=["hour_sin", "hour_cos", "hour_cosin"],
date_col="hour_step",
title="Fourier encoding of 'hour' (period = 24)",
y_label="Fourier value",
x_label="Hour of day",
fig_size=(300, 250),
series_line_size=0.8,
)
p
6. Connecting to DataPipelineConfig#
DataPipelineConfig is a Pydantic model that acts as the single source of truth for everything the pipeline needs to know.
The selected features from Section 4 plug directly into historical_features.
Key concept - The feature matrix
DataPipeline transforms a raw DataFrame into a pair of 3-D NumPy arrays:
X - shape (n_samples, lookback_window_size, n_features) - the input tensor fed to the model. Each sample is a window of lookback_window_size timesteps, each with n_features values (target history + calendar + exogenous + lag + rolling columns).
y - shape (n_samples, forecast_horizon, n_targets) - the corresponding future target values the model must predict.
This 3-D format is the universal interface between feature engineering and all Twiga models (ML and NN alike). DataPipeline.fit() learns the scalers from training data; DataPipeline.transform() applies them to any split without refitting - preventing data leakage.
from twiga.core.config import DataPipelineConfig
# Two ways to specify autoregressive features in DataPipelineConfig:
#
# Option A — let the pipeline compute them (standard path with TwigaForecaster):
# lags=[1,7,48], windows=[1,7], window_funcs=["mean"]
# → AutoregressTransformer runs internally; pass raw DataFrames to fit/predict
#
# Option B — pre-compute externally and name the columns:
# historical_features=["lag_48", "lag_336", ...]
# → you must pass the enriched DataFrame (with those columns) everywhere
#
# For TwigaForecaster (the standard path) always use Option A.
data_config = DataPipelineConfig(
target_feature="NetLoad(kW)",
period="30min",
latitude=32.371666,
longitude=-16.274998,
calendar_features=["hour", "day_night"],
exogenous_features=["Ghi"],
lags=[1, 7, 48], # pipeline computes lag_48, lag_336, lag_2304
windows=[1, 7], # rolling windows of 48 and 336 steps
window_funcs=["mean"],
forecast_horizon=48,
lookback_window_size=96,
input_scaler=RobustScaler(),
target_scaler=RobustScaler(),
)
log.info("\n%s", data_config.model_dump(exclude={"input_scaler", "target_scaler"}))
# Summarise what the config holds
log.info("=== DataPipelineConfig summary ===")
log.info(f" Target : {data_config.target_feature}")
log.info(f" Period : {data_config.period}")
log.info(f" Horizon : {data_config.forecast_horizon} steps")
log.info(f" Lookback : {data_config.lookback_window_size} steps")
log.info(f" Calendar feats: {data_config.calendar_features}")
log.info(f" Exogenous : {data_config.exogenous_features}")
log.info(f" Historical : {data_config.historical_features}")
log.info(f" Input scaler : {type(data_config.input_scaler).__name__}")
log.info(f" Target scaler : {type(data_config.target_scaler).__name__}")
Validation at construction time#
Because DataPipelineConfig is a Pydantic model, invalid values are rejected immediately. Try uncommenting the cell below to see:
# Uncomment to see Pydantic validation in action
# from twiga.core.config import DataPipelineConfig
# bad_config = DataPipelineConfig(
# target_feature="NetLoad(kW)",
# period="not-a-period", # <-- invalid
# forecast_horizon=48,
# lookback_window_size=96,
# )
7. How DataPipeline transforms internally#
DataPipeline is a scikit-learn TransformerMixin that builds a processing chain from a DataPipelineConfig.
Internally it:
1. Runs TemporalFeatureTransformer (if calendar_features are set)
2. Runs AutoregressTransformer (if lags or windows are set)
3. Applies input_scaler to numerical features and target_scaler to the target
4. Slices the scaled data into overlapping (lookback, horizon) sequence windows
The output X has shape (n_samples, lookback_window_size, n_features) and
y has shape (n_samples, forecast_horizon, n_targets) - ready to feed directly into a neural network.
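The windowing in step 4 is conceptually a sliding slice over the scaled matrix. A simplified, univariate standalone sketch of the idea (not the pipeline's actual implementation, and with no scaling):
import numpy as np

def make_windows(series: np.ndarray, lookback: int, horizon: int):
    """Slice a 1-D series into (X, y) pairs: X holds the past, y the future."""
    X, y = [], []
    for t in range(lookback, len(series) - horizon + 1):
        X.append(series[t - lookback : t])  # lookback window ending just before t
        y.append(series[t : t + horizon])   # horizon window starting at t
    return np.array(X)[..., None], np.array(y)[..., None]  # add a trailing feature axis

X_demo, y_demo = make_windows(np.arange(200, dtype=float), lookback=96, horizon=48)
print(X_demo.shape, y_demo.shape)  # (57, 96, 1) (57, 48, 1)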
from twiga.core.data import DataPipeline
# DataPipeline takes the same individual parameters as DataPipelineConfig fields.
# Using lags/windows/window_funcs means the pipeline creates AR features internally
# — pass the raw DataFrame (no pre-computation needed).
pipeline = DataPipeline(
target_feature=data_config.target_feature,
period=data_config.period,
lookback_window_size=data_config.lookback_window_size,
forecast_horizon=data_config.forecast_horizon,
latitude=data_config.latitude,
longitude=data_config.longitude,
calendar_features=data_config.calendar_features,
exogenous_features=data_config.exogenous_features,
lags=data_config.lags,
windows=data_config.windows,
window_funcs=data_config.window_funcs,
input_scaler=data_config.input_scaler,
target_scaler=data_config.target_scaler,
)
# Pass raw train_df — pipeline handles all feature computation internally
train_df = data[data["timestamp"] < "2020-01-01"].copy()
pipeline.fit(train_df)
log.info("Pipeline fitted successfully.")
X, y = pipeline.transform(train_df)
log.info("X shape: %s", X.shape) # (n_samples, lookback_window_size, n_features)
log.info("y shape: %s", y.shape) # (n_samples, forecast_horizon, n_targets)
log.info("")
log.info(f"Each sample provides {X.shape[1]} lookback steps and {y.shape[1]} forecast steps.")
log.info(f"The model sees {X.shape[2]} input features per timestep.")
# Visualise a single training window
SAMPLE_IDX = 100 # arbitrary sample
lookback_vals = X[SAMPLE_IDX, :, 0].tolist()
horizon_vals = y[SAMPLE_IDX, :, 0].tolist()
n_lookback = X.shape[1]
n_horizon = y.shape[1]
window_df = pd.DataFrame(
{
"step": list(range(n_lookback)) + list(range(n_lookback, n_lookback + n_horizon)),
"lookback": lookback_vals + [None] * n_horizon,
"horizon": [None] * n_lookback + horizon_vals,
}
)
p = plot_timeseries(
window_df,
y_cols=["lookback", "horizon"],
date_col="step",
title=f"Training window #{SAMPLE_IDX}: lookback ({n_lookback}) + horizon ({n_horizon})",
y_label="Scaled value",
x_label="Timestep",
fig_size=(900, 300),
)
p
8. Summary#
In this notebook you learned how to:
| Step | Tool | Output |
|---|---|---|
| Temporal features | `TemporalFeatureTransformer` | hour_sin/cos, wday_sin/cos, month_sin/cos, day_night |
| Autoregressive features | `AutoregressTransformer` | lag_NNN, rolling_NNN_mean/std |
| Feature selection | `select_top_features` | top-k ranked feature names |
| Fourier encoding | `add_fourier_features` | sin/cos columns for any cyclic variable |
| Pipeline wiring | `DataPipelineConfig` + `DataPipeline` | 3-D arrays (n_samples, lookback, features) ready for a model |
from great_tables import GT, md
from twiga.core.plot.gt import twiga_gt
summary_df = pd.DataFrame(
{
"Step": [
"Temporal features",
"Autoregressive features",
"Feature selection",
"Fourier encoding",
"Pipeline wiring",
],
"Tool": [
"`TemporalFeatureTransformer`",
"`AutoregressTransformer`",
"`select_top_features`",
"`add_fourier_features`",
"`DataPipelineConfig` + `DataPipeline`",
],
"Output": [
"hour_sin/cos, wday_sin/cos, month_sin/cos, day_night",
"lag_NNN, rolling_NNN_mean/std",
"top-k ranked feature names",
"sin/cos columns for any cyclic variable",
"3-D arrays (n_samples, lookback, features) ready for a model",
],
"Config parameter": [
"`calendar_features`, `latitude`, `longitude`",
"`lags`, `windows`, `window_funcs`",
"`historical_features`",
"Apply before passing to pipeline",
"`DataPipelineConfig` fields",
],
}
)
twiga_gt(
GT(summary_df)
.tab_header(
title=md("**NB03 — Feature Engineering Summary**"),
subtitle="Five steps from raw DataFrame to model-ready arrays",
)
.cols_label(
Step=md("**Step**"),
Tool=md("**Tool**"),
Output=md("**Output**"),
**{"Config parameter": md("**Config parameter**")},
)
.tab_source_note("Twiga Forecast · NB03 — Feature Engineering"),
n_rows=len(summary_df),
)
9. EDA: Visual Feature Inspection#
Before passing engineered features to a model it is worth spending a few minutes on exploratory visualisation. The three helpers below give fast answers to common pre-modelling questions:
| Question | Tool |
|---|---|
| Does the raw signal look as expected? | `line_plot` |
| Do my covariates correlate with the target? | `scatter_plot` |
| Are lag features jointly informative? | `scatter_matrix` |
All three accept the same Twiga theme parameters (font_size, grid, legend_pos, …)
as every other plotting utility in the library.
from twiga.core.plot import line_plot, scatter_matrix, scatter_plot
line_plot: one-week load profile#
line_plot accepts a plain Python list or 1-D NumPy array - no DataFrame required.
It is the fastest way to sanity-check a raw signal, inspect seasonality, or verify
that a transformation (differencing, scaling) behaved as expected.
When to use it: quick visual check on any univariate sequence before or after feature engineering.
# line_plot takes a plain 1-D sequence — no DataFrame needed
sample_week = data_with_temporal[
(data_with_temporal["timestamp"] >= "2019-06-01") & (data_with_temporal["timestamp"] < "2019-06-08")
]["NetLoad(kW)"].values
p = line_plot(
x=None,
y=sample_week,
title="Net Load — one week (Jun 2019)",
y_label="Net Load (kW)",
x_label="30-min step",
fig_size=(900, 300),
)
p
scatter_plot: GHI vs Net Load by day/night#
scatter_plot renders a 2-D scatter with optional colour grouping and a LOESS
smoothing curve. It is ideal for assessing whether a covariate has a meaningful
(possibly non-linear) relationship with the target before committing to feature selection.
The colour-grouping parameter (color_col in the call below) expects a string column. Numeric flags like day_night should be mapped to labels first - see the .assign(group=…) call below.
When to use it: check covariate - target correlations and detect regime differences (e.g. day vs night, weekday vs weekend) that a model should learn.
scatter_df = (
data_with_temporal[["Ghi", "NetLoad(kW)", "day_night"]]
.dropna()
.rename(columns={"Ghi": "cycle", "NetLoad(kW)": "value", "day_night": "group"})
.assign(group=lambda df: df["group"].map({0: "Night", 1: "Day"}))
)
p = scatter_plot(
scatter_df,
x_col="cycle",
y_col="value",
color_col="group",
title="GHI vs Net Load — coloured by day / night",
)
p
scatter_matrix: lag features vs target#
scatter_matrix renders a pair-plot grid across the selected feature columns and the
target. Each off-diagonal cell is a scatter; the diagonal shows the marginal distribution.
Colour groups (e.g. day / night) make regime-specific patterns immediately visible.
Use this after AutoregressTransformer to confirm that lag features carry signal and
to spot multicollinearity between lags before training.
When to use it: validate that autoregressive features are jointly informative and identify any redundant lags you might want to drop.
matrix_df = (
data_with_ar.join(data_with_temporal[["day_night"]], how="left")
.dropna()
.assign(group=lambda df: df["day_night"].map({0: "Night", 1: "Day"}))
)
lag_cols = [c for c in matrix_df.columns if "_lag_" in c][:3]
p = scatter_matrix(
matrix_df,
variables=lag_cols,
targets=["NetLoad(kW)"],
hue_col="group",
n_sample=1000,
title="Lag features vs Net Load",
sort_by_variance=True,
diag="density",
fig_size=(500, 500),
)
p
Wrapping up#
What you did
Added cyclic calendar signals (hour, weekday, month, day/night) using TemporalFeatureTransformer
Built lag features and rolling-window statistics using AutoregressTransformer
Ranked and selected the most informative features using select_top_features
Encoded cyclic variables with Fourier (sin/cos) terms to eliminate boundary discontinuities
Wired all feature choices into a DataPipelineConfig and verified the pipeline output shape
Key takeaways for beginners
Calendar features must be encoded cyclically - plain integer hours (0 - 23) tell the model that hour 23 and hour 0 are far apart, which is wrong. Sin/cos encoding places them correctly on a circle.
Lag = yesterday at this time - a lag-1 feature on 30-min data is 48 steps back (24 hours). Always multiply your "intuitive" lag by n_samples.
Rolling means smooth out noise - individual lag values can be noisy; rolling means give the model a stable picture of recent average demand.
Feature selection prevents overfitting - more features is not always better. select_top_features uses four complementary metrics to keep only columns that genuinely predict the target.
DataPipelineConfig is the single source of truth - all feature choices live in one Pydantic config object, making experiments reproducible and easy to share.
What’s next?#
NB04 - Time Series Differencing shows how to handle non-stationarity (first-order and seasonal differencing, plus inversion) before modelling. After that, NB05 - ML Point Forecasting shows how to pass DataPipelineConfig into TwigaForecaster, select an ML model (LightGBM, XGBoost, CatBoost, or Linear Regression), and evaluate point predictions using the built-in metrics module.
# ruff: noqa: E501, E701, E702
from IPython.display import HTML
_TEAL = "#107591"
_TEAL_MID = "#069fac"
_TEAL_LIGHT = "#e8f5f8"
_TEAL_BEST = "#d0ecf1"
_TEXT_DARK = "#2d3748"
_TEXT_MUTED = "#718096"
_WHITE = "#ffffff"
steps = [
{
"num": "01",
"title": "Getting Started",
"desc": "Load data · configure pipeline · train LightGBM · evaluate",
"tags": ["data", "config", "train"],
"active": False,
},
{
"num": "02",
"title": "Forecastability Analysis",
"desc": "Entropy · ACF · stationarity tests",
"tags": ["entropy", "ACF", "stationarity"],
"active": False,
},
{
"num": "03",
"title": "Feature Engineering",
"desc": "Lag, rolling-window, and calendar features; feature matrix inspection",
"tags": ["features", "lags", "windows", "calendar"],
"active": True,
},
{
"num": "04",
"title": "Time Series Differencing",
"desc": "Stationarity · first-order and seasonal differencing · inversion",
"tags": ["differencing", "stationarity"],
"active": False,
},
{
"num": "05",
"title": "ML Point Forecasting",
"desc": "CatBoost · XGBoost · LightGBM · model comparison",
"tags": ["catboost", "xgboost", "lightgbm"],
"active": False,
},
]
track_name = "Beginner Track"
footer = 'Next: handle non-stationarity in <span style="color:#107591;font-weight:600;">04 — Time Series Differencing</span>, then build your first multi-model comparison in <span style="color:#107591;font-weight:600;">05 — ML Point Forecasting</span>.'
def _b(t, bg, fg):
return f'<span style="display:inline-block;background:{bg};color:{fg};font-size:10px;font-weight:600;padding:2px 7px;border-radius:10px;margin:2px 2px 0 0;">{t}</span>'
ch = ""
for i, s in enumerate(steps):
a = s["active"]
cb = _TEAL if a else _WHITE
cbo = _TEAL if a else "#d1ecf1"
nb = _TEAL_MID if a else _TEAL_LIGHT
nf = _WHITE if a else _TEAL
tf = _WHITE if a else _TEXT_DARK
df = "#cce8ef" if a else _TEXT_MUTED
bb = "#0d5f75" if a else _TEAL_BEST
bf = "#b8e4ed" if a else _TEAL
yh = (
f'<span style="float:right;background:{_TEAL_MID};color:{_WHITE};font-size:10px;font-weight:700;padding:2px 10px;border-radius:12px;">★ you are here</span>'
if a
else ""
)
bdg = "".join(_b(t, bb, bf) for t in s["tags"])
ch += f'<div style="background:{cb};border:2px solid {cbo};border-radius:12px;padding:16px 20px;display:flex;align-items:flex-start;gap:16px;box-shadow:{"0 4px 14px rgba(16,117,145,.25)" if a else "0 1px 4px rgba(0,0,0,.06)"};"><div style="min-width:44px;height:44px;background:{nb};color:{nf};border-radius:50%;display:flex;align-items:center;justify-content:center;font-size:15px;font-weight:800;flex-shrink:0;">{s["num"]}</div><div style="flex:1;"><div style="font-size:15px;font-weight:700;color:{tf};margin-bottom:4px;">{s["title"]}{yh}</div><div style="font-size:12.5px;color:{df};margin-bottom:8px;line-height:1.5;">{s["desc"]}</div><div>{bdg}</div></div></div>'
if i < len(steps) - 1:
ch += f'<div style="display:flex;justify-content:center;height:32px;"><svg width="24" height="32" viewBox="0 0 24 32" fill="none"><line x1="12" y1="0" x2="12" y2="24" stroke="{_TEAL_MID}" stroke-width="2" stroke-dasharray="4 3"/><polygon points="6,20 18,20 12,30" fill="{_TEAL_MID}"/></svg></div>'
HTML(
f'<div style="font-family:Inter,\'Segoe UI\',sans-serif;max-width:640px;margin:8px 0;"><div style="background:linear-gradient(135deg,{_TEAL} 0%,{_TEAL_MID} 100%);border-radius:12px 12px 0 0;padding:14px 20px;display:flex;align-items:center;gap:10px;"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="{_WHITE}" stroke-width="2"><path d="M12 2L2 7l10 5 10-5-10-5z"/><path d="M2 17l10 5 10-5"/><path d="M2 12l10 5 10-5"/></svg><span style="color:{_WHITE};font-size:14px;font-weight:700;">Twiga Learning Path — {track_name}</span></div><div style="border:2px solid {_TEAL_LIGHT};border-top:none;border-radius:0 0 12px 12px;padding:20px 20px 16px;background:#f9fdfe;display:flex;flex-direction:column;">{ch}<div style="margin-top:16px;font-size:11.5px;color:{_TEXT_MUTED};text-align:center;border-top:1px solid {_TEAL_LIGHT};padding-top:12px;">{footer}</div></div></div>'
)