Part 4: NumPy for Data & ML

DS-MLOps Python Foundations

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook assumes you have completed Parts 1-3 (01-python-core.ipynb, 02-control-flow.ipynb, 03-python-patterns.ipynb). If you have not, start there.

Every ML library you will use (pandas, scikit-learn, PyTorch, TensorFlow) stores its data as NumPy arrays (or something modelled directly on them) and builds its fast operations on top of NumPy’s rules. This notebook teaches those rules from first principles, using the same university analytics platform running example from Parts 1-3: this time we build a small student feature matrix (study_hours, attendance_pct, prior_gpa) and use it to predict exam_score.

Topics covered

Topic	Why it matters
ndarray vs. list	Why ML libraries do not use plain Python lists
Creating arrays	Build feature matrices and synthetic data
Shape & dtype	Avoid silent shape-mismatch bugs
Indexing & slicing	Select rows, columns, and subsets fast
Boolean masking	Vectorised filtering and labelling
Aggregation & axis	Per-feature vs. per-sample statistics
Broadcasting	The rule behind every elementwise ML operation
Vectorisation	Why loops are slow and arrays are fast
Linear algebra basics	What is under the hood of a linear model
Saving & loading	Persist arrays between pipeline stages

Callout markers used throughout this notebook are explained on the book cover page.

Learning Objectives

#	Skill	Covered in
1	Explain why NumPy arrays outperform Python lists for numeric data	Sec. 1
2	Create arrays with `array`, `arange`, `zeros`, `ones`, and a modern `Generator`	Sec. 2
3	Inspect and reshape arrays using `shape`, `dtype`, `reshape`, and stacking	Sec. 3
4	Select data with slicing, fancy indexing, and boolean masks	Sec. 4, 5
5	Compute per-row / per-column statistics with `axis`	Sec. 6
6	Apply NumPy’s broadcasting rule to vectorise feature engineering	Sec. 7
7	Explain why vectorised code beats Python loops	Sec. 8
8	Use `@`/`np.dot` to express a linear model as matrix multiplication	Sec. 9
9	Save and reload arrays with `.npy` / `.npz`	Sec. 10
10	Recognise NumPy’s most common silent bugs (views, dtype, float equality)	Sec. 11

0. Meet NumPy

You have written Python for three chapters. You can loop, slice, and unpack. Now imagine you need to normalise 2,400 exam scores: subtract the mean, divide by the standard deviation, do it across three columns. A Python for loop would work. It would also be slow, verbose, and nothing like the code your colleagues expect to read.

NumPy (numpy.org) was created in 2005 by Travis Oliphant to give Python scientists the numeric performance of Fortran and C without leaving the language. The idea was simple: store numbers in a contiguous block of memory with a fixed type, and let a thin Python wrapper call heavily optimised C and Fortran routines on that block. Two decades later, nearly every numerical computing library in Python (pandas, scikit-learn, PyTorch, TensorFlow) uses NumPy arrays as its currency.

How it compares

Approach	Speed on large arrays	Readable math	When to use
Python `list` + loop	Slow (Python objects, GIL)	Verbose	Small collections, mixed types
NumPy `ndarray`	Fast (C/Fortran, contiguous)	*Concise (`a 2`)**	Numeric data of any size
PyTorch `Tensor`	Fast (optionally GPU)	Similar to NumPy	Deep learning, autodiff
JAX `Array`	Very fast (XLA, JIT, GPU/TPU)	NumPy-compatible	Research, differentiable programs
CuPy `ndarray`	GPU only	NumPy-compatible	Large-scale GPU computing

For everything up to classical ML on a laptop, NumPy is the right level of abstraction. PyTorch and JAX add complexity (device management, gradient tracking) that you do not need yet.

Already in your environment

NumPy is included in pyproject.toml. If you ever start a standalone project:

uv add numpy          # or: pip install numpy

Official docs and API reference: numpy.org/doc

1. Why NumPy? The `ndarray`

A Python list can hold anything (mixed types, nested objects), which makes it flexible but slow for numeric work: every element is a separate Python object, and arithmetic on a list means a Python-level loop.

A NumPy ndarray (“n-dimensional array”) is different: it stores one fixed dtype in a single contiguous block of memory. That uniformity lets NumPy hand the math off to compiled C/Fortran loops instead of the Python interpreter, often 10-100x faster, and with far less memory per element.

import numpy as np

study_hours = [12, 5, 18, 9, 22]

# A Python list has no element-wise arithmetic: this is string repetition, not math!
print("list  * 2 :", study_hours * 2)

hours = np.array(study_hours)
print("array * 2 :", hours * 2)  # element-wise multiplication

list  * 2 : [12, 5, 18, 9, 22, 12, 5, 18, 9, 22]
array * 2 : [24 10 36 18 44]

Memory layout: Python list vs NumPy ndarray

flowchart LR
    subgraph py ["Python list: scattered pointers"]
        P1["int obj"] & P2["float obj"] & P3["str obj"]
        L["list"] -->|ptr| P1
        L -->|ptr| P2
        L -->|ptr| P3
    end
    subgraph np ["NumPy ndarray: contiguous memory"]
        direction LR
        M1["64-bit float"] --- M2["64-bit float"] --- M3["64-bit float"]
        A["array"] --> M1
    end

    style np fill:#EBF5F0,stroke:#059669
    style py fill:#FEF2F2,stroke:#DC2626

Key Concept: ndarray = dtype + shape + contiguous memory

A NumPy array is homogeneous (one dtype for every element) and has a fixed shape. Because elements sit next to each other in memory, NumPy can vectorise operations: apply one compiled loop to the whole array instead of looping in Python.

Lists are general-purpose containers; arrays are numeric data structures. Use a list for a heterogeneous bag of objects, an array for a column of numbers.

2. Creating Arrays

The most direct way to create an array is np.array() from a Python list (or list of lists, for 2D data). For larger or synthetic data, NumPy provides dedicated creation functions so you never have to type out values by hand.

# 1D array: one feature for five students
study_hours = np.array([12, 5, 18, 9, 22])

# 2D array: rows = students, columns = features
# columns: [study_hours, attendance_pct, prior_gpa]
features = np.array(
    [
        [12, 85, 3.1],
        [5, 60, 2.4],
        [18, 95, 3.8],
        [9, 70, 2.9],
        [22, 98, 3.9],
    ]
)
print(features)
print(f"shape: {features.shape}")  # (5 students, 3 features)

[[12.  85.   3.1]
 [ 5.  60.   2.4]
 [18.  95.   3.8]
 [ 9.  70.   2.9]
 [22.  98.   3.9]]
shape: (5, 3)

For larger arrays, typing out every value is impractical. These functions build arrays from a rule instead of a literal list:

Function	Produces
`np.arange(start, stop, step)`	Evenly spaced integers/floats, like `range()`
`np.linspace(start, stop, n)`	`n` evenly spaced floats, inclusive of both ends
`np.zeros(shape)` / `np.ones(shape)`	Array filled with `0.0` / `1.0`
`np.full(shape, value)`	Array filled with a constant
`np.eye(n)`	`n x n` identity matrix

print("arange(0, 10)       :", np.arange(0, 10))
print("arange(0, 10, 2)    :", np.arange(0, 10, 2))
print("linspace(0, 1, 5)   :", np.linspace(0, 1, 5))
print("zeros((2, 3))       :\n", np.zeros((2, 3)))
print("ones(3)             :", np.ones(3))
print("full((2, 2), 7)     :\n", np.full((2, 2), 7))

arange(0, 10)       : [0 1 2 3 4 5 6 7 8 9]
arange(0, 10, 2)    : [0 2 4 6 8]
linspace(0, 1, 5)   : [0.   0.25 0.5  0.75 1.  ]
zeros((2, 3))       :
 [[0. 0. 0.]
 [0. 0. 0.]]
ones(3)             : [1. 1. 1.]
full((2, 2), 7)     :
 [[7 7]
 [7 7]]

Random Data with a `Generator`

Synthetic data and simulations need reproducible randomness. Modern NumPy (1.17+) recommends np.random.default_rng(seed) over the legacy np.random.seed(...). A Generator object is self-contained, so two generators never interfere with each other’s state (unlike the old global np.random.seed, which silently affects every call anywhere in the program).

rng = np.random.default_rng(seed=42)  # one independent, reproducible stream

print("uniform [0, 1)      :", rng.random(3))
print("normal(mean=0,std=1):", rng.normal(0, 1, size=3))
print("integers [60, 100)  :", rng.integers(60, 100, size=5))

uniform [0, 1)      : [0.77395605 0.43887844 0.85859792]
normal(mean=0,std=1): [ 0.94056472 -1.95103519 -1.30217951]
integers [60, 100)  : [89 90 88 91 80]

Pro Tip: Prefer default_rng over np.random.seed

np.random.seed(42) mutates a single global random state shared by your whole program. Any other code (or library) calling np.random.* shifts that shared state and breaks your reproducibility. rng = np.random.default_rng(42) gives you an isolated generator: pass it around explicitly, and your results stay reproducible no matter what else runs.

Activity 1 - Build a Synthetic Student Dataset

Goal: Using rng = np.random.default_rng(7), generate a feature matrix X of shape (50, 3) for 50 students with columns:

study_hours: rng.uniform(0, 25, size=50)
attendance_pct: rng.uniform(50, 100, size=50)
prior_gpa: rng.uniform(2.0, 4.0, size=50)

Combine the three 1D arrays into one (50, 3) array with np.column_stack.

X.shape  # -> (50, 3)
X[0]     # -> array([study_hours_0, attendance_pct_0, prior_gpa_0])

Hint: np.column_stack([a, b, c]) stacks 1D arrays as columns of a 2D array.

rng = np.random.default_rng(7)

study_hours = rng.uniform(0, 25, size=50)
attendance_pct = rng.uniform(50, 100, size=50)
prior_gpa = rng.uniform(2.0, 4.0, size=50)

X = ...  # TODO: combine the three arrays into one (50, 3) feature matrix

# print(f"X.shape : {X.shape}")
# print(f"X[0]    : {X[0]}")

3. Shape, Size, and dtype

Every array carries metadata you should check before trusting a computation: shape (size along each dimension), ndim (number of dimensions), size (total element count), and dtype (the single data type of every element).

X = np.array(
    [
        [12.0, 85.0, 3.1],
        [5.0, 60.0, 2.4],
        [18.0, 95.0, 3.8],
        [9.0, 70.0, 2.9],
    ]
)

print(f"shape : {X.shape}")  # (rows, columns)
print(f"ndim  : {X.ndim}")
print(f"size  : {X.size}")  # rows * columns
print(f"dtype : {X.dtype}")

shape : (4, 3)
ndim  : 2
size  : 12
dtype : float64

Common Mistake: Mixed int/float input silently upcasts

np.array([1, 2, 3.5]) produces a float64 array, not a mix of int and float. NumPy must pick one dtype for the whole array and silently widens every element to fit. This is usually harmless, but np.array([1, 2, 3], dtype=np.int32) / 2 truncating, or an unexpected int8 overflowing past 127, are the same root cause: always check .dtype when results look wrong.

Reshaping

reshape() returns the same data viewed with a different shape. It does not copy or reorder values, so the total element count (size) must stay the same. -1 tells NumPy “infer this dimension from the others”:

x = np.arange(12)
print(f"x          : {x}  shape={x.shape}")

x_grid = x.reshape(3, 4)
print(f"reshape(3,4):\n{x_grid}")

# -1 means "figure this dimension out for me"
x_col = x.reshape(-1, 1)  # turn a 1D array into a single column
print(f"reshape(-1,1) shape: {x_col.shape}")

x_flat = x_grid.flatten()  # back to 1D - always returns a COPY
print(f"flatten()   : {x_flat}")

x          : [ 0  1  2  3  4  5  6  7  8  9 10 11]  shape=(12,)
reshape(3,4):
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
reshape(-1,1) shape: (12, 1)
flatten()   : [ 0  1  2  3  4  5  6  7  8  9 10 11]

Common Mistake: reshape() returns a view, flatten() returns a copy

Mutating the result of .reshape() mutates the original array too. They share the same underlying memory. .flatten() always copies, so mutating it is safe. This distinction (view vs. copy) comes up constantly in NumPy; Sec. 11 covers it in more depth.

original = np.arange(6)
view = original.reshape(2, 3)
view[0, 0] = 99  # mutating the reshaped VIEW...

print(f"view     :\n{view}")
print(f"original : {original}")  # ...also changed the original!

view     :
[[99  1  2]
 [ 3  4  5]]
original : [99  1  2  3  4  5]

Combining Arrays: `column_stack`, `hstack`, `vstack`

Feature engineering often means assembling separate 1D feature vectors into one 2D matrix, or stacking two matrices together. Use the right function for the shape change you want:

Function	Effect
`np.column_stack([a, b, ...])`	1D arrays -> columns of a 2D array
`np.hstack([a, b])`	Join side-by-side (same number of rows)
`np.vstack([a, b])`	Stack on top of each other (same number of columns)
`np.concatenate([a, b], axis=...)`	General join along a chosen axis

gpa = np.array([3.1, 2.4, 3.8, 2.9])
attendance = np.array([85, 60, 95, 70])

# Two 1D feature vectors -> one (4, 2) matrix
combined = np.column_stack([gpa, attendance])
print(f"column_stack:\n{combined}")

# Two (4, 2) batches of students -> one (8, 2) matrix
more_students = np.array([[3.5, 90], [2.0, 55]])
all_students = np.vstack([combined, more_students])
print(f"vstack shape: {all_students.shape}")

column_stack:
[[ 3.1 85. ]
 [ 2.4 60. ]
 [ 3.8 95. ]
 [ 2.9 70. ]]
vstack shape: (6, 2)

4. Indexing and Slicing

NumPy indexing extends Python’s list slicing to multiple dimensions. For a 2D array, the convention is array[rows, columns], and negative indices still count from the end.

scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])

print(f"first three   : {scores[:3]}")
print(f"last three    : {scores[-3:]}")
print(f"between 3 & 7 : {scores[3:7]}")
print(f"every other   : {scores[::2]}")
print(f"reversed      : {scores[::-1]}")

first three   : [62 78 85]
last three    : [95 67 80]
between 3 & 7 : [91 55 73 88]
every other   : [62 85 55 88 67]
reversed      : [80 67 95 88 73 55 91 85 78 62]

X = np.array(
    [
        [12, 85, 3.1],
        [5, 60, 2.4],
        [18, 95, 3.8],
        [9, 70, 2.9],
        [22, 98, 3.9],
    ]
)

print(f"row 0              : {X[0]}")
print(f"column 1 (all rows): {X[:, 1]}")  # every row, attendance column
print(f"rows 1-3, col 0 & 2: \n{X[1:4, [0, 2]]}")  # fancy column indexing
print(f"single cell [2, 1] : {X[2, 1]}")

row 0              : [12.  85.   3.1]
column 1 (all rows): [85. 60. 95. 70. 98.]
rows 1-3, col 0 & 2: 
[[ 5.   2.4]
 [18.   3.8]
 [ 9.   2.9]]
single cell [2, 1] : 95.0

Common Mistake: Basic slices are views; fancy/boolean indexing copies

X[:, 1] (a slice) returns a view: mutating it mutates X. X[1:4, [0, 2]] (a list of indices, known as “fancy indexing”) always returns a copy. If you need an independent array from a slice, call .copy() explicitly: col = X[:, 1].copy().

Activity 2 - Select Top Performers

Goal: Given the X feature matrix above (columns: study_hours, attendance_pct, prior_gpa), use slicing to print:

The prior_gpa column (all rows)
The first two rows, all columns
The study_hours and prior_gpa columns (skip attendance) for every row

Hint: For (3), use fancy column indexing: X[:, [0, 2]].

X = np.array(
    [
        [12, 85, 3.1],
        [5, 60, 2.4],
        [18, 95, 3.8],
        [9, 70, 2.9],
        [22, 98, 3.9],
    ]
)

gpa_column = ...  # TODO
first_two_rows = ...  # TODO
hours_and_gpa = ...  # TODO

print(f"gpa_column     : {gpa_column}")
print(f"first_two_rows :\n{first_two_rows}")
print(f"hours_and_gpa  :\n{hours_and_gpa}")

gpa_column     : Ellipsis
first_two_rows :
Ellipsis
hours_and_gpa  :
Ellipsis

5. Boolean Masking & Vectorised Conditionals

Comparing an array to a value produces a boolean array of the same shape: a “mask.” Using that mask to index the original array keeps only the True positions. This replaces if/for filtering loops entirely.

scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])

passing_mask = scores >= 70
print(f"mask     : {passing_mask}")
print(f"passing  : {scores[passing_mask]}")  # boolean indexing: keeps True positions
print(f"n passing: {passing_mask.sum()}")  # True counts as 1, False as 0

mask     : [False  True  True  True False  True  True  True False  True]
passing  : [78 85 91 73 88 95 80]
n passing: 7

Combine conditions with & (and) / | (or), not Python’s and/or, which only work on single booleans, not arrays. Each side needs its own parentheses because &/| bind tighter than comparison operators:

attendance = np.array([85, 60, 95, 70, 98, 45, 88, 92, 55, 80])
scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])

# Parentheses are required: & binds tighter than >= without them
at_risk = (scores < 70) & (attendance < 70)
print(f"at_risk mask : {at_risk}")
print(f"n at risk    : {at_risk.sum()}")

at_risk mask : [False False False False False False False False  True False]
n at risk    : 1

np.where(condition, if_true, if_false) builds a new array by choosing between two values element-wise: the vectorised equivalent of a ternary expression inside a loop:

labels = np.where(scores >= 70, "pass", "fail")
print(labels)

# np.select handles more than two outcomes
grade = np.select(
    [scores >= 90, scores >= 80, scores >= 70, scores >= 60],
    ["A", "B", "C", "D"],
    default="F",
)
print(grade)

['fail' 'pass' 'pass' 'pass' 'fail' 'pass' 'pass' 'pass' 'fail' 'pass']
['D' 'C' 'B' 'A' 'F' 'C' 'B' 'A' 'D' 'B']

Activity 3 - Flag Students Needing Intervention

Goal: Given scores and attendance arrays, build a boolean mask needs_help that flags students with score < 70 or attendance < 60, then print how many students were flagged and their scores.

scores     = [62, 78, 85, 91, 55, 73, 88, 95, 67, 80]
attendance = [85, 60, 95, 70, 98, 45, 88, 92, 55, 80]
# needs_help -> True at indices 0, 4, 5, 8  (score<70 OR attendance<60)

scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])
attendance = np.array([85, 60, 95, 70, 98, 45, 88, 92, 55, 80])

needs_help = ...  # TODO: boolean mask, score < 70 OR attendance < 60

print(f"needs_help        : {needs_help}")
# print(f"n flagged         : {needs_help.sum()}")
# print(f"flagged scores    : {scores[needs_help]}")

needs_help        : Ellipsis

6. Aggregations Along an Axis

mean(), sum(), std(), min(), max() collapse an array to a single number by default. On a 2D matrix, the axis argument controls which dimension gets collapsed : this is the single most common source of “right function, wrong number” bugs in DS/ML code, so get the convention straight now:

axis=0 collapses rows -> one result per column (per feature)
axis=1 collapses columns -> one result per row (per sample)

X = np.array(
    [
        [12.0, 85.0, 3.1],
        [5.0, 60.0, 2.4],
        [18.0, 95.0, 3.8],
        [9.0, 70.0, 2.9],
        [22.0, 98.0, 3.9],
    ]
)

print(f"overall mean        : {X.mean():.2f}")  # one number, all 15 values
print(f"per-feature mean    : {X.mean(axis=0)}")  # shape (3,): one per column
print(f"per-student mean    : {X.mean(axis=1)}")  # shape (5,): one per row
print(f"per-feature std     : {X.std(axis=0)}")
print(f"per-feature min/max : {X.min(axis=0)} / {X.max(axis=0)}")

overall mean        : 32.67
per-feature mean    : [13.2  81.6   3.22]
per-student mean    : [33.36666667 22.46666667 38.93333333 27.3        41.3       ]
per-feature std     : [ 6.11228272 14.56845908  0.56356011]
per-feature min/max : [ 5.  60.   2.4] / [22.  98.   3.9]

Pro Tip: Say the axis name out loud

“axis=0” is easy to misremember. Read it as: “collapse axis 0 (the row axis): what’s left is one value per column.” If you want one statistic per feature (the usual case before normalising a feature matrix), that is always axis=0.

7. Broadcasting

Broadcasting is the rule NumPy uses to apply an operation between two arrays of different shapes, by virtually “stretching” the smaller one, without actually copying any data. It is what lets you write X - X.mean(axis=0) instead of a loop over rows.

Key Concept: The Broadcasting Rule

Compare shapes from the right-hand side. Two dimensions are compatible when they are equal, or when one of them is 1 (it gets stretched to match). Missing leading dimensions are treated as 1.

(5, 3) and (3,) → treat (3,) as (1, 3) → stretch to (5, 3). ✅
(5, 3) and (5,) → treat (5,) as (1, 5) → 3 != 5 and neither is 1. ❌

X = np.array(
    [
        [12.0, 85.0, 3.1],
        [5.0, 60.0, 2.4],
        [18.0, 95.0, 3.8],
        [9.0, 70.0, 2.9],
        [22.0, 98.0, 3.9],
    ]
)

feature_mean = X.mean(axis=0)  # shape (3,)  -- one mean per feature
feature_std = X.std(axis=0)  # shape (3,)

print(f"X.shape            : {X.shape}")
print(f"feature_mean.shape : {feature_mean.shape}")

# (5, 3) - (3,) broadcasts the mean across every row: no loop needed
X_normalised = (X - feature_mean) / feature_std
print(f"normalised:\n{X_normalised}")
print(f"new per-feature mean (~0): {X_normalised.mean(axis=0).round(6)}")

X.shape            : (5, 3)
feature_mean.shape : (3,)
normalised:
[[-0.196326    0.23338089 -0.21293203]
 [-1.34156098 -1.48265509 -1.45503555]
 [ 0.78530399  0.91979529  1.02917149]
 [-0.68714099 -0.7962407  -0.56781875]
 [ 1.43972398  1.1257196   1.20661485]]
new per-feature mean (~0): [ 0.  0. -0.]

Broadcasting: shapes align right, size-1 dimensions stretch

flowchart LR
    A["(3, 3) array
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]"] -->|"+ scalar ()"| R1["broadcasts to (3,3)
[[11,12,13],
 [14,15,16],
 [17,18,19]]"]
    B["(3, 3) array"] -->|"+ row (1, 3)
[10, 20, 30]"| R2["row expands to (3,3)
[[11,22,33],
 [14,25,36],
 [17,28,39]]"]
    C["(3, 1) col"] -->|"+ (1, 3) row
shapes align on right"| R3["(3,3) result
outer-product-like"]

    style R1 fill:#EBF5F0,stroke:#059669,color:#065F46
    style R2 fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E
    style R3 fill:#F5F3FF,stroke:#7C3AED,color:#3B0764

The diagram below makes the rule literal: the top row is feature_mean, shape (3,). NumPy stretches it down to align with every one of the 5 rows in X, without ever actually copying it 5 times in memory.

from ark.plot.diagrams import broadcasting_diagram

broadcasting_diagram();

Common Mistake: Broadcasting a (5,) array against a (5, 3) matrix fails for a subtle reason

If you instead compute per_student_mean = X.mean(axis=1) (shape (5,)) and try X - per_student_mean, NumPy raises ValueError: operands could not be broadcast together. It is comparing the trailing dimensions 3 vs 5, not what you meant. Fix it by giving the per-row result an explicit column shape with keepdims=True: X.mean(axis=1, keepdims=True) has shape (5, 1), which broadcasts correctly against (5, 3).

row_mean = X.mean(axis=1, keepdims=True)  # shape (5, 1), NOT (5,)
print(f"row_mean.shape : {row_mean.shape}")

# (5, 3) - (5, 1) broadcasts the per-row mean across every column
centered_per_row = X - row_mean
print(f"centered_per_row:\n{centered_per_row}")

row_mean.shape : (5, 1)
centered_per_row:
[[-21.36666667  51.63333333 -30.26666667]
 [-17.46666667  37.53333333 -20.06666667]
 [-20.93333333  56.06666667 -35.13333333]
 [-18.3         42.7        -24.4       ]
 [-19.3         56.7        -37.4       ]]

Activity 4 - Min-Max Scale Every Feature

Goal: Write a one-line expression that scales every column of X to the [0, 1] range using the formula (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)). Confirm the result’s per-column min is 0 and max is 1.

X_scaled.min(axis=0)  # -> array([0., 0., 0.])
X_scaled.max(axis=0)  # -> array([1., 1., 1.])

X = np.array(
    [
        [12.0, 85.0, 3.1],
        [5.0, 60.0, 2.4],
        [18.0, 95.0, 3.8],
        [9.0, 70.0, 2.9],
        [22.0, 98.0, 3.9],
    ]
)

X_scaled = ...  # TODO: min-max scale every column to [0, 1]

# print(f"min per column: {X_scaled.min(axis=0)}")
# print(f"max per column: {X_scaled.max(axis=0)}")

8. Vectorisation vs. Python Loops

Now that you can express normalisation as (X - mean) / std with no loop, it is worth seeing why that matters. Every NumPy operation you have used so far runs as a single compiled loop over contiguous memory; a Python for loop pays the cost of the interpreter on every single element.

import time

big = np.random.default_rng(0).normal(size=1_000_000)


def zscore_loop(values: np.ndarray) -> list[float]:
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance**0.5
    return [(v - mean) / std for v in values]


start = time.perf_counter()
_ = zscore_loop(big)
loop_time = time.perf_counter() - start

start = time.perf_counter()
_ = (big - big.mean()) / big.std()
vector_time = time.perf_counter() - start

print(f"Python loop time : {loop_time:.4f}s")
print(f"Vectorised time  : {vector_time:.4f}s")
print(f"Speedup          : {loop_time / vector_time:,.0f}x")

Python loop time : 0.4303s
Vectorised time  : 0.0153s
Speedup          : 28x

Pro Tip: If you are writing a for loop over an array, stop and look for a vectorised way

Almost every elementwise transformation, filter, or aggregation you would write as a Python loop already has a NumPy equivalent: arithmetic operators, np.where, boolean masks, axis aggregations. Reach for those first: a hand-written loop over a large array is one of the most common DS/ML performance bugs, and it is usually a 10-100x slowdown for no benefit.

9. Linear Algebra Essentials

A linear model’s prediction is a dot product: multiply each feature by a learned weight, sum the results, add a bias. @ (matrix multiplication) computes this for every row of a feature matrix at once, no loop over students required.

X = np.array(
    [
        [12.0, 85.0, 3.1],
        [5.0, 60.0, 2.4],
        [18.0, 95.0, 3.8],
        [9.0, 70.0, 2.9],
        [22.0, 98.0, 3.9],
    ]
)  # shape (5 students, 3 features)

# Suppose a (already-fitted) linear model has these learned weights and bias
weights = np.array([1.5, 0.3, 8.0])  # one weight per feature, shape (3,)
bias = 10.0

# X @ weights: (5, 3) @ (3,) -> (5,) -- one prediction per student
predicted_scores = X @ weights + bias
print(f"predicted_scores: {predicted_scores.round(1)}")

predicted_scores: [ 78.3  54.7  95.9  67.7 103.6]

@ on a matrix and a vector is shorthand for: for each row, multiply element-wise by weights and sum: exactly (X * weights).sum(axis=1). Verify the two are equivalent, then measure prediction error against the true scores with np.linalg.norm (the Euclidean / RMS-style distance):

# @ is equivalent to elementwise multiply + sum along axis=1
manual = (X * weights).sum(axis=1) + bias
print(f"@ matches manual sum: {np.allclose(predicted_scores, manual)}")

actual_scores = np.array([88.0, 65.0, 95.0, 78.0, 99.0])
errors = predicted_scores - actual_scores
rmse = np.linalg.norm(errors) / np.sqrt(len(errors))
print(f"errors : {errors.round(1)}")
print(f"RMSE   : {rmse:.2f}")

@ matches manual sum: True
errors : [ -9.7 -10.3   0.9 -10.3   4.6]
RMSE   : 8.10

Common Mistake: Comparing floats with ==

np.allclose(a, b) was used above instead of (a == b).all() on purpose: floating-point arithmetic accumulates tiny rounding errors, so two mathematically-equal results can differ in their last bit. Always compare floats with a tolerance: np.allclose, or abs(a - b) < 1e-9, never with exact ==.

10. Saving & Loading Arrays

A typical pipeline computes a feature matrix once and reuses it across many later steps (training, evaluation, serving). .npy stores a single array in NumPy’s own binary format: far smaller and faster to read than CSV for numeric data, and it preserves dtype and shape exactly. .npz bundles several named arrays together.

from pathlib import Path

tmp_dir = Path("tmp_numpy_activity")
tmp_dir.mkdir(exist_ok=True)

X = np.array([[12.0, 85.0, 3.1], [5.0, 60.0, 2.4], [18.0, 95.0, 3.8]])
y = np.array([88.0, 65.0, 95.0])

# Save a single array
np.save(tmp_dir / "X.npy", X)
X_loaded = np.load(tmp_dir / "X.npy")
print(f"round-trip equal: {np.array_equal(X, X_loaded)}")

# Save several named arrays together in one file
np.savez(tmp_dir / "dataset.npz", features=X, target=y)
bundle = np.load(tmp_dir / "dataset.npz")
print(f"keys   : {list(bundle.keys())}")
print(f"target : {bundle['target']}")

round-trip equal: True
keys   : ['features', 'target']
target : [88. 65. 95.]

import shutil

shutil.rmtree(tmp_dir)
print(f"cleaned up: {tmp_dir.exists()}")

cleaned up: False

11. Common Gotchas

Like the Python gotchas in Part 3, none of these raise an exception. They silently produce a wrong (or surprising) result. Recognise them now so you do not lose hours to them later.

# GOTCHA 1: basic slicing returns a VIEW, not a copy
scores = np.array([62, 78, 85, 91, 55])
top_three = scores[:3]
top_three[0] = 0  # mutating the "slice"...

print(f"top_three : {top_three}")
print(f"scores    : {scores}")  # ...changed the original too!

# Fix: copy explicitly when you need an independent array
safe_copy = scores[:3].copy()

top_three : [ 0 78 85]
scores    : [ 0 78 85 91 55]

# GOTCHA 2: integer dtype truncates on division-like ops, and can overflow
small = np.array([120, 10], dtype=np.int8)
print(f"int8 + 50 : {small + 50}")  # 120 + 50 = 170, overflows int8's max of 127!

# Fix: use a wide-enough dtype, or let NumPy infer (default is int64/float64)
safe = small.astype(np.int64) + 50
print(f"int64 + 50: {safe}")

int8 + 50 : [-86  60]
int64 + 50: [170  60]

# GOTCHA 3: {} is a dict, not a set -- same trap as in plain Python (Part 3, Sec. 8)
empty_dict = {}
empty_set = set()
print(f"type({{}})   : {type(empty_dict)}")
print(f"type(set()) : {type(empty_set)}")

# GOTCHA 4: comparing floats with == (see Sec. 9 for the fix: np.allclose)
a = np.array([0.1 + 0.2])
print(f"0.1 + 0.2 == 0.3 : {a == 0.3}")  # False! 0.30000000000000004 != 0.3
print(f"np.allclose      : {np.allclose(a, 0.3)}")  # True: tolerant comparison

type({})   : <class 'dict'>
type(set()) : <class 'set'>
0.1 + 0.2 == 0.3 : [False]
np.allclose      : True

12. Capstone Exercises

Apply everything from this notebook together. Each exercise is self-contained.

Exercise 1 - Build, Normalise, and Predict

Goal: Using the students dataset below:

Build a (6, 3) feature matrix X with np.column_stack
Z-score normalise X (Sec. 7)
Predict exam_score with the given weights/bias using @ (Sec. 9), applied to the normalised features
Compute the RMSE against actual_scores using np.linalg.norm

study_hours = np.array([12, 5, 18, 9, 22, 14])
attendance_pct = np.array([85, 60, 95, 70, 98, 80])
prior_gpa = np.array([3.1, 2.4, 3.8, 2.9, 3.9, 3.3])
actual_scores = np.array([88.0, 65.0, 95.0, 78.0, 99.0, 84.0])

weights = np.array([0.8, 0.5, 6.0])
bias = 55.0

# TODO: 1) build X, 2) normalise it, 3) predict, 4) compute RMSE
X = ...
X_normalised = ...
predicted = ...
rmse = ...

print(f"predicted: {predicted}")
print(f"RMSE     : {rmse:.2f}")

predicted: Ellipsis

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[30], line 16
     13 rmse = ...
     15 print(f"predicted: {predicted}")
---> 16 print(f"RMSE     : {rmse:.2f}")

TypeError: unsupported format string passed to ellipsis.__format__

Exercise 2 - Vectorised Anomaly Detector

Goal: Rewrite the deque-based anomaly detector from Part 3 (Sec. 9, Exercise 3) without any explicit loop. Flag any reading more than 2 standard deviations from the overall mean using a single boolean mask.

readings = [36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8]
# Expected: reading 39.5 (index 5) flagged as anomaly

Hint: z = (readings - readings.mean()) / readings.std(), then mask np.abs(z) > 2.

readings = np.array([36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8])

z_scores = ...  # TODO
anomaly_mask = ...  # TODO

print(f"z_scores      : {z_scores.round(2)}")
print(f"anomaly_mask  : {anomaly_mask}")
print(f"anomalies     : {readings[anomaly_mask]}")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[31], line 6
      3 z_scores = ...  # TODO
      4 anomaly_mask = ...  # TODO
----> 6 print(f"z_scores      : {z_scores.round(2)}")
      7 print(f"anomaly_mask  : {anomaly_mask}")
      8 print(f"anomalies     : {readings[anomaly_mask]}")

AttributeError: 'ellipsis' object has no attribute 'round'

What’s New in NumPy 2.0

NumPy 2.0 (released June 2024) is the first major version bump in almost two decades. Two changes matter most for day-to-day data science work:

Removed type aliases

The old Python-builtin aliases (np.int, np.float, np.bool, np.complex, np.object, np.str) were deprecated for years and are now fully removed. They were just aliases to the Python built-ins anyway, so the fix is mechanical:

Old (removed)	Replacement
`np.bool`	`np.bool_` or Python `bool`
`np.int`	`np.intp` or Python `int`
`np.float`	`np.float64` or Python `float`
`np.complex`	`np.complex128` or Python `complex`
`np.object`	`np.object_` or Python `object`
`np.str`	`np.str_` or Python `str`

`StringDType` for proper string arrays

NumPy 2.0 introduced np.dtypes.StringDType(), a real variable-length string dtype backed by UTF-8 memory. The old np.str_ stored fixed-width UCS-4 strings (one array dtype for every character count), StringDType stores arbitrary-length strings efficiently.

Pro Tip: Use np.float64 not np.float in new code

If you see a module ‘numpy’ has no attribute ‘float’ error, the codebase was written against NumPy 1.x. A global search-and-replace of np.float → np.float64 (and so on for the others in the table) is the complete fix.

import numpy as np

# Old aliases are REMOVED in NumPy 2.0 — this would raise AttributeError:
# arr = np.array([1, 2, 3], dtype=np.float)   # ❌

# Use the explicit dtype names instead:
arr = np.array([1, 2, 3], dtype=np.float64)  # ✅
print(f"dtype: {arr.dtype}")

# StringDType: variable-length strings, efficient UTF-8 storage (NumPy 2.0+)
names = np.array(["Alice", "Bob", "Charlie"], dtype=np.dtypes.StringDType())
print(f"names : {names}")
print(f"dtype : {names.dtype}")

# The old fixed-width str_ is still available but StringDType is the modern choice
old_style = np.array(["Alice", "Bob", "Charlie"])  # infers str_ (fixed width)
print(f"old dtype: {old_style.dtype}")  # <U7 means Unicode, 7 chars wide

dtype: float64
names : ['Alice' 'Bob' 'Charlie']
dtype : StringDType()
old dtype: <U7

Resource	Why it matters
Harris, C.R. et al. (2020). Array programming with NumPy. Nature 585, 357–362.	The primary citation for NumPy; the paper explains the design decisions behind broadcasting and ufuncs
VanderPlas, J. (2016). Python Data Science Handbook, Ch. 2. O’Reilly.	Free online — the most readable treatment of fancy indexing, broadcasting, and structured arrays
NumPy documentation — Broadcasting	Official broadcasting rules with diagrams; bookmark for the next time the shapes don’t align
NumPy documentation — Indexing	Covers basic, advanced, and boolean indexing in one place

Summary

Concept	Key rule
`ndarray`	One dtype, contiguous memory, fast because it skips the Python interpreter
Creation	`np.array`, `arange`, `linspace`, `zeros`/`ones`; `np.random.default_rng(seed)` for reproducible random data
`shape`/`dtype`	Always check before trusting a result; mixed-type input silently upcasts
`reshape`	Returns a view, same data, same `size`, different shape
`flatten`	Always returns a copy
Slicing	Basic slices (`X[:, 0]`) are views; fancy/boolean indexing always copies
Boolean masks	`&` / `\\|` (not `and`/`or`) on arrays, each side parenthesised
`np.where` / `np.select`	Vectorised if/else and multi-branch labelling
`axis=0` vs `axis=1`	0 collapses rows -> one value per column; 1 collapses columns -> one value per row
Broadcasting	Compare shapes right-to-left; dims match if equal or one is `1`; use `keepdims=True` to broadcast a per-row stat
Vectorisation	A NumPy expression beats a Python loop by 10-100x, look for one before writing `for`
`@` / `np.dot`	Matrix multiplication: `X @ weights` predicts every row in one call
`np.allclose`	Always compare floats with a tolerance, never `==`
`.npy` / `.npz`	Compact, dtype/shape-preserving array storage between pipeline stages

Next: 05-matplotlib.ipynb, covering how to visualise arrays and DataFrames with matplotlib.