def weighted_grade(
    midterm: float,
    final: float,
    project: float,
    weights: tuple[float, float, float] = (0.30, 0.50, 0.20),
) -> float:
    """Compute a weighted final grade from three components.
    Args:
        midterm: Midterm exam score (0-100).
        final: Final exam score (0-100).
        project: Project score (0-100).
        weights: (midterm_w, final_w, project_w); must sum to 1.0.
    Returns:
        Weighted average on a 0-100 scale.
    Raises:
        ValueError: If weights do not sum to 1.0 within tolerance.
    """
    w_mid, w_fin, w_proj = weights
    if abs(w_mid + w_fin + w_proj - 1.0) > 1e-9:
        raise ValueError(f"Weights must sum to 1.0; got {w_mid + w_fin + w_proj}")
    return midterm * w_mid + final * w_fin + project * w_proj
Part 3: Patterns for Data Science & ML
DS-MLOps Python Foundations
Python 3.12+ | Author: Anthony Faustine
Before you begin
This notebook assumes you have completed Part 1 (01-python-core.ipynb) and Part 2 (02-control-flow.ipynb). If you have not, start there. The concepts here build directly on both.
Part 3 introduces professional coding patterns: the habits and structures that separate a working script from maintainable, production-grade code. These patterns are used every day in real data science and ML engineering work.
| Pattern | Why it matters |
|---|---|
| Functions | Reuse logic without copying code; make code testable |
| Lambda | Write concise callbacks for sorted(), map(), pandas .apply() |
| *args / **kwargs | Handle flexible inputs like scikit-learn and PyTorch do |
| Dataclasses | Typed, structured containers for configs and pipeline state |
| Modules | Organise code into files; use the standard library |
| Exceptions | Handle errors gracefully instead of crashing |
| pathlib | Read and write files safely, cross-platform |
The running example is the same university analytics platform from Part 1.
Callout markers used throughout this notebook are explained on the book cover page.
1. Functions
A function is a named, reusable block of code. You define it once and call it as many times as you need, with different inputs each time.
# Define once:
def greet(name):
    print(f'Hello, {name}!')
# Call many times:
greet('Alice') # Hello, Alice!
greet('Bob') # Hello, Bob!
Without functions, any repeated logic must be copy-pasted, and copy-pasted code means bugs get fixed in one place but not the other. Functions are the foundation of all reusable, testable code.
Every function you write for production should have:
- Type annotations on all parameters and the return value: mypy and your IDE use them to catch bugs. This is the same name: type syntax from Part 1, Sec. 1, just applied to function signatures.
- A docstring that explains what the function does, its arguments, and what it returns. This project uses Google style.
The default parameter rule: defaults are evaluated once at definition time. Never use a mutable object (list, dict) as a default; see the Common Mistake callout in Sec. 8.
Call the function with default weights, then with custom weights. Keyword arguments make the call self-documenting:
# Default weights (0.30 mid / 0.50 final / 0.20 project)
grade = weighted_grade(midterm=82.0, final=91.0, project=88.0)
print(f"Default weights : {grade:.1f}")
# Override weights - the tuple must still sum to 1.0
custom = weighted_grade(82.0, 91.0, 88.0, weights=(0.20, 0.60, 0.20))
print(f"Custom weights : {custom:.1f}")
Default weights : 87.7
Custom weights : 88.6
Functions can return multiple values packed as a tuple. Callers unpack them directly into named variables with a, b, c = func():
def score_summary(scores: list[float]) -> tuple[float, float, float, float]:
    """Return descriptive statistics for a score list.
    Args:
        scores: Non-empty list of numeric scores.
    Returns:
        Tuple of (mean, minimum, maximum, std_dev).
    Raises:
        ValueError: If scores is empty.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    mean = sum(scores) / n
    # Population variance: divide by n, not n - 1
    variance = sum((s - mean) ** 2 for s in scores) / n
    return mean, min(scores), max(scores), variance**0.5
Python packs the four return values into a tuple. Unpack them in one line with tuple unpacking:
exam_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]
mean, lo, hi, std = score_summary(exam_scores)
print(f"mean={mean:.1f} min={lo} max={hi} std={std:.1f}")
mean=83.9 min=67.0 max=95.0 std=8.8
Another essential DS function: z-score normalisation. It rescales any value to “how many standard deviations from the mean”, a common preprocessing step for models that are sensitive to feature scale:
import statistics
def normalize(value: float, mean: float, std: float) -> float:
    """Compute the z-score of a single value.
    Args:
        value: The raw data point.
        mean: Population or sample mean.
        std: Standard deviation (must be non-zero).
    Returns:
        Z-score: positive = above average, negative = below average.
    Raises:
        ValueError: If std is zero (all values identical).
    """
    if std == 0.0:
        raise ValueError("std must be non-zero")
    return (value - mean) / std
Apply it to an exam score list. The z-scores make it immediately clear who is above/below the class mean:
exam_scores: list[float] = [72.0, 85.0, 91.0, 68.0, 88.0, 77.0, 94.0, 63.0]
mu = statistics.mean(exam_scores)
sig = statistics.stdev(exam_scores)
print(f"mean={mu:.1f} std={sig:.1f}\n")
for score in exam_scores:
    z = normalize(score, mu, sig)
    label = "above avg" if z > 0 else "below avg"
    print(f" {score:5.1f} z={z:+.2f} {label}")
mean=79.8 std=11.4
72.0 z=-0.68 below avg
85.0 z=+0.46 above avg
91.0 z=+0.99 above avg
68.0 z=-1.03 below avg
88.0 z=+0.72 above avg
77.0 z=-0.24 below avg
94.0 z=+1.25 above avg
63.0 z=-1.47 below avg
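One detail worth checking when computing z-scores is which standard deviation is in play. score_summary above divides by n (the population form), while statistics.stdev divides by n - 1 (the sample form). A small sketch of the difference, reusing the same exam scores:

```python
import math
import statistics

exam_scores = [72.0, 85.0, 91.0, 68.0, 88.0, 77.0, 94.0, 63.0]
n = len(exam_scores)
mean = statistics.mean(exam_scores)
sum_sq = sum((s - mean) ** 2 for s in exam_scores)

population_std = math.sqrt(sum_sq / n)    # divide by n
sample_std = math.sqrt(sum_sq / (n - 1))  # divide by n - 1

# The statistics module exposes both forms under separate names.
print(math.isclose(population_std, statistics.pstdev(exam_scores)))  # True
print(math.isclose(sample_std, statistics.stdev(exam_scores)))       # True
print(f"population={population_std:.2f} sample={sample_std:.2f}")
```

The sample form is always slightly larger, so z-scores computed with it are slightly smaller in magnitude. Which one is appropriate depends on whether the scores are the whole population or a sample from it.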
Goal: Write a fully annotated function classify_cohort that takes a list of scores and returns a dict mapping each grade letter to its count.
classify_cohort([95, 83, 71, 62, 45, 88, 76])
# -> {'A': 1, 'B': 2, 'C': 2, 'D': 1, 'F': 1}
Hint: Use a helper function grade_letter(score) -> str and Counter.
from collections import Counter
def grade_letter(score: float) -> str:
    """Return the letter grade for a numeric score."""
    ... # TODO
def classify_cohort(scores: list[float]) -> dict[str, int]:
    """Return grade-letter frequency counts for a cohort.
    Args:
        scores: List of numeric scores.
    Returns:
        Dict mapping each letter grade to its count.
    """
    ... # TODO
print(classify_cohort([95.0, 83.0, 71.0, 62.0, 45.0, 88.0, 76.0]))
None
Goal: Write accept_login(users, username, password) that takes a dict[str, str] of username→password pairs and returns True if the username exists and the password matches, False otherwise.
users = {"alice": "ds2024", "bob": "ml#secure"}
accept_login(users, "alice", "ds2024") # True
accept_login(users, "alice", "wrong") # False (bad password)
accept_login(users, "carol", "any") # False (user not found)
Hint: Use dict.get() to avoid a KeyError on missing usernames.
def accept_login(users: dict[str, str], username: str, password: str) -> bool:
    """Return True if username exists and password matches."""
    # TODO: implement
users: dict[str, str] = {"alice": "ds2024", "bob": "ml#secure"}
print(accept_login(users, "alice", "ds2024")) # True
print(accept_login(users, "alice", "wrong")) # False
print(accept_login(users, "carol", "any")) # False
None
None
None
2. Lambda Functions
A lambda is an anonymous (nameless) function defined in a single expression. It takes inputs on the left of : and produces a result on the right:
double = lambda x: x * 2
double(5) # -> 10
Lambdas are most useful as short callbacks: a function you pass to another function rather than call yourself. You will use them constantly with sorted(), map(), filter(), and pandas .apply().
Key Concept: Anonymous Single-Expression Function
A lambda is a function with no name, a single expression, and an implicit return. Use it as a short callback for sorted(), map(), filter(), and especially pandas .apply().
If the body needs more than one expression, write a named function instead; lambdas must stay simple.
# Lambda syntax: lambda params: expression
square = lambda x: x**2 # noqa: E731
clamp = lambda x, lo, hi: max(lo, min(x, hi)) # noqa: E731
print(f"square(5) = {square(5)}")
print(f"clamp(150, 0, 100) = {clamp(150, 0, 100)}")
square(5) = 25
clamp(150, 0, 100) = 100
The most common real-world use for lambdas is as a sort key: a function that extracts the comparison value from each element:
# Most common use: sort key: avoids writing a throwaway named function
students: list[dict[str, object]] = [
{"name": "Alice", "gpa": 3.95, "major": "CS"},
{"name": "Bob", "gpa": 3.45, "major": "Math"},
{"name": "Carol", "gpa": 3.88, "major": "CS"},
{"name": "Dan", "gpa": 3.72, "major": "Math"},
]
by_gpa = sorted(students, key=lambda s: s["gpa"], reverse=True)
by_major = sorted(students, key=lambda s: (s["major"], s["gpa"]))
print("By GPA (desc):")
for s in by_gpa:
    print(f" {s['name']:<8} GPA={s['gpa']}")
print("\nBy major then GPA:")
for s in by_major:
    print(f" {s['major']:<6} {s['name']:<8} GPA={s['gpa']}")
By GPA (desc):
Alice GPA=3.95
Carol GPA=3.88
Dan GPA=3.72
Bob GPA=3.45
By major then GPA:
CS Carol GPA=3.88
CS Alice GPA=3.95
Math Bob GPA=3.45
Math Dan GPA=3.72
map() and filter()
map(func, iterable) applies func to every element. Both map() and filter() return lazy iterators; wrap them with list() to materialise the result:
# map() applies a function to every element: returns a lazy iterator
raw_scores: list[str] = ["78.5", "85.0", "92.3", "61.0", "88.7"]
# Convert strings to floats
scores: list[float] = list(map(float, raw_scores))
print(f"Converted : {scores}")
# Normalise each score to 0-1
lo, hi = min(scores), max(scores)
normed: list[float] = list( # noqa: C417
map(lambda s: round((s - lo) / (hi - lo), 3), scores)
)
print(f"Normalised: {normed}")
Converted : [78.5, 85.0, 92.3, 61.0, 88.7]
Normalised: [0.559, 0.767, 1.0, 0.0, 0.885]
filter(func, iterable) keeps only elements where func returns True. Equivalent to [x for x in items if func(x)] but more expressive with a named predicate:
# filter() keeps elements where the function returns True
passing: list[float] = list(filter(lambda s: s >= 70, scores))
print(f"Passing: {passing}")
# Preview: lambda is the backbone of pandas .apply()
# df['grade'] = df['score'].apply(lambda s: 'pass' if s >= 70 else 'fail')
# You will see this pattern in the Data Analysis module.
Passing: [78.5, 85.0, 92.3, 88.7]
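The equivalence between filter()/map() and list comprehensions mentioned above can be checked directly. A minimal sketch reusing the score values from this section (the "curved" adjustment is a made-up example, not part of the platform):

```python
scores = [78.5, 85.0, 92.3, 61.0, 88.7]

# filter() with a lambda vs. a list comprehension with a condition
passing_filter = list(filter(lambda s: s >= 70, scores))
passing_comp = [s for s in scores if s >= 70]
print(passing_filter == passing_comp)  # True

# map() with a lambda vs. a list comprehension with an expression
curved_map = list(map(lambda s: min(s + 5, 100.0), scores))
curved_comp = [min(s + 5, 100.0) for s in scores]
print(curved_map == curved_comp)  # True

# map() returns a one-shot iterator: a second pass yields nothing
lazy = map(float, ["1.5", "2.5"])
print(list(lazy))  # [1.5, 2.5]
print(list(lazy))  # []
```

The last two lines show the practical consequence of laziness: once an iterator has been consumed, it is empty, so materialise it with list() if you need the values more than once.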
3. *args and **kwargs
What are *args and **kwargs?
Sometimes you want a function to accept any number of arguments without listing them all:
def add(*numbers): # *numbers collects all positional args into a tuple
    return sum(numbers)
add(1, 2) # -> 3
add(1, 2, 3, 4) # -> 10 (same function, different number of args)
This pattern is used by virtually every ML library: nn.Sequential(*layers), model.fit(X, y, **config), pd.concat([df1, df2], **options).
- *args collects any number of positional arguments into a tuple.
- **kwargs collects any number of keyword arguments into a dict.
You will encounter both constantly in scikit-learn, PyTorch, and FastAPI APIs: model.fit(X, y, **config), nn.Sequential(*layers).
# *args collects all positional arguments into a tuple
def ensemble_predict(*predictions: float) -> float:
    """Return the mean of any number of model predictions.
    Args:
        *predictions: Floats from individual model predictions.
    Returns:
        Mean of all predictions.
    Raises:
        ValueError: If no predictions are provided.
    """
    if not predictions:
        raise ValueError("At least one prediction required")
    return sum(predictions) / len(predictions)
Call with one value, three values, or unpack a list with *. The function signature stays the same in all three cases:
print(ensemble_predict(0.82)) # single model
print(ensemble_predict(0.82, 0.91, 0.87)) # three models
# * unpacks a list into positional arguments
model_preds: list[float] = [0.82, 0.91, 0.87, 0.79]
print(ensemble_predict(*model_preds)) # four models
0.82
0.8666666666666667
0.8475
**kwargs collects any number of keyword arguments into a dict. Unpacking a dict with ** passes its contents as keyword arguments to another function:
# **kwargs collects all keyword arguments into a dict
def build_config(model: str, **hyperparams: object) -> dict[str, object]:
    """Assemble a model config dict from keyword arguments.
    Args:
        model: Model name identifier.
        **hyperparams: Any additional hyperparameter key/value pairs.
    Returns:
        Config dict with model name plus all hyperparameters.
    """
    return {"model": model} | dict(hyperparams)
Pass any combination of keyword arguments. ** also unpacks a dict into keyword arguments at the call site:
cfg1 = build_config("xgboost", n_estimators=200, max_depth=6, learning_rate=0.05)
cfg2 = build_config("linear", C=1.0, penalty="l2")
print("Config 1:", cfg1)
print("Config 2:", cfg2)
# ** unpacks a dict into keyword arguments
base_params: dict[str, object] = {"n_estimators": 100, "max_depth": 4}
cfg3 = build_config("xgboost", **base_params, learning_rate=0.01)
print("Config 3:", cfg3)
Config 1: {'model': 'xgboost', 'n_estimators': 200, 'max_depth': 6, 'learning_rate': 0.05}
Config 2: {'model': 'linear', 'C': 1.0, 'penalty': 'l2'}
Config 3: {'model': 'xgboost', 'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.01}
You can combine all four argument forms in one signature: fixed positional, *args, keyword-only (after *), and **kwargs. This is the pattern used by scikit-learn, PyTorch, and FastAPI:
# All four kinds of argument in one signature:
# positional → run_id
# *args → tags (zero or more strings)
# keyword-only → verbose (must be named at the call site)
# **kwargs → metrics (any float metrics)
def log_run(
    run_id: str,
    *tags: str,
    verbose: bool = False,
    **metrics: float,
) -> None:
    """Log a training run with optional tags and metrics."""
    tag_str = ", ".join(tags) if tags else "none"
    metric_str = " ".join(f"{k}={v:.3f}" for k, v in metrics.items())
    print(f"[{run_id}] tags=[{tag_str}] {metric_str}")
    if verbose:
        print(f" (full metrics: {metrics})")
Call it with positional tags and keyword metric pairs. verbose is keyword-only, so it cannot be passed positionally:
log_run("run-001", "baseline", "v1", accuracy=0.923, loss=0.218)
log_run("run-002", verbose=True, accuracy=0.934, precision=0.918)
[run-001] tags=[baseline, v1] accuracy=0.923 loss=0.218
[run-002] tags=[none] accuracy=0.934 precision=0.918
(full metrics: {'accuracy': 0.934, 'precision': 0.918})
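What "keyword-only" means in practice can be seen with a short sketch. The log_run here is a simplified stand-in that returns a string instead of printing, and evaluate is a hypothetical function added to show the bare * form:

```python
def log_run(run_id: str, *tags: str, verbose: bool = False, **metrics: float) -> str:
    """Simplified stand-in for the log_run signature above; returns a summary."""
    return f"{run_id} tags={list(tags)} verbose={verbose} metrics={metrics}"


# A bare True is collected into *tags; it does not set verbose.
print(log_run("run-003", True))
# run-003 tags=[True] verbose=False metrics={}


def evaluate(model_name: str, *, threshold: float = 0.5) -> str:
    """Hypothetical function: the bare * makes threshold keyword-only."""
    return f"{model_name} @ {threshold}"


print(evaluate("xgboost", threshold=0.7))  # xgboost @ 0.7

try:
    evaluate("xgboost", 0.7)  # type: ignore[misc]
except TypeError as exc:
    print(f"TypeError: {exc}")
```

With *tags in the signature, an extra positional value is silently absorbed as a tag. With a bare *, there is nowhere for it to go, so Python raises TypeError at the call site.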
Goal: Write log_metrics(epoch, **metrics) that prints a formatted line and returns a dict.
log_metrics(5, loss=0.312, accuracy=0.901, val_loss=0.334)
# prints:
# Epoch 05 | loss=0.3120 accuracy=0.9010 val_loss=0.3340
# returns:
# {'epoch': 5, 'loss': 0.312, 'accuracy': 0.901, 'val_loss': 0.334}
def log_metrics(epoch: int, **metrics: float) -> dict[str, object]:
    """Print and return a training metrics snapshot.
    Args:
        epoch: Current training epoch.
        **metrics: Metric name -> value pairs.
    Returns:
        Dict containing epoch and all metrics.
    """
    ... # TODO
result = log_metrics(5, loss=0.312, accuracy=0.901, val_loss=0.334)
print(result)
None
4. Dataclasses
A dataclass is a class (a blueprint for creating objects) where Python automatically generates the __init__, __repr__, and __eq__ methods from your field annotations. It is the modern replacement for the plain dict records from Part 1, Sec. 5, once the structure of your data is fixed and known ahead of time.
# Instead of this:
config = {'model': 'xgboost', 'lr': 0.001, 'epochs': 50}
# Use this - self-documenting and typed (annotations are checked by tools such as mypy, not enforced at runtime):
from dataclasses import dataclass
@dataclass
class Config:
    model: str
    lr: float
    epochs: int
If you have not used classes before, think of a class as a custom type you define. Dataclasses are the gentlest introduction. They need no understanding of inheritance or self beyond what is shown here.
A @dataclass (Python 3.7+) generates __init__, __repr__, and __eq__ from field annotations automatically. It is the modern replacement for plain dicts when the shape of your data is known and fixed.
When to use what:
- dict: flexible, arbitrary keys, JSON-friendly
- NamedTuple: immutable record, tuple-compatible
- @dataclass: mutable typed object with methods; default for ML configs and pipeline state
- @dataclass(frozen=True): immutable dataclass; hashable, usable as dict key
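The four options in the list above can be compared side by side. This is a minimal sketch; SplitTuple, SplitData, and SplitFrozen are hypothetical types invented for the comparison:

```python
from dataclasses import dataclass
from typing import NamedTuple


class SplitTuple(NamedTuple):  # hypothetical record type
    train: float
    test: float


@dataclass
class SplitData:  # hypothetical mutable counterpart
    train: float
    test: float


@dataclass(frozen=True)
class SplitFrozen:  # hypothetical immutable counterpart
    train: float
    test: float


as_dict = {"train": 0.8, "test": 0.2}
as_tuple = SplitTuple(0.8, 0.2)
as_data = SplitData(0.8, 0.2)
as_frozen = SplitFrozen(0.8, 0.2)

as_dict["val"] = 0.0            # dict: arbitrary keys can be added
train, test = as_tuple          # NamedTuple: unpacks like a tuple
as_data.train = 0.7             # dataclass: fields can be reassigned
lookup = {as_frozen: "cached"}  # frozen dataclass: hashable, works as a key

print(as_tuple == (0.8, 0.2))         # True
print(as_data)                        # SplitData(train=0.7, test=0.2)
print(lookup[SplitFrozen(0.8, 0.2)])  # cached
```

The last line works because a frozen dataclass hashes by field values, so an equal instance built later finds the same dict entry.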
from dataclasses import dataclass, field
@dataclass
class TrainingConfig:
    """Configuration for a single training run."""
    model_name: str
    learning_rate: float
    epochs: int
    batch_size: int = 32 # field with a default value
    optimizer: str = "adam"
    tags: list[str] = field(default_factory=list) # mutable default: use field()!
    def is_fast_run(self) -> bool:
        """Return True if this is a quick smoke-test run (≤ 5 epochs)."""
        return self.epochs <= 5
@dataclass generates __init__, __repr__, and __eq__ automatically. Create an instance exactly like calling a function:
cfg = TrainingConfig(
model_name="xgboost-v2",
learning_rate=0.001,
epochs=50,
tags=["baseline", "production"],
)
print(cfg) # __repr__ generated: no boilerplate needed
print(f"Fast run : {cfg.is_fast_run()}")
print(f"Tags : {cfg.tags}")
TrainingConfig(model_name='xgboost-v2', learning_rate=0.001, epochs=50, batch_size=32, optimizer='adam', tags=['baseline', 'production'])
Fast run : False
Tags : ['baseline', 'production']
Dataclass fields are mutable by default. __eq__ compares field values, not object identity: two separately created instances with the same fields are equal:
# Mutation: dataclass fields are mutable by default
cfg.epochs = 100
cfg.tags.append("extended")
print(f"Updated epochs: {cfg.epochs} tags: {cfg.tags}")
# Equality: __eq__ is generated automatically from field values
cfg2 = TrainingConfig("xgboost-v2", 0.001, 100, tags=["baseline", "production", "extended"])
print(f"cfg == cfg2: {cfg == cfg2}")
Updated epochs: 100 tags: ['baseline', 'production', 'extended']
cfg == cfg2: True
frozen=True: Immutable, Hashable Dataclass
frozen=True prevents field mutation after creation and makes the object hashable – it can then be used as a dict key or placed in a set:
from dataclasses import dataclass
@dataclass(frozen=True)
class DatasetSplit:
    """Immutable description of a train/val/test split."""
    train_size: float
    val_size: float
    test_size: float
    random_seed: int = 42
    def __post_init__(self) -> None:
        total = self.train_size + self.val_size + self.test_size
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"Splits must sum to 1.0; got {total}")
__post_init__ runs automatically after __init__. Any invalid split ratio is caught immediately on construction:
split = DatasetSplit(train_size=0.70, val_size=0.15, test_size=0.15)
print(split)
# Invalid split: __post_init__ raises ValueError
try:
    bad = DatasetSplit(train_size=0.80, val_size=0.15, test_size=0.15)
except ValueError as exc:
    print(f"Caught: {exc}")
DatasetSplit(train_size=0.7, val_size=0.15, test_size=0.15, random_seed=42)
Caught: Splits must sum to 1.0; got 1.1
A frozen dataclass can be used as a dict key for result caching. Attempting to set a field after construction raises FrozenInstanceError:
# frozen=True means the object is hashable: usable as a dict key or set element
cache: dict[DatasetSplit, float] = {split: 0.923}
print(f"Cached accuracy: {cache[split]}")
# Attempting mutation raises FrozenInstanceError
from dataclasses import FrozenInstanceError
try:
    split.train_size = 0.80 # type: ignore[misc]
except FrozenInstanceError as exc:
    print(f"Immutable: {exc}")
Cached accuracy: 0.923
Immutable: cannot assign to field 'train_size'
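Since a frozen instance cannot be modified, a changed copy has to be built instead. One way to do this is dataclasses.replace(), sketched below with SplitSpec, a hypothetical and simplified stand-in for DatasetSplit:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SplitSpec:  # hypothetical, simplified stand-in for DatasetSplit
    train_size: float
    test_size: float
    random_seed: int = 42


base = SplitSpec(train_size=0.8, test_size=0.2)

# replace() builds a new instance; the original is left untouched.
reseeded = replace(base, random_seed=7)
print(base)      # SplitSpec(train_size=0.8, test_size=0.2, random_seed=42)
print(reseeded)  # SplitSpec(train_size=0.8, test_size=0.2, random_seed=7)

# Direct assignment still fails, and the specific exception can be caught.
try:
    base.random_seed = 7  # type: ignore[misc]
except FrozenInstanceError as exc:
    print(f"Immutable: {exc}")
```

replace() goes through __init__ again, so any validation in __post_init__ also runs on the new copy.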
5. Modules & the Standard Library
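A module is a file of Python code (or a compiled extension) that you load with import. When you write import math, Python loads the math module from the standard library and makes its contents available under the name math. Most standard-library modules are plain .py files; math itself is implemented in C.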
import math
math.sqrt(9) # -> 3.0
You can import your own code files the same way. Splitting code into modules is how real projects stay organised as they grow.
| Import form | What it does |
|---|---|
| import math | import the whole module; access with math.sqrt() |
| from math import sqrt, pi | import specific names; use directly |
| import numpy as np | alias (conventional for large packages) |
Prefer import module over from module import *. The star import pollutes the namespace and hides where names come from.
import math
# math: precise numeric operations
print(f"pi = {math.pi:.6f}")
print(f"sqrt(2) = {math.sqrt(2):.6f}")
print(f"log base 2(8) = {math.log2(8)}")
print(f"ceil(4.2) = {math.ceil(4.2)}")
print(f"floor(4.9) = {math.floor(4.9)}")
pi = 3.141593
sqrt(2) = 1.414214
log base 2(8) = 3.0
ceil(4.2) = 5
floor(4.9) = 4
random: Sampling, Shuffling, and Simulation
random provides pseudo-random number generation for sampling, shuffling, and Monte Carlo simulation. Always call random.seed(n) at the start of any script that uses randomness. It makes results reproducible:
| Function | What it does |
|---|---|
| random.seed(n) | Fix the random state for reproducibility |
| random.uniform(a, b) | Float in [a, b] |
| random.randint(a, b) | Integer in [a, b] (both inclusive) |
| random.choice(seq) | One element from a sequence |
| random.sample(seq, k) | k unique elements (no replacement) |
| random.shuffle(lst) | Shuffle a list in place |
import random
# Always set a seed before any random operation for reproducibility
random.seed(42)
scores: list[float] = [72.0, 85.0, 91.0, 68.0, 88.0, 77.0, 94.0, 63.0]
# shuffle: randomise in place (S311: not for crypto -- pedagogical demo)
shuffled = scores.copy()
random.shuffle(shuffled) # noqa: S311
print(f"Shuffled : {shuffled}")
# choice: pick one element at random
winner = random.choice(scores) # noqa: S311
print(f"Random winner: {winner}")
# sample: pick k unique elements (without replacement)
batch = random.sample(scores, k=3) # noqa: S311
print(f"Random batch : {batch}")
# uniform: float in [a, b]
points: list[float] = [random.uniform(0.0, 1.0) for _ in range(5)] # noqa: S311
print(f"Uniform [0,1]: {[round(p, 3) for p in points]}")
# randint: integer in [a, b] (inclusive both ends)
dice_rolls: list[int] = [random.randint(1, 6) for _ in range(10)] # noqa: S311
print(f"Dice rolls : {dice_rolls}")
Shuffled : [68.0, 88.0, 94.0, 63.0, 91.0, 77.0, 72.0, 85.0]
Random winner: 85.0
Random batch : [85.0, 88.0, 68.0]
Uniform [0,1]: [0.032, 0.094, 0.233, 0.602, 0.561]
Dice rolls : [6, 6, 6, 5, 4, 2, 4, 5, 3, 1]
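The reproducibility claim can be verified without relying on any particular output values. The sketch below also uses random.Random, a standard-library class that gives each component its own generator instead of sharing the module-level one; using it here is a suggestion, not something this notebook requires:

```python
import random

# Two independent generators seeded with the same value
rng_a = random.Random(42)
rng_b = random.Random(42)

draws_a = [rng_a.randint(1, 6) for _ in range(5)]
draws_b = [rng_b.randint(1, 6) for _ in range(5)]
print(draws_a == draws_b)  # True: same seed, same sequence

# Re-seeding the module-level generator restarts its sequence as well
random.seed(0)
first = random.random()
random.seed(0)
second = random.random()
print(first == second)  # True
```

A dedicated Random instance keeps one part of a program from changing the sequence another part sees, which can matter when several functions all draw random numbers.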
json: Serialise Python Objects
json converts Python dicts, lists, strings, numbers, and booleans to a JSON string and back. It is the standard format for saving model configs and experiment results:
import json
# json: serialise / deserialise Python objects to JSON strings
run_result: dict[str, object] = {
"run_id": "exp-2024-001",
"model": "xgboost",
"accuracy": 0.923,
"loss": 0.218,
"tags": ["baseline", "production"],
}
json_str: str = json.dumps(run_result, indent=2)
print("JSON string:")
print(json_str)
loaded: dict[str, object] = json.loads(json_str)
print(f"Round-trip accuracy: {loaded['accuracy']}")
JSON string:
{
"run_id": "exp-2024-001",
"model": "xgboost",
"accuracy": 0.923,
"loss": 0.218,
"tags": [
"baseline",
"production"
]
}
Round-trip accuracy: 0.923
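json handles the built-in types listed above, but not arbitrary objects. A sketch of what happens with a dataclass instance, and one possible workaround via dataclasses.asdict(); RunResult is a hypothetical record mirroring the dict in the previous example:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunResult:  # hypothetical record, mirrors the dict above
    run_id: str
    accuracy: float
    tags: list[str]


result = RunResult(run_id="exp-2024-001", accuracy=0.923, tags=["baseline"])

# json.dumps() does not know how to encode a dataclass instance directly...
try:
    json.dumps(result)
except TypeError as exc:
    print(f"TypeError: {exc}")

# ...so convert it to a plain dict first, then rebuild it after loading.
payload = json.dumps(asdict(result))
restored = RunResult(**json.loads(payload))
print(restored == result)  # True
```

This round trip works because every field is itself a JSON-friendly type; fields such as datetime or Path would need their own conversion step.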
datetime represents a point in time. Always attach a UTC timezone (datetime.UTC on Python 3.11+, an alias of timezone.utc) to avoid ambiguous “naive” datetime objects that can silently shift across time zones:
from datetime import UTC, datetime
# datetime: timestamp experiments and logs
now = datetime.now(tz=UTC)
print(f"Timestamp : {now.isoformat()}")
print(f"Date part : {now.strftime('%Y-%m-%d')}")
Timestamp : 2026-06-18T09:54:16.238386+00:00
Date part : 2026-06-18
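The difference between naive and aware datetimes is easier to see with fixed values than with now(). A minimal sketch using an arbitrary example date:

```python
from datetime import datetime, timezone

naive = datetime(2024, 1, 15, 9, 30)                       # no tzinfo
aware = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)  # explicit UTC

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC

# Ordering a naive value against an aware one is an error, not a silent shift.
try:
    _ = naive < aware
except TypeError as exc:
    print(f"TypeError: {exc}")

# An aware timestamp survives an ISO 8601 round trip with its offset.
text = aware.isoformat()
print(text)  # 2024-01-15T09:30:00+00:00
print(datetime.fromisoformat(text) == aware)  # True
```

Keeping every timestamp aware from the start avoids the TypeError path entirely and makes log entries from different machines comparable.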
6. Exception Handling
An exception is an error that occurs while your program is running. By default, Python stops immediately and prints a traceback. Exception handling lets you catch the error, respond to it gracefully, and keep the program running.
# Without handling - program crashes:
int("abc") # ValueError: invalid literal for int() with base 10: 'abc'
# With handling - program continues:
try:
    int("abc")
except ValueError:
    print("That was not a number, skipping")
In data pipelines and ML training loops, unhandled exceptions can discard hours of computation. Always handle errors at system boundaries (user input, file I/O, APIs).
- try: code that might raise an exception
- except ExcType as e: handle a specific exception
- else: runs only if NO exception was raised in try
- finally: always runs, even if an exception propagates (use for cleanup)
Catch the most specific exception you can. A bare except: catches everything, including KeyboardInterrupt and SystemExit; except Exception: lets those two through but still hides unrelated bugs.
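The order in which the four clauses run can be traced with a small helper. trace is a hypothetical function written only for this illustration; the issubclass checks at the end show where KeyboardInterrupt sits in the built-in exception hierarchy:

```python
def trace(raw: str) -> list[str]:
    """Record which clauses run while converting raw to int (illustrative helper)."""
    steps: list[str] = []
    try:
        steps.append("try")
        int(raw)
    except ValueError:
        steps.append("except")
    else:
        steps.append("else")
    finally:
        steps.append("finally")
    return steps


print(trace("42"))   # ['try', 'else', 'finally']
print(trace("abc"))  # ['try', 'except', 'finally']

# KeyboardInterrupt and SystemExit sit outside the Exception branch,
# so only a bare `except:` (or `except BaseException:`) intercepts them.
print(issubclass(KeyboardInterrupt, Exception))      # False
print(issubclass(KeyboardInterrupt, BaseException))  # True
print(issubclass(ValueError, Exception))             # True
```

Note that else and except are mutually exclusive, while finally appears in both traces.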
def parse_score(raw: str) -> float:
    """Parse a score string and validate it is in [0, 100].
    Args:
        raw: String representation of a numeric score.
    Returns:
        Validated float score.
    Raises:
        ValueError: If raw is not numeric or out of range.
    """
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{raw!r} is not a valid number") from None
    if not 0 <= value <= 100:
        raise ValueError(f"Score {value} is out of range [0, 100]")
    return value
Test parse_score() against a range of inputs: valid numbers, out-of-range values, and non-numeric strings. The else clause runs only on the success path:
# Test parse_score against valid and invalid inputs
test_inputs: list[str] = ["87.5", "105", "abc", "-3", "72"]
for raw in test_inputs:
    try:
        score = parse_score(raw)
    except ValueError as exc:
        print(f" {raw!r:<8} -> ERROR: {exc}")
    else:
        print(f" {raw!r:<8} -> OK: {score}")
'87.5' -> OK: 87.5
'105' -> ERROR: Score 105.0 is out of range [0, 100]
'abc' -> ERROR: 'abc' is not a valid number
'-3' -> ERROR: Score -3.0 is out of range [0, 100]
'72' -> OK: 72.0
finally runs regardless of whether an exception occurred or was handled. Use it for cleanup code (closing files, releasing connections) that must execute either way. This example illustrates the pattern using explicit open/close; in practice, always use with open(...) as fh instead (shown in Sec. 7):
# else runs ONLY when try succeeds; finally ALWAYS runs (cleanup guarantee)
def load_scores(filepath: str) -> list[float]:
    """Load numeric scores from a text file, one per line."""
    fh = None
    try:
        fh = open(filepath, encoding="utf-8") # noqa: SIM115, PTH123
        lines = fh.readlines()
    except FileNotFoundError:
        print(f"File not found: {filepath!r}")
        return []
    except PermissionError as exc:
        print(f"Permission denied: {exc}")
        return []
    else:
        print(f"Loaded {len(lines)} lines successfully")
        return [float(line.strip()) for line in lines if line.strip()]
    finally:
        if fh is not None:
            fh.close()
            print("File handle closed")
finally runs on every path, including the early returns. If the file does not exist, open() raises before fh is assigned, so there is no handle to close; if the file opens and a later step fails, the handle is still closed:
# NOTE: prefer `with open(...) as fh` in practice (shown in Sec. 7).
# This example uses explicit open/close to make else/finally visible.
result = load_scores("nonexistent.txt")
print(f"Result: {result}")
File not found: 'nonexistent.txt'
Result: []
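For comparison, here is roughly what the same loader might look like with a context manager, the form recommended for real code. load_scores_with is a hypothetical name, and the sketch writes its own temporary file so it can run anywhere:

```python
import tempfile
from pathlib import Path


def load_scores_with(filepath: Path) -> list[float]:
    """Variant of load_scores using a context manager (illustrative sketch)."""
    try:
        with filepath.open(encoding="utf-8") as fh:
            lines = fh.readlines()
    except FileNotFoundError:
        print(f"File not found: {str(filepath)!r}")
        return []
    # The with block has already closed fh by the time we get here.
    return [float(line.strip()) for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scores.txt"
    path.write_text("87.5\n\n72\n", encoding="utf-8")
    print(load_scores_with(path))                       # [87.5, 72.0]
    print(load_scores_with(Path(tmp) / "missing.txt"))  # [] (after the message)
```

The with statement takes over the job of the finally clause, so the function needs neither the fh = None sentinel nor an explicit close().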
Custom Exception Classes
Subclass a built-in exception to give callers a specific type to catch. Store the structured context as instance attributes for programmatic access:
# Custom exception classes give callers something specific to catch
class DataValidationError(ValueError):
    """Raised when a data record fails validation."""
    def __init__(self, field: str, value: object, reason: str) -> None:
        self.field = field
        self.value = value
        self.reason = reason
        super().__init__(f"Validation failed for {field!r}={value!r}: {reason}")
Define a validator that raises DataValidationError with field-level context, then test it. The except DataValidationError clause catches only your custom type, not any accidental ValueError from elsewhere in the code:
def validate_student(record: dict[str, object]) -> None:
    """Validate a student record dict against required field constraints."""
    gpa = record.get("gpa")
    if not isinstance(gpa, int | float):
        raise DataValidationError("gpa", gpa, "must be numeric")
    if not 0.0 <= float(gpa) <= 4.0:
        raise DataValidationError("gpa", gpa, "must be in [0.0, 4.0]")
    name = record.get("name", "")
    if not isinstance(name, str) or not name.strip():
        raise DataValidationError("name", name, "must be a non-empty string")
Test against valid and invalid records. The custom exception prints exactly which field failed and why:
test_records: list[dict[str, object]] = [
{"name": "Alice", "gpa": 3.95}, # valid
{"name": "Bob", "gpa": 5.0}, # GPA out of range
{"name": "", "gpa": 3.5}, # empty name
]
for rec in test_records:
    try:
        validate_student(rec)
        print(f" {rec['name']!r:<10} -> valid")
    except DataValidationError as exc:
        print(f" {rec.get('name')!r:<10} -> {exc}")
'Alice' -> valid
'Bob' -> Validation failed for 'gpa'=5.0: must be in [0.0, 4.0]
'' -> Validation failed for 'name'='': must be a non-empty string
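Storing context as attributes pays off when the caller needs to act on the failure rather than just print it. In this sketch the exception class is repeated so the block runs on its own, and the failures tally is a hypothetical use case:

```python
class DataValidationError(ValueError):
    """Same shape as the class defined above, repeated so the sketch runs alone."""

    def __init__(self, field: str, value: object, reason: str) -> None:
        self.field = field
        self.value = value
        self.reason = reason
        super().__init__(f"Validation failed for {field!r}={value!r}: {reason}")


failures: dict[str, int] = {}  # hypothetical per-field failure tally

for gpa in (3.5, 5.0, -1.0):
    try:
        if not 0.0 <= gpa <= 4.0:
            raise DataValidationError("gpa", gpa, "must be in [0.0, 4.0]")
    except DataValidationError as exc:
        # Attributes give structured access; no need to parse the message.
        failures[exc.field] = failures.get(exc.field, 0) + 1

print(failures)  # {'gpa': 2}

# Because it subclasses ValueError, broader handlers still catch it.
try:
    raise DataValidationError("name", "", "must be a non-empty string")
except ValueError as exc:
    print(type(exc).__name__)  # DataValidationError
```

Subclassing ValueError keeps existing except ValueError handlers working, while new code can opt into the narrower type.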
Goal: Write parse_batch(rows) that returns (valid, errors): a list of successfully parsed floats and a list of error messages.
rows = ['85.0', '92', 'n/a', '-5', '78.5', '110', '63']
valid, errors = parse_batch(rows)
# valid = [85.0, 92.0, 78.5, 63.0]
# errors = ["'n/a' is not a valid number",
#           "'-5' out of range [0, 100]",
#           "'110' out of range [0, 100]"]
def parse_batch(rows: list[str]) -> tuple[list[float], list[str]]:
    """Parse a batch of score strings, separating valid from invalid.
    Args:
        rows: List of raw score strings.
    Returns:
        Tuple of (valid_scores, error_messages).
    """
    valid: list[float] = []
    errors: list[str] = []
    # TODO: iterate rows, use parse_score from above, collect results
    return valid, errors
rows: list[str] = ["85.0", "92", "n/a", "-5", "78.5", "110", "63"]
valid, errors = parse_batch(rows)
print(f"valid = {valid}")
print(f"errors = {errors}")
valid = []
errors = []
7. File I/O with pathlib
File I/O (Input/Output) means reading data from files on disk and writing results back. Almost every data science workflow starts by loading a CSV, JSON, or Parquet file and ends by saving results somewhere.
pathlib.Path is the modern Python way to work with file paths. It is cross-platform (works on Windows, macOS, and Linux without changes) and composable:
from pathlib import Path
data_dir = Path('tutorials') / 'data' # / joins path parts
csv_file = data_dir / 'students.csv'
print(csv_file) # tutorials/data/students.csv
Key Concept: pathlib.Path, the Modern Way to Handle Paths
Since Python 3.4, pathlib.Path is the standard for file-system work. It is cross-platform, composable with /, and carries methods for existence checks, directory creation, and reading/writing, all in one object.
Always use with open(…) as fh (context manager) so the file is closed automatically, even if an exception occurs.
Common Mistake: Bare String Paths
open('data/file.csv') works but gives you no path-manipulation methods and is fragile on Windows vs. macOS/Linux. Use Path('data') / 'file.csv' instead.
from pathlib import Path
# Path composition: / operator joins parts, cross-platform
project_root: Path = Path()
data_dir: Path = project_root / "data"
output_file: Path = data_dir / "results" / "run_001.json"
print(f"data_dir : {data_dir}")
print(f"output_file : {output_file}")
data_dir : data
output_file : data/results/run_001.json
Every Path object knows its own parts: no string slicing to extract a filename or extension. mkdir(exist_ok=True) is the safest way to create a directory (no error if it already exists):
from pathlib import Path
# Path properties: inspect parts of a path without string slicing
p = Path("tutorials/03-data-analysis/data/primary.csv")
print(f"p.name : {p.name}") # 'primary.csv'
print(f"p.stem : {p.stem}") # 'primary'
print(f"p.suffix : {p.suffix}") # '.csv'
print(f"p.parent : {p.parent}") # 'tutorials/03-data-analysis/data'
print(f"p.parts : {p.parts}")
# Safe directory creation: no error if it already exists
tmp_dir = Path("tmp_activity")
tmp_dir.mkdir(exist_ok=True)
print(f"\ntmp_dir exists: {tmp_dir.exists()}")
p.name : primary.csv
p.stem : primary
p.suffix : .csv
p.parent : tutorials/03-data-analysis/data
p.parts : ('tutorials', '03-data-analysis', 'data', 'primary.csv')
tmp_dir exists: True
Reading & Writing Files
Always use the with statement. It closes the file automatically, even if an exception occurs. DictWriter writes rows as dicts keyed by column name:
import csv
from pathlib import Path
tmp = Path("tmp_activity")
tmp.mkdir(exist_ok=True)
csv_path = tmp / "students.csv"
rows: list[dict[str, object]] = [
{"name": "Alice Kamau", "gpa": 3.95, "major": "CS"},
{"name": "Bob Mwangi", "gpa": 3.45, "major": "Math"},
{"name": "Carol Osei", "gpa": 3.88, "major": "CS"},
]
with csv_path.open("w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["name", "gpa", "major"])
    writer.writeheader()
    writer.writerows(rows)
print(f"Wrote: {csv_path}")
Wrote: tmp_activity/students.csv
DictReader reads each row back as a dict keyed by header names, with no positional index access needed:
import csv
from pathlib import Path
csv_path = Path("tmp_activity") / "students.csv"
with csv_path.open(encoding="utf-8") as fh:
    reader = csv.DictReader(fh)
    loaded: list[dict[str, str]] = list(reader)
for row in loaded:
    print(f" {row['name']:<15} GPA={row['gpa']} {row['major']}")
 Alice Kamau     GPA=3.95 CS
 Bob Mwangi      GPA=3.45 Math
 Carol Osei      GPA=3.88 CS
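Note that DictReader returns every value as a string (hence the dict[str, str] annotation above), so numeric columns need an explicit conversion before any arithmetic. The following is a minimal sketch of that step. It writes to a throwaway temporary directory, and the names typed_rows and mean_gpa are illustrative assumptions, not part of the notebook's running example.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"name": "Alice Kamau", "gpa": 3.95, "major": "CS"},
    {"name": "Bob Mwangi", "gpa": 3.45, "major": "Math"},
]

# Work in a throwaway directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_name:
    csv_path = Path(tmp_name) / "students.csv"

    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["name", "gpa", "major"])
        writer.writeheader()
        writer.writerows(rows)

    with csv_path.open(newline="", encoding="utf-8") as fh:
        loaded = list(csv.DictReader(fh))

# Every value read from CSV is a str, including the GPA column
raw_gpa = loaded[0]["gpa"]
print(type(raw_gpa).__name__, repr(raw_gpa))

# Convert explicitly before doing arithmetic (typed_rows is a hypothetical name)
typed_rows = [{**row, "gpa": float(row["gpa"])} for row in loaded]
mean_gpa = sum(row["gpa"] for row in typed_rows) / len(typed_rows)
print(f"mean GPA: {mean_gpa:.2f}")
```

Skipping the float() call would not raise immediately in every case: comparing or sorting the raw strings runs without error but orders them lexically, which is the kind of silent bug worth guarding against.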
For single-document JSON files, Path.write_text() + json.dumps() and Path.read_text() + json.loads() is the most concise round-trip:
import json
from pathlib import Path
tmp = Path("tmp_activity")
json_path = tmp / "run_result.json"
run_data = {"run_id": "exp-001", "accuracy": 0.923, "tags": ["baseline"]}
# Write: write_text is the cleanest one-liner for JSON
json_path.write_text(json.dumps(run_data, indent=2), encoding="utf-8")
print(f"Wrote: {json_path}")
# Read: read_text + json.loads
reloaded: dict[str, object] = json.loads(json_path.read_text(encoding="utf-8"))
print(f"Read back: {reloaded}")
Wrote: tmp_activity/run_result.json
Read back: {'run_id': 'exp-001', 'accuracy': 0.923, 'tags': ['baseline']}
Finding Files
Path.iterdir() yields the immediate children of a directory. Path.rglob(pattern) searches the entire subtree recursively:
from pathlib import Path
tmp = Path("tmp_activity")
print("Files in tmp_activity:")
for f in sorted(tmp.iterdir()):
    size = f.stat().st_size
    print(f" {f.name:<30} {size:>6} bytes")
Files in tmp_activity:
 run_result.json                    78 bytes
 students.csv                       79 bytes
rglob('*.ipynb') finds all matching files at any depth. After exploring, clean up the temporary directory with shutil.rmtree():
from pathlib import Path
import shutil
# rglob: recursive search by pattern
notebooks = list(Path("tutorials").rglob("*.ipynb"))
print(f"Notebooks found: {len(notebooks)}")
for nb in sorted(notebooks)[:5]:
    print(f" {nb}")
# Clean up tmp directory
tmp = Path("tmp_activity")
shutil.rmtree(tmp)
print(f"\nCleaned up: {tmp} exists = {tmp.exists()}")
Notebooks found: 0
Cleaned up: tmp_activity exists = False
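Because no tutorials directory existed when the cell above ran, the search found nothing. The sketch below makes the difference between the search methods visible on a small, self-contained tree. The directory layout and file names are hypothetical, and Path.glob() is included for contrast even though the text above does not cover it.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_name:
    root = Path(tmp_name)

    # Hypothetical layout: one file at the top level, one nested two levels down
    (root / "top.ipynb").write_text("{}", encoding="utf-8")
    nested_dir = root / "part1" / "drafts"
    nested_dir.mkdir(parents=True)
    (nested_dir / "deep.ipynb").write_text("{}", encoding="utf-8")

    # iterdir: immediate children only (files and directories alike)
    children = sorted(p.name for p in root.iterdir())

    # glob: pattern match in this directory only
    shallow = sorted(p.name for p in root.glob("*.ipynb"))

    # rglob: pattern match at any depth below root
    recursive = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.ipynb"))

print(f"iterdir : {children}")   # ['part1', 'top.ipynb']
print(f"glob    : {shallow}")    # ['top.ipynb']
print(f"rglob   : {recursive}")  # ['part1/drafts/deep.ipynb', 'top.ipynb']
```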
Creating & Checking Directories
Path.mkdir() creates directories; Path.exists() and Path.is_dir() check state without raising an error. Always prefer mkdir(parents=True, exist_ok=True) over conditionally calling os.makedirs():
from pathlib import Path
results_dir = Path("results") / "experiment_001"
print(f"Exists before : {results_dir.exists()}")
# parents=True creates any missing parent directories
# exist_ok=True is silent if the directory already exists
results_dir.mkdir(parents=True, exist_ok=True)
print(f"Exists after : {results_dir.exists()}")
print(f"Is directory : {results_dir.is_dir()}")
# Write a file into the new directory
log_file = results_dir / "metrics.txt"
log_file.write_text("accuracy=0.923\nval_loss=0.218\n")
print(f"Log file size : {log_file.stat().st_size} bytes")
# Clean up
log_file.unlink()
results_dir.rmdir()
results_dir.parent.rmdir()
print("Cleaned up")
Exists before : False
Exists after : True
Is directory : True
Log file size : 30 bytes
Cleaned up
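To see what the two flags protect against, the following sketch calls mkdir() without them and records which exception each omission produces. It runs inside a temporary directory, and the variable names are illustrative assumptions.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_name:
    target = Path(tmp_name) / "results" / "experiment_001"

    # Without parents=True, a missing parent directory is an error
    try:
        target.mkdir()
    except FileNotFoundError:
        missing_parent_error = True
    else:
        missing_parent_error = False

    target.mkdir(parents=True, exist_ok=True)

    # Without exist_ok=True, calling mkdir on an existing directory is an error
    try:
        target.mkdir()
    except FileExistsError:
        exists_error = True
    else:
        exists_error = False

    # With both flags the call can be repeated safely
    target.mkdir(parents=True, exist_ok=True)
    is_dir = target.is_dir()

print(f"missing parent raised : {missing_parent_error}")
print(f"existing dir raised   : {exists_error}")
print(f"directory present     : {is_dir}")
```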
Goal: Write a function that appends an experiment result as a JSON line to a log file, then reads and prints all logged runs.
log_experiment(Path('runs.jsonl'), run_id='run-001', accuracy=0.901, loss=0.312)
log_experiment(Path('runs.jsonl'), run_id='run-002', accuracy=0.923, loss=0.218)
# runs.jsonl contents:
# {"run_id": "run-001", "accuracy": 0.901, "loss": 0.312}
# {"run_id": "run-002", "accuracy": 0.923, "loss": 0.218}
Hint: JSONL (JSON Lines), one JSON object per line, is the standard format for streaming experiment logs. Use mode='a' to append.
import json
from pathlib import Path
def log_experiment(log_path: Path, **metrics: object) -> None:
    """Append an experiment result as a JSON line to log_path.

    Args:
        log_path: Path to the .jsonl log file (created if absent).
        **metrics: Any metric name/value pairs to record.
    """
    ...  # TODO
log_path = Path("runs.jsonl")
if log_path.exists():
    log_path.unlink()  # start fresh for this activity
log_experiment(log_path, run_id="run-001", accuracy=0.901, loss=0.312)
log_experiment(log_path, run_id="run-002", accuracy=0.923, loss=0.218)
# Read back and print all runs
print("Logged runs:")
for line in log_path.read_text(encoding="utf-8").splitlines():
    run = json.loads(line)
    print(f" {run}")
log_path.unlink()  # clean up
Logged runs:
FileNotFoundError: [Errno 2] No such file or directory: 'runs.jsonl'
Until log_experiment is implemented, the stub writes nothing, so runs.jsonl never exists and read_text() raises the FileNotFoundError shown above.
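One possible way to fill in the TODO is sketched here, following the hint about append mode. It is not the only valid solution. To stay self-contained it logs into a temporary directory rather than the working directory, which is an assumption of this sketch and not part of the activity.

```python
import json
import tempfile
from pathlib import Path


def log_experiment(log_path: Path, **metrics: object) -> None:
    """Append one experiment result as a JSON line to log_path."""
    # Mode "a" creates the file if it is missing and never truncates it
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(metrics) + "\n")


with tempfile.TemporaryDirectory() as tmp_name:
    log_path = Path(tmp_name) / "runs.jsonl"
    log_experiment(log_path, run_id="run-001", accuracy=0.901, loss=0.312)
    log_experiment(log_path, run_id="run-002", accuracy=0.923, loss=0.218)

    # One JSON document per line, so parse line by line
    lines = log_path.read_text(encoding="utf-8").splitlines()
    runs = [json.loads(line) for line in lines]

print("Logged runs:")
for run in runs:
    print(f"  {run}")
```

Writing the trailing newline inside the function is what keeps each record on its own line; json.dumps() without indent never emits a newline itself.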
Why study gotchas?
The bugs in this section are silent: they do not raise an error. Python happily runs the code and produces the wrong answer. These patterns appear in real data pipelines and ML training scripts and can waste hours of debugging time.
Read through them now and you will recognise them instantly in the wild.
8. Common Gotchas
Key Concept: Bugs That Are Hard to See
The following patterns cause real bugs in data pipelines and ML code. None of them raise an exception. They silently produce wrong results. Learn to recognise them now so you never spend hours debugging them later.
# GOTCHA 1: Mutable default argument
# The default [] is created ONCE at function definition time - shared across all calls!
def append_score_bad(score: float, history: list[float] = []) -> list[float]: # noqa: B006
    history.append(score)
    return history
print("Bad default: the list leaks between calls:")
print(append_score_bad(82.0)) # [82.0] expected
print(append_score_bad(91.0)) # [82.0, 91.0] WRONG: previous call leaked in!
Bad default: the list leaks between calls:
[82.0]
[82.0, 91.0]
The fix: use None as the default sentinel and create a fresh list inside the function body on each call:
def append_score(score: float, history: list[float] | None = None) -> list[float]:
    if history is None:
        history = [] # new list created on every call where history is not provided
    history.append(score)
    return history
print("Fixed: independent list each time:")
print(append_score(82.0)) # [82.0]
print(append_score(91.0)) # [91.0] fresh list
# Rule: never use a mutable object (list, dict, set) as a default argument value.
# With @dataclass use field(default_factory=list) instead (shown in Sec. 4).
Fixed: independent list each time:
[82.0]
[91.0]
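To see where the leaked state in Gotcha 1 actually lives, you can inspect the function object: Python stores default values in the function's __defaults__ attribute, so a mutable default is a single object that every call shares. A short sketch that restates the buggy function so it runs on its own:

```python
def append_score_bad(score: float, history: list[float] = []) -> list[float]:  # noqa: B006
    history.append(score)
    return history


# The default list is stored on the function object itself
print(append_score_bad.__defaults__)   # ([],)

first = append_score_bad(82.0)
second = append_score_bad(91.0)

# Both calls returned the very same list object
print(first is second)                 # True

# ...and that object is the stored default, now mutated
print(append_score_bad.__defaults__)   # ([82.0, 91.0],)
```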
Gotcha 2: assignment is not a copy. b = a creates a second name for the same list. Shallow .copy() creates a new outer container but inner objects are still shared. Use copy.deepcopy() for fully independent nested structures:
# GOTCHA 2: Assignment is NOT a copy
# For nested structures, .copy() is a SHALLOW copy: inner objects are still shared.
import copy
original: list[list[int]] = [[1, 2], [3, 4]]
ref = original # same object
shallow_copy = original.copy() # new outer list, shared inner lists
deep_copy = copy.deepcopy(original) # completely independent
original[0].append(99)
print(f"original : {original}") # [[1, 2, 99], [3, 4]]
print(f"ref : {ref}") # [[1, 2, 99], [3, 4]] : same object
print(f"shallow_copy : {shallow_copy}") # [[1, 2, 99], [3, 4]] : inner list shared!
print(f"deep_copy : {deep_copy}") # [[1, 2], [3, 4]] : fully independent
original : [[1, 2, 99], [3, 4]]
ref : [[1, 2, 99], [3, 4]]
shallow_copy : [[1, 2, 99], [3, 4]]
deep_copy : [[1, 2], [3, 4]]
Gotcha 3: / vs //. Both divide, but // floors toward negative infinity, not toward zero. Note that // on floats returns a float, not an int:
# GOTCHA 3: / vs //: easy to confuse
print(f"7 / 2 = {7 / 2}") # 3.5 : true division, always float
print(f"7 // 2 = {7 // 2}") # 3 : floor, NOT truncate
print(f"-7 // 2 = {-7 // 2}") # -4 : floors toward negative infinity
print(f"7.5//2 = {7.5 // 2}") # 3.0 : floor of float is still float
7 / 2 = 3.5
7 // 2 = 3
-7 // 2 = -4
7.5//2 = 3.0
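The difference between flooring and truncating only shows up for negative operands, which is why it is easy to miss in testing. The sketch below puts the two side by side using math.trunc() and int(), both of which round toward zero, and uses divmod() to show how // pairs with %.

```python
import math

a, b = -7, 2

floored = a // b                 # floor: rounds toward negative infinity
truncated = math.trunc(a / b)    # truncation: rounds toward zero
as_int = int(a / b)              # int() on a float also truncates toward zero

print(f"{a} // {b}           = {floored}")     # -4
print(f"math.trunc({a} / {b}) = {truncated}")  # -3
print(f"int({a} / {b})        = {as_int}")     # -3

# divmod pairs // with %, so quotient * b + remainder == a always holds
quotient, remainder = divmod(a, b)
print(f"divmod({a}, {b})      = {(quotient, remainder)}")  # (-4, 1)
print(quotient * b + remainder == a)                       # True
```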
Gotchas 4 & 5: {} is a dict, and truthiness is not None-ness. {} creates an empty dict; an empty set must be written set(). Python treats 0, 0.0, None, [], and '' all as falsy, which silently breaks the common value or default pattern when 0.0 is a legitimate result:
# GOTCHA 4: {} creates a dict, not a set
empty1 = {}
empty2 = set()
print(f"type({{}}) : {type(empty1)}")
print(f"type(set()) : {type(empty2)}")
# GOTCHA 5: Boolean short-circuit: 0.0, None, '', [] are all falsy
score: float | None = None
result = score or 0.0 # 0.0 here; with any other fallback, a legitimate score of 0.0 would be replaced too
print(f'\n0.0 or "default" : {0.0 or "default"}') # noqa: SIM222
# Prefer an explicit None check
result2 = score if score is not None else 0.0
print(f"Explicit check : {result2}")
type({}) : <class 'dict'>
type(set()) : <class 'set'>
0.0 or "default" : default
Explicit check : 0.0
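The same truthiness trap appears when reading hyperparameters from a dict. The sketch below uses a hypothetical config with a legitimate dropout of 0.0 and compares the or pattern with two safer alternatives; the keys and fallback values are assumptions chosen only for illustration.

```python
# Hypothetical hyperparameters: a dropout of 0.0 is a deliberate choice here
config: dict[str, float] = {"dropout": 0.0, "lr": 0.01}

# `or` discards the legitimate 0.0 because 0.0 is falsy
dropout_or = config.get("dropout") or 0.5

# A default passed to .get() is used only when the key is missing
dropout_get = config.get("dropout", 0.5)

# An explicit None check also works when the key is absent
raw_momentum = config.get("momentum")
momentum = raw_momentum if raw_momentum is not None else 0.9

print(f"or pattern   : {dropout_or}")   # 0.5 (wrong: the 0.0 was overwritten)
print(f".get default : {dropout_get}")  # 0.0
print(f"momentum     : {momentum}")     # 0.9
```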
9. Capstone Exercises
Apply everything from Parts 1-3. Each exercise is self-contained. Attempt them without looking at previous sections first.
Goal: Given a list of student dicts, produce a formatted text report.
students = [
    {'name': 'Alice', 'scores': [88, 92, 85], 'major': 'CS'},
    {'name': 'Bob', 'scores': [62, 70, 58], 'major': 'Math'},
    {'name': 'Carol', 'scores': [91, 95, 89], 'major': 'CS'},
]
# Expected output:
# Name   Major  Avg   Grade
# Carol  CS     91.7  A
# Alice  CS     88.3  B
# Bob    Math   63.3  D
# (sorted by average score, descending)
students: list[dict[str, object]] = [
    {"name": "Alice", "scores": [88, 92, 85], "major": "CS"},
    {"name": "Bob", "scores": [62, 70, 58], "major": "Math"},
    {"name": "Carol", "scores": [91, 95, 89], "major": "CS"},
]
# TODO: produce the formatted report
...
Ellipsis
Goal: Define an Experiment dataclass, populate a list of runs, then print the best run by validation accuracy.
@dataclass
class Experiment:
    run_id: str
    model: str
    val_accuracy: float
    config: TrainingConfig # from Sec. 4
# Expected output:
# Best run: Experiment(run_id='run-002', model='xgboost', val_accuracy=0.934, ...)
from dataclasses import dataclass
@dataclass
class Experiment:
    """Record for a single training experiment."""
    run_id: str
    model: str
    val_accuracy: float
    # TODO: add more fields as needed
runs: list[Experiment] = [
    Experiment("run-001", "xgboost", 0.901),
    Experiment("run-002", "xgboost", 0.934),
    Experiment("run-003", "linear", 0.881),
]
# TODO: find and print the best run
best: Experiment = ...
print(f"Best run: {best}")
Best run: Ellipsis
Goal: Using a deque of size 5, flag any reading that deviates more than 2 standard deviations from the window mean.
readings = [36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8]
# Expected: reading 39.5 flagged as anomaly
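Hint: Use score_summary() from Sec. 1 (or inline the calculation). Compare each reading with the mean and standard deviation of the previous readings in the window before appending it: a value that is part of its own 5-element window can never lie more than 2 standard deviations from that window's mean, so it would never be flagged.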
from collections import deque
readings: list[float] = [36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8]
WINDOW_SIZE: int = 5
THRESHOLD_STD: float = 2.0
window: deque[float] = deque(maxlen=WINDOW_SIZE)
# TODO: detect and print anomalies
for _reading in readings:
    ...
A raw metric stream (sensor readings, training loss per step) can be noisy. A moving window average smooths it by replacing each value with the average of its neighbours within a fixed window.
Task: Implement moving_window_average(x, n_neighbors):
- For each element, average the n_neighbors elements on each side plus the element itself
- At the edges, pad with the boundary value (repeat the first/last element as needed)
- Return a list the same length as the input
losses = [0.95, 0.82, 0.91, 0.78, 0.65, 0.70, 0.60, 0.55]
moving_window_average(losses, n_neighbors=1)
# window of 3: each value = mean of itself + 1 left + 1 right
# → [0.907, 0.893, 0.837, 0.780, 0.710, 0.650, 0.617, 0.567]
Then: compute the average for n_neighbors in 1–4 and print the range (max − min) of each smoothed list. Does the range shrink as the window grows? Why?
def moving_window_average(x: list[float], n_neighbors: int = 1) -> list[float]:
    """Replace each value with the mean of itself and its n_neighbors on each side.

    Args:
        x: Input list of floats.
        n_neighbors: Number of neighbours on each side to include.

    Returns:
        Smoothed list of the same length as x.
    """
    n = len(x)
    width = n_neighbors * 2 + 1
    # Pad: repeat first/last element for missing neighbours at edges
    padded = [x[0]] * n_neighbors + x + [x[-1]] * n_neighbors
    # Mean of each width-sized window over the padded list
    return [sum(padded[i : i + width]) / width for i in range(n)]
training_loss: list[float] = [0.95, 0.82, 0.91, 0.78, 0.65, 0.70, 0.60, 0.55]
print("Original :", training_loss)
for k in range(1, 5):
    smoothed = moving_window_average(training_loss, n_neighbors=k)
    rng = round(max(smoothed) - min(smoothed), 4)
    print(f" n={k}: range={rng} {[round(v, 3) for v in smoothed]}")
Original : [0.95, 0.82, 0.91, 0.78, 0.65, 0.7, 0.6, 0.55]
 n=1: range=0.34 [0.907, 0.893, 0.837, 0.78, 0.71, 0.65, 0.617, 0.567]
 n=2: range=0.326 [0.916, 0.882, 0.822, 0.772, 0.728, 0.656, 0.61, 0.59]
 n=3: range=0.3086 [0.901, 0.859, 0.823, 0.773, 0.716, 0.677, 0.626, 0.593]
 n=4: range=0.27 [0.879, 0.851, 0.812, 0.768, 0.723, 0.679, 0.649, 0.609]
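One way to approach the "Why?" question: every smoothed value is an average of input values (with edge values repeated), so it can never fall outside the interval between the smallest and largest input. The smoothed range is therefore bounded by the original range. For this particular series the range also shrinks with each wider window, though that step-by-step shrinkage is not guaranteed for every input. The sketch below checks the bound; it restates the function so it runs on its own, and TOL is a hypothetical tolerance for floating-point rounding.

```python
def moving_window_average(x: list[float], n_neighbors: int = 1) -> list[float]:
    """Edge-padded moving average, restated so this sketch runs on its own."""
    width = n_neighbors * 2 + 1
    padded = [x[0]] * n_neighbors + x + [x[-1]] * n_neighbors
    return [sum(padded[i : i + width]) / width for i in range(len(x))]


training_loss = [0.95, 0.82, 0.91, 0.78, 0.65, 0.70, 0.60, 0.55]
lo, hi = min(training_loss), max(training_loss)
original_range = hi - lo
TOL = 1e-12  # hypothetical tolerance for floating-point rounding

ranges: list[float] = []
for k in range(1, 5):
    smoothed = moving_window_average(training_loss, n_neighbors=k)
    # An average of input values cannot leave the interval [lo, hi]
    assert all(lo - TOL <= v <= hi + TOL for v in smoothed)
    ranges.append(max(smoothed) - min(smoothed))

print(f"original range: {original_range:.4f}")
for k, r in enumerate(ranges, start=1):
    print(f"  n={k}: range={r:.4f}")
```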
Further Reading
| Resource | Why it matters |
|---|---|
| PEP 484 — Type Hints | The original proposal; reading it explains why certain type annotation rules exist |
| Google Python Style Guide | The docstring format used throughout this notebook comes from here |
| Van Rossum, G., Warsaw, B. & Coghlan, N. (2001). PEP 8 — Style Guide for Python Code. | The canonical style reference for all Python code |
| Hunt, A. & Thomas, D. (1999). The Pragmatic Programmer. | Chapters on DRY, orthogonality, and design-by-contract translate directly to the patterns in this notebook |
Summary
| Concept | Key rule |
|---|---|
| Functions | Annotate all params and return types; use Google-style docstrings |
| Defaults | Never use a mutable object as a default; use None and check inside |
| Lambda | Use for short key functions: sorted(items, key=lambda x: x['score']) |
| *args | Collect variable positional args into a tuple; unpack a list with *list |
| **kwargs | Collect variable keyword args into a dict; unpack a dict with **dict |
| @dataclass | Generated __init__, __repr__, __eq__; use frozen=True for immutability |
| field(default_factory=list) | The correct way to give a dataclass a mutable default |
| Imports | import module over from module import *; use conventional aliases |
| Exceptions | except SpecificError not bare except; use else for success path, finally for cleanup |
| pathlib.Path | Cross-platform paths; compose with /; read with .read_text(), write with .write_text() |
| Context manager | Always use with open(...) as fh; the file closes automatically |
| Gotchas | Mutable defaults, = is not copy, // floors, {} is dict |
Next: 04-numpy.ipynb, covering arrays, broadcasting, and vectorised operations with NumPy.