This is the one notebook in Part 3, because type annotations are pure Python: running annotated functions live shows the gap between what Python accepts at runtime and what a type checker would flag statically. Every example is executable.
Callout markers used throughout this notebook are explained on the book cover page.
Note: Learning Objectives
By the end of Part 15 you will be able to:
| # | Skill | Covered in |
|---|-------|------------|
| 1 | Explain why type annotations matter in DS code | Sec. 1 |
| 2 | Write annotated function signatures with basic types | Sec. 2 |
| 3 | Annotate numpy arrays with NDArray and pandas DataFrames | Sec. 3 |
| 4 | Use TypeAlias and Protocol for complex DS types | Sec. 4 |
| 5 | Interpret ty check output and fix type errors | Sec. 5 |
| 6 | Apply gradual typing: where to start and what to skip | Sec. 6 |
The annotated version is self-documenting: any editor with a type checker installed will warn you the moment you pass "82" instead of 82.0. The unannotated version gives no such warning, and how it fails depends on the arithmetic: "82" * 0.30 raises TypeError, but with an integer weight such as 30, "82" * 30 silently evaluates to the string "82" repeated 30 times. That is real Python behavior, not a hypothetical.
Python does not enforce annotations at runtime. That is the job of a static type checker. The annotation is documentation that a machine can check.
Key Concept: Annotations are documentation a machine can check
They tell collaborators and your future self what a function expects and returns, without writing a word of prose. A type checker like ty reads them and flags type mismatches before the code runs.
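A minimal sketch of what "not enforced at runtime" means in practice: Python stores annotations as metadata on the function object and otherwise ignores them. The function scale_score and its parameters are hypothetical, chosen only for illustration.

```python
def scale_score(score: float, factor: float) -> float:
    """Hypothetical helper: multiply a score by a factor."""
    return score * factor


# Annotations are stored as plain metadata on the function object.
print(scale_score.__annotations__)

# Python does not check them at call time: passing ints where floats are
# annotated works, and the result is an int rather than a float.
result = scale_score(82, 2)
print(result, type(result).__name__)
```

A static checker reads the same metadata from the source and reports mismatches without running anything.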
2. Basic Annotations
# The basic scalar types used in grade-predictor
from __future__ import annotations  # postpones annotation evaluation, so newer type syntax also works on Python 3.7-3.9

def compute_grade(
    midterm: float,
    final: float,
    project: float,
    weights: tuple[float, float, float] = (0.30, 0.45, 0.25),
) -> float:
    # Reject weights that do not sum to 1 (within a small tolerance)
    if abs(sum(weights) - 1.0) > 0.001:
        raise ValueError(f"weights must sum to 1, got {sum(weights):.3f}")
    return midterm * weights[0] + final * weights[1] + project * weights[2]

compute_grade(80.0, 85.0, 90.0)
84.75
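Python accepts the call compute_grade("82", 85.0, 90.0) without checking it against the annotation. The call still fails in this case, but only because multiplying a str by a float raises TypeError inside the function body. The annotation is a contract, not a runtime check: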
# Python does NOT enforce annotations at runtime
result = compute_grade("82", 85.0, 90.0)  # the annotation does not reject the str argument
print(result)  # never reached: "82" * 0.30 raises TypeError inside compute_grade
# The TypeError comes from str * float failing, not from the annotation
# With an int weight, "82" * weight would silently produce a repeated string

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 2
      1 # Python does NOT enforce annotations at runtime
----> 2 result = compute_grade("82", 85.0, 90.0)  # the annotation does not reject the str argument
      3 print(result)  # never reached: "82" * 0.30 raises TypeError inside compute_grade
      4 # The TypeError comes from str * float failing, not from the annotation
      5 # With an int weight, "82" * weight would silently produce a repeated string

Cell In[1], line 13, in compute_grade(midterm, final, project, weights)
     11     if abs(sum(weights) - 1.0) > 0.001:
     12         raise ValueError(f"weights must sum to 1, got {sum(weights):.3f}")
---> 13     return midterm * weights[0] + final * weights[1] + project * weights[2]

TypeError: can't multiply sequence by non-int of type 'float'
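The silent-failure case can be reproduced with an integer weight. This is a sketch under an assumption: weighted_points and its integer percentage weight are hypothetical and not part of grade-predictor.

```python
def weighted_points(score: float, weight_percent: int) -> float:
    """Hypothetical helper: score times an integer percentage weight."""
    return score * weight_percent


# Correct call: float * int gives a float.
correct = weighted_points(82.0, 30)
print(correct)

# Wrong call: str * int is sequence repetition, so Python raises nothing
# and returns a long string instead of a number.
wrong = weighted_points("82", 30)
print(type(wrong).__name__, len(wrong))
```

No exception is raised on the second call, so the bad value travels downstream until something else breaks. A type checker would flag the call before it ever runs.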
The full set of basic types used in DS function signatures:
| Type | Use for |
|------|---------|
| int | counts, indices |
| float | scores, rates, measurements |
| str | labels, column names, IDs |
| bool | flags, binary outcomes |
| int \| float | either, when both are valid |
| float \| None | an optional numeric value |
| list[float] | a sequence of floats |
| tuple[float, float, float] | a fixed-length sequence |
| dict[str, float] | a mapping from string keys to float values |
def grade_to_letter(average: float) -> str:
    if average >= 85:
        return "A"
    elif average >= 70:
        return "B"
    elif average >= 55:
        return "C"
    elif average >= 45:
        return "D"
    return "F"


def flag_at_risk(score: float | None, threshold: float = 50.0) -> bool:
    if score is None:
        return True  # missing score is treated as at-risk
    return score < threshold


def grade_summary(midterm: float, final: float, project: float) -> dict[str, float]:
    avg = compute_grade(midterm, final, project)
    return {"average": avg, "midterm": midterm, "final": final, "project": project}


grade_summary(80.0, 85.0, 90.0)
Goal: Annotate these three signatures. Include one with a float | None parameter for a nullable score, one that returns dict[str, float], and one that takes a list[str] of column names.
This is the gap in most Python type annotation tutorials. DS code is full of numpy arrays and pandas DataFrames, and the annotations for them are not obvious.
For numpy, use NDArray from numpy.typing:
import numpy as np
from numpy.typing import NDArray
import pandas as pd


def normalize(X: NDArray[np.float64]) -> NDArray[np.float64]:
    # Standardize each column: subtract its mean, divide by its standard deviation
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std


# NDArray[np.float64] fixes the dtype (64-bit floats) but not the shape;
# scores happens to be a 2D array
scores = np.array([[80.0, 85.0, 90.0], [70.0, 75.0, 80.0]])
normalize(scores)
array([[ 1., 1., 1.],
[-1., -1., -1.]])
For pandas, pd.DataFrame is the practical annotation, even though it carries no column-level information:
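A minimal sketch of a pd.DataFrame annotation on a filter function. The column names student_id and midterm_score are borrowed from elsewhere in this notebook; the sample rows and the function name filter_at_risk are assumptions made for illustration.

```python
import pandas as pd


def filter_at_risk(df: pd.DataFrame, threshold: float = 50.0) -> pd.DataFrame:
    """Return the rows whose midterm_score is below threshold."""
    # The annotation says a DataFrame goes in and a DataFrame comes out;
    # it says nothing about which columns must exist.
    return df[df["midterm_score"] < threshold]


# Hypothetical sample data
students = pd.DataFrame(
    {
        "student_id": ["S0001", "S0002", "S0003"],
        "midterm_score": [80.0, 45.0, 38.5],
    }
)
at_risk = filter_at_risk(students)
print(at_risk["student_id"].tolist())
```

A checker will catch a list or a dict passed in place of the DataFrame, but not a misspelled column name. That second kind of error only surfaces at runtime.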
Pro Tip: pd.DataFrame is practical; pandera adds column types
pd.DataFrame is a useful annotation even though it carries no column information. The next step is pandera.typing.DataFrame[Schema], which encodes column names and dtypes at the type level. Start with pd.DataFrame and graduate to pandera when you need column-level guarantees in a data pipeline.
Activity 2 - Annotate Array and DataFrame Functions
Goal: Write and annotate two functions: one that takes NDArray[np.float64] and returns a normalized array, and one that takes a pd.DataFrame and returns a filtered pd.DataFrame. Confirm both run correctly on the sample DataFrame above.
When the same complex type appears in many function signatures, give it a name. In Python 3.12 and later, the type statement creates a type alias with no imports:
evaluate(model: Predictor, …) accepts any object with predict and fit methods: sklearn’s LinearRegression, XGBRegressor, a custom class. No import of sklearn needed in the type signature. This is structural subtyping, and it keeps your utility functions independent of any specific ML library.
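A minimal sketch of the structural-subtyping idea described above. The Predictor protocol here is an assumed definition with fit and predict methods, and MeanPredictor is a hypothetical stand-in for a real model; neither is taken from grade-predictor or sklearn.

```python
from typing import Protocol

import numpy as np
from numpy.typing import NDArray

FloatArray = NDArray[np.float64]


class Predictor(Protocol):
    """Anything with compatible fit and predict methods satisfies this."""

    def fit(self, X: FloatArray, y: FloatArray) -> object: ...

    def predict(self, X: FloatArray) -> FloatArray: ...


class MeanPredictor:
    """Hypothetical model: always predicts the mean of the training targets."""

    def __init__(self) -> None:
        self.mean_ = 0.0

    def fit(self, X: FloatArray, y: FloatArray) -> "MeanPredictor":
        self.mean_ = float(y.mean())
        return self

    def predict(self, X: FloatArray) -> FloatArray:
        return np.full(X.shape[0], self.mean_)


def evaluate(model: Predictor, X: FloatArray, y: FloatArray) -> float:
    """Fit the model and return its mean absolute error on the same data."""
    model.fit(X, y)
    return float(np.abs(model.predict(X) - y).mean())


X = np.array([[80.0], [70.0], [60.0]])
y = np.array([85.0, 75.0, 65.0])
mae = evaluate(MeanPredictor(), X, y)
print(mae)
```

MeanPredictor never inherits from Predictor. The checker accepts it because the method names and signatures line up.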
5. Running ty
Install ty and run it on the grade-predictor source:
uv add --optional dev ty
uv run ty check src/
Reading the output: each line is file:line:col: error[code] message. Errors must be fixed. Warnings are suggestions.
Common errors in DS code:
# Simulate what ty would flag:

# 1. Return type mismatch
def get_threshold() -> float:
    return "50.0"  # str, not float -- ty flags this


# 2. Argument type mismatch
def double_score(score: float) -> float:
    return score * 2


result = double_score("82")  # str passed as float -- ty flags this


# 3. Optional not handled
def safe_grade(score: float | None) -> str:
    return grade_to_letter(score)  # score might be None -- ty flags this


# Correct version
def safe_grade_fixed(score: float | None) -> str:
    if score is None:
        return "N/A"
    return grade_to_letter(score)


safe_grade_fixed(None)
'N/A'
Configure ty in pyproject.toml:
[tool.ty.environment]
python-version = "3.12"
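Third-party packages that lack type stubs can flood the output with errors that are not about your code. Pandas stubs are partial; great-tables has no stubs. The --ignore-missing-imports flag that suppresses this noise belongs to mypy; ty does not appear to offer a flag of that name. In ty the closest equivalent is to lower the severity of the noisy rule, for example with --ignore unresolved-import on the command line or an entry under [tool.ty.rules]. Check the ty documentation for the version you have installed, and use this only when third-party noise hides real errors in your own code.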
Activity 3 - Run ty and Fix Errors
Goal: Add full type annotations to core.py. Run uv run ty check src/. Fix every error (not warning) that ty reports in your own code. Confirm the output is clean before moving on.
uv run ty check src/
# Fix each error line by line
uv run ty check src/ # should report 0 errors
# TODO: annotate core.py fully and run ty check
6. Gradual Typing: Where to Start
You do not need to annotate everything at once. Gradual typing means adding annotations incrementally, in the order that buys the most value.
Priorities for a DS codebase:
1. Public function signatures first: what callers see
2. Return types before argument types: return type mismatches catch more bugs
3. Skip internal helpers and one-off notebook cells initially
4. Use Any as a placeholder when you need to annotate something complex you will refine later
Any is not giving up. It is a marker that says: this is unannotated, I know it, I will return to it.
from typing import Any


# Acceptable as a placeholder during gradual annotation
def process_raw_data(data: Any) -> pd.DataFrame:
    # Will be refined once the input schema is settled
    return pd.DataFrame(data)


# The same function with a more specific type once the schema is known
def process_records(data: list[dict[str, Any]]) -> pd.DataFrame:
    return pd.DataFrame(data)


# Try the refined version
process_records([{"student_id": "S0001", "midterm_score": 80.0}])
  student_id  midterm_score
0      S0001           80.0
Common Mistake: Annotating everything at once
Trying to annotate a 2000-line codebase in a single session produces two outcomes: you give up halfway, or you annotate things badly and introduce incorrect type information that misleads the checker. Start with the five most-called public functions. Get them clean. Move on.
Capstone: Fully Annotate core.py
Bring grade-predictor/src/grade_predictor/core.py to zero type errors.
Capstone - Zero ty Errors
Goal:
Annotate every function in core.py: compute_grade, grade_to_letter, flag_at_risk, add_average_marks
Use NDArray[np.float64] for any numpy array parameters
Use pd.DataFrame and pd.Series for pandas types
Run uv run ty check src/ and bring it to zero errors