Part 17: Testing DS Code with pytest

DS-MLOps Dev Tools

Python 3.12+ | Author: Anthony Faustine

Before you begin

This chapter assumes you have completed Part 13 through Part 16. The grade-predictor project should have typed, linted code under version control. This chapter writes tests for it.

The testing patterns here apply to any DS function or data pipeline step, not just grade-predictor. If you have written tests before, focus on section 4: it covers DataFrame testing and schema assertions, which are rarely covered in standard pytest tutorials.

Topics covered

Topic	Why it matters
What pytest is	A test runner that replaces unittest with far less boilerplate
Test discovery and structure	pytest finds and runs tests automatically without registration
`@pytest.mark.parametrize`	One test body, many inputs: all boundary cases in a table
Fixtures and `conftest.py`	Reusable, scoped test data shared across all test files
DataFrame testing	`assert_frame_equal`, shape checks, dtype and schema assertions
Coverage measurement	Which code paths run; enforce a minimum with `--cov-fail-under`
Mocking I/O	Replace file reads and API calls with controlled substitutes
Modern built-in fixtures	`tmp_path`, `monkeypatch`, `capfd`: no import, no extra package

Callout markers used throughout this chapter are explained on the book cover page.

Learning Objectives

By the end of Part 17 you will be able to:

#	Skill	Covered in
0	Explain what pytest is, how it compares to unittest, and install it	Sec. 0
1	Write and run pytest tests for DS functions using essential CLI flags	Sec. 1
2	Use `@pytest.mark.parametrize` to test multiple inputs without duplicating code	Sec. 2
3	Write fixtures that provide reusable sample DataFrames	Sec. 3
4	Test pandas transforms with `assert_frame_equal`, `pytest.approx`, and schema assertions	Sec. 4
5	Test exception handling with `pytest.raises`	Sec. 5
6	Measure test coverage and set a minimum threshold	Sec. 6
7	Organize a test suite for a DS project	Sec. 7
8	Mock external dependencies (APIs, file reads) with `unittest.mock.patch`	Sec. 8
9	Use built-in pytest fixtures: `tmp_path`, `monkeypatch`, `capfd`	Sec. 9

0. What is pytest and Why Use It

Python ships with a built-in test runner called unittest. It works, but it forces you to subclass TestCase, call self.assertEqual(...) instead of assert, and write substantial boilerplate before a single assertion runs. pytest is the community standard for a reason: it discovers tests automatically, uses plain assert statements, and provides richer output with no extra effort.

flowchart LR
    A["unittest"] -->|"subclass TestCase\nself.assertEqual\nsetUp / tearDown"| B["works\nbut verbose"]
    C["pytest"] -->|"plain assert\nautodiscovery\nrich output"| D["same tests\nhalf the code"]

    style B fill:#FEF2F2,stroke:#DC2626,color:#991B1B
    style D fill:#EBF5F0,stroke:#059669,color:#065F46

Install pytest and the two plugins you will use throughout this chapter:

uv add --dev pytest pytest-cov pytest-mock

pytest-cov measures which lines of code are exercised. pytest-mock wraps unittest.mock.patch as a fixture so mocking reads more cleanly.

Verify the installation:

uv run pytest --version

Key Concept: pytest discovers tests by naming convention

pytest collects any file matching test_*.py or *_test.py, and within those files any function starting with test_. No registration, no base class, no import of pytest is required in the test file itself. Name your files and functions correctly and pytest finds them automatically.
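The file-name rule can be illustrated with the standard library's fnmatch. This is only a sketch of the two default patterns, not pytest's actual collector, and the helper name would_be_collected is made up for illustration:

```python
from fnmatch import fnmatch

# pytest's default file patterns (the `python_files` setting).
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def would_be_collected(filename: str) -> bool:
    """Return True if the file name matches one of the default patterns."""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PATTERNS)


for name in ["test_core.py", "core_test.py", "core_tests.py", "testcore.py"]:
    print(f"{name:15} -> {would_be_collected(name)}")
```

The last two names are easy to write by accident: a plural suffix or a missing underscore is enough for a file to be skipped without any warning.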

1. Why DS Code Needs Tests

A normalization step fitted on the full dataset before the train/test split instead of on the training set after it. A fillna(0) that should have been fillna(df["score"].mean()). A boolean column that silently became an integer. These bugs do not raise exceptions; they produce wrong outputs quietly.

In DS, silent bugs are the hard ones. A test suite converts “I think this is right” into “this is verifiably correct, and stays correct when the code changes.”

flowchart LR
    A["pytest discovers\ntest_*.py files"] --> B["run @pytest.fixture\nsetup phase"]
    B --> C["execute test body\nassert statements"]
    C --> D{assert passes?}
    D -->|yes| E["PASSED"]
    D -->|no| F["FAILED\nfull traceback"]
    E & F --> G["teardown\n(yield fixtures)"]
    G --> H["next test"]

    style E fill:#EBF5F0,stroke:#059669,color:#065F46
    style F fill:#FEF2F2,stroke:#DC2626,color:#991B1B

Here is the simplest possible test for compute_grade:

# tests/test_core.py
from grade_predictor.core import compute_grade

def test_compute_grade_defaults():
    result = compute_grade(midterm=80.0, final=85.0, project=90.0)
    assert abs(result - 84.75) < 0.01   # 0.30*80 + 0.45*85 + 0.25*90 = 84.75

Run it:

uv run pytest tests/ -v

The -v flag shows each test name and its pass/fail status. Without it, pytest shows only a dot per passing test. A handful of flags cover the vast majority of what you need day-to-day:

Flag	What it does
`-v`	Verbose: show each test name
`-x`	Stop on first failure
`-k "grade"`	Run only tests whose name contains “grade”
`--tb=short`	Shorter traceback: location and assertion only, no full stack
`-q`	Quiet: one character per test, good for large suites
`--last-failed`	Re-run only the tests that failed in the previous run

uv run pytest tests/ -x --tb=short   # stop on first failure with compact output
uv run pytest tests/ -k "parametrize" # run tests matching a keyword

Key Concept: A test is an executable specification

test_compute_grade_defaults says: given these inputs, this function must return this value. If the function ever changes and breaks this, the test fails immediately. A failing test is not a problem; it is useful information. A passing test on wrong code is the actual problem.
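The chapter does not show compute_grade itself. A minimal sketch of what it might look like, assuming the weights named in the test comment (0.30 midterm, 0.45 final, 0.25 project); the real grade-predictor implementation may differ:

```python
import math

# Assumed default weights, taken from the comment in test_compute_grade_defaults.
DEFAULT_WEIGHTS = (0.30, 0.45, 0.25)  # midterm, final, project


def compute_grade(
    midterm: float,
    final: float,
    project: float,
    weights: tuple[float, float, float] = DEFAULT_WEIGHTS,
) -> float:
    """Return the weighted composite of the three component scores."""
    w_mid, w_final, w_proj = weights
    return w_mid * midterm + w_final * final + w_proj * project


# 0.30*80 = 24.0, 0.45*85 = 38.25, 0.25*90 = 22.5, total 84.75
result = compute_grade(midterm=80.0, final=85.0, project=90.0)
assert math.isclose(result, 84.75, abs_tol=0.01)
```

Working the arithmetic by hand before writing the expected value is part of the specification: the test is only as trustworthy as the number it compares against.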

2. Parametrize: One Test, Many Inputs

Writing one test function per boundary case produces a lot of duplicated code. @pytest.mark.parametrize runs the same test body once per row in a table of inputs.

Grade boundaries in grade-predictor:

import pytest
from grade_predictor.core import grade_to_letter

@pytest.mark.parametrize("midterm,final,project,expected", [
    (90.0, 92.0, 88.0, "A"),   # composite >= 85
    (75.0, 78.0, 80.0, "B"),   # composite >= 70
    (55.0, 60.0, 58.0, "C"),   # composite >= 55
    (45.0, 48.0, 50.0, "D"),   # composite >= 45
    (20.0, 25.0, 30.0, "F"),   # composite < 45
])
def test_grade_letter_boundaries(midterm, final, project, expected):
    assert grade_to_letter(midterm, final, project) == expected

Each row becomes a separate test case in the output:

tests/test_core.py::test_grade_letter_boundaries[90.0-92.0-88.0-A] PASSED
tests/test_core.py::test_grade_letter_boundaries[75.0-78.0-80.0-B] PASSED
...

Adding a new edge case is one extra row. Without parametrize, it is a new function. Parametrize also makes it easy to spot which case failed: the bracket notation shows the exact inputs that produced the failure.
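The rows in the table sit inside each grade band. To pin down the thresholds themselves, it can help to test values on either side of each cut-off. A sketch, using a hypothetical letter_from_composite helper whose thresholds are assumed from the comments in the table; pytest.param with id gives each case a readable name in the output:

```python
import pytest


def letter_from_composite(composite: float) -> str:
    """Hypothetical helper: map a composite score to a letter grade."""
    if composite >= 85:
        return "A"
    if composite >= 70:
        return "B"
    if composite >= 55:
        return "C"
    if composite >= 45:
        return "D"
    return "F"


@pytest.mark.parametrize(
    "composite,expected",
    [
        pytest.param(85.0, "A", id="A-lower-edge"),
        pytest.param(84.99, "B", id="just-below-A"),
        pytest.param(70.0, "B", id="B-lower-edge"),
        pytest.param(55.0, "C", id="C-lower-edge"),
        pytest.param(45.0, "D", id="D-lower-edge"),
        pytest.param(44.99, "F", id="just-below-D"),
    ],
)
def test_letter_thresholds(composite, expected):
    assert letter_from_composite(composite) == expected
```

With ids, a failure reads as test_letter_thresholds[just-below-A] rather than a string of numbers, which makes an off-by-one in a comparison operator easy to spot.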

Activity 1 - Parametrize a Boundary Test

Goal: Write a normalize_score(raw, min_val, max_val) function in core.py that maps a raw score to the 0-100 range. Write a parametrized test covering: raw equals min_val returns 0, raw equals max_val returns 100, and one midpoint. Confirm all three cases pass.

@pytest.mark.parametrize("raw,min_val,max_val,expected", [
    (0.0, 0.0, 100.0, 0.0),   # raw == min
    (100.0, 0.0, 100.0, 100.0),
    (50.0, 0.0, 100.0, 50.0),
])
def test_normalize_score(raw, min_val, max_val, expected): ...

3. Fixtures for Reusable Test Data

A fixture is a function that provides a value to a test. Any test that declares the fixture’s name as a parameter receives it automatically. Fixtures defined in conftest.py are available to every test file in the same directory and its subdirectories without an import.

# tests/conftest.py
import pandas as pd
import pytest

@pytest.fixture
def sample_df() -> pd.DataFrame:
    return pd.DataFrame({
        "student_id": ["S0001", "S0002", "S0003"],
        "midterm_score": [80.0, None, 60.0],
        "final_score": [85.0, 70.0, 55.0],
        "project_score": [90.0, 75.0, 65.0],
        "program": ["CS", "DS", "IT"],
        "passed": [True, True, False],
    })

Use it in any test file:

def test_flag_at_risk(sample_df):
    from grade_predictor.core import flag_at_risk
    result = flag_at_risk(sample_df, threshold=65.0)
    assert result.sum() == 1   # only S0003 is at risk

Fixture scope controls how often the fixture runs:

Scope	Created	Right for
`function` (default)	Before each test	Small synthetic DataFrames
`module`	Once per file	Medium-sized fixtures
`session`	Once per pytest run	Loading a large CSV from disk

For a fixture that loads university_analytics.csv, use session scope. Reading 2,400 rows once per run is fast enough; reading it 40 times (once per test) adds up.

@pytest.fixture(scope="session")
def university_df() -> pd.DataFrame:
    return pd.read_csv("data/university_analytics.csv")
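One caveat with session scope: every test receives the same DataFrame object, so an in-place edit in one test is visible to all later tests. A common way around this, sketched here with hypothetical fixture names and a small in-memory frame standing in for the CSV load, is to pair the session fixture with a function-scoped fixture that hands out a copy:

```python
import pandas as pd
import pytest


def build_sample_frame() -> pd.DataFrame:
    """Stand-in for an expensive load such as pd.read_csv on a large file."""
    return pd.DataFrame(
        {"student_id": ["S0001", "S0002"], "final_score": [85.0, 70.0]}
    )


@pytest.fixture(scope="session")
def shared_df() -> pd.DataFrame:
    # Built once per pytest run; every test receives this same object.
    return build_sample_frame()


@pytest.fixture
def fresh_df(shared_df: pd.DataFrame) -> pd.DataFrame:
    # Function scope: each test gets its own copy, so in-place edits stay local.
    return shared_df.copy()


def test_can_mutate_safely(fresh_df):
    fresh_df.loc[0, "final_score"] = 0.0
    assert fresh_df.loc[0, "final_score"] == 0.0
```

The expensive load still happens once; only the copy is repeated per test.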

Activity 2 - Session-Scoped Fixture

Goal: Write a session-scoped fixture in conftest.py that loads a 50-row sample from university_analytics.csv (use .head(50)). Write three tests that each use this fixture: one that checks column names, one that checks there are no null values in student_id, and one that confirms passed is boolean.

@pytest.fixture(scope="session")
def small_df():
    return pd.read_csv("data/university_analytics.csv").head(50)

4. Testing pandas Transforms

Four patterns cover almost every DataFrame test in DS code.

Pattern 1: Shape and column presence

def test_add_average_marks_adds_column(sample_df):
    from grade_predictor.core import add_average_marks
    result = add_average_marks(sample_df)
    assert "average_marks" in result.columns
    assert result.shape == (3, 7)           # original 6 columns + 1 new

Pattern 2: pd.testing.assert_frame_equal

def test_filter_passing_students(sample_df):
    from grade_predictor.core import filter_passing
    result = filter_passing(sample_df)
    expected = sample_df.iloc[[0, 1]].reset_index(drop=True)
    pd.testing.assert_frame_equal(result, expected, check_like=True)

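check_like=True ignores the order of columns and of index labels; rows are still matched by index label, not by position. This is usually what you want when testing a filter: the data should match, but the order may vary. Note that resetting the index on a reordered frame changes which label each row carries, so it can turn an order difference into a real mismatch.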
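A small sketch of what check_like=True does and does not ignore, using a two-row frame invented for illustration:

```python
import pandas as pd

left = pd.DataFrame(
    {"student_id": ["S0001", "S0002"], "final_score": [85.0, 70.0]}
)

# Same data, columns in a different order: passes with check_like=True.
pd.testing.assert_frame_equal(
    left, left[["final_score", "student_id"]], check_like=True
)

# Same rows in a different order, original index labels kept: also passes,
# because rows are matched by index label.
shuffled = left.iloc[[1, 0]]
pd.testing.assert_frame_equal(left, shuffled, check_like=True)

# After reset_index the labels no longer identify the original rows:
# label 0 now holds S0002, so the comparison fails.
relabelled = shuffled.reset_index(drop=True)
try:
    pd.testing.assert_frame_equal(left, relabelled, check_like=True)
    rows_matched = True
except AssertionError:
    rows_matched = False

print(rows_matched)
```

If a transform is allowed to return rows in any order and does not preserve the index, sorting both frames by a key column before comparing is a more direct way to express that.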

Pattern 3: dtype and schema assertions

def test_output_dtypes(sample_df):
    from grade_predictor.core import add_average_marks
    result = add_average_marks(sample_df)
    assert result["average_marks"].dtype == "float64"
    assert result["passed"].dtype == "bool"

Pattern 4: pytest.approx for scalar float comparisons

When testing a single computed value rather than a full DataFrame, use pytest.approx instead of ==. It handles floating-point rounding automatically:

import pytest
from grade_predictor.core import compute_grade

def test_compute_grade_weighted():
    # 0.30*80 + 0.45*85 + 0.25*90 = 24.0 + 38.25 + 22.5 = 84.75
    result = compute_grade(midterm=80.0, final=85.0, project=90.0)
    assert result == pytest.approx(84.75, rel=1e-4)

pytest.approx also works on lists and dicts of floats: assert [0.1 + 0.2, 0.3] == pytest.approx([0.3, 0.3]).

These four patterns together test the shape, the values, the types, and numeric precision. A transform that passes all four is well-specified.

Common Mistake: assert df1.equals(df2) for floats

DataFrame.equals uses exact equality. Floating-point arithmetic means 0.30 * 80 + 0.45 * 85 + 0.25 * 90 may not equal 84.75 to the last bit on all platforms. pd.testing.assert_frame_equal has check_exact=False and rtol and atol parameters for numeric tolerance. Use it for all DataFrame comparisons.
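The difference is easy to reproduce with the classic 0.1 + 0.2 case. A minimal sketch, with a one-column frame invented for illustration:

```python
import pandas as pd

computed = pd.DataFrame({"total": [0.1 + 0.2]})  # stored as 0.30000000000000004
expected = pd.DataFrame({"total": [0.3]})

# Exact comparison: the last-bit difference makes the frames unequal.
exact_match = computed.equals(expected)

# Tolerance-based comparison: passes with the default rtol/atol.
pd.testing.assert_frame_equal(computed, expected)

# The tolerance can be set explicitly when the default is not appropriate.
pd.testing.assert_frame_equal(computed, expected, check_exact=False, rtol=1e-9)

print(exact_match)
```

A second advantage of assert_frame_equal is the failure message: it names the column and the differing values, whereas equals only returns a boolean.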

Activity 3 - Test a Transform

Goal: Write an add_average_marks(df) function in core.py that adds a column average_marks = (midterm + final + project) / 3. Write three tests: shape assertion, assert_frame_equal on a manually computed expected DataFrame, and a dtype check. All three should pass.

import pandas as pd
from grade_predictor.core import add_average_marks

def test_add_average_marks_shape(sample_df):
    result = add_average_marks(sample_df)
    assert result.shape == (3, 7)   # 3 rows, 6 original columns + average_marks

5. Testing Exception Handling

A function that raises a clear exception on bad input is better than one that silently returns a wrong answer. Test the exception with pytest.raises:

def test_invalid_weights_raises():
    with pytest.raises(ValueError, match="weights must sum to 1"):
        compute_grade(80.0, 85.0, 90.0, weights=(0.5, 0.5, 0.5))

The match argument is a regular expression that pytest applies to the error message with re.search, so a plain substring works as long as it contains no regex metacharacters (use re.escape for literal text that does). Without match, the test passes for any ValueError, even one from an unrelated part of the code. With match, you confirm that the right error, with the right message, was raised.
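A sketch of the three common ways to check a raised exception. validate_weights and its message texts are hypothetical, chosen to line up with the "weights must sum to 1" message used in the test:

```python
import math
import re

import pytest


def validate_weights(weights: tuple[float, ...]) -> None:
    """Hypothetical check; the message wording is an assumption."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(weights), 1.0, abs_tol=0.001):
        raise ValueError(f"weights must sum to 1 (got {sum(weights):.2f})")


def test_sum_message_is_checked():
    # match is a regular expression searched for in str(exception).
    with pytest.raises(ValueError, match="weights must sum to 1"):
        validate_weights((0.5, 0.5, 0.5))


def test_literal_message_with_special_characters():
    # Parentheses and dots are regex metacharacters; escape literal text.
    with pytest.raises(ValueError, match=re.escape("(got 1.50)")):
        validate_weights((0.5, 0.5, 0.5))


def test_exception_details_are_available():
    # The context manager exposes the exception for further assertions.
    with pytest.raises(ValueError) as excinfo:
        validate_weights((-0.3, 0.8, 0.5))
    assert "non-negative" in str(excinfo.value)
```

Giving each failure mode its own message is what makes match useful: the negative-weight case and the wrong-sum case can then be told apart in tests.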

# tests/test_core.py
@pytest.mark.parametrize("weights", [
    (0.5, 0.5, 0.5),    # sums to 1.5
    (0.1, 0.1, 0.1),    # sums to 0.3
    (-0.3, 0.8, 0.5),   # negative weight
])
def test_invalid_weights_all_raise(weights):
    with pytest.raises(ValueError):
        compute_grade(80.0, 85.0, 90.0, weights=weights)

Activity 4 - Test an Exception

Goal: Add a ValueError to compute_grade when any weight is negative or the weights do not sum to approximately 1.0 (within 0.001). Write a parametrized test that covers three invalid cases. Confirm all three raise ValueError and that a valid call still passes.

6. Coverage

Coverage measures which lines of code are executed during the test suite. Lines that are never executed are either dead code or untested paths.

uv run pytest tests/ --cov=grade_predictor --cov-report=term-missing

The term-missing report shows which specific lines are not covered. A line number in the “missing” column means no test exercises that path.

Enforce a minimum threshold in pyproject.toml:

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "--cov=grade_predictor --cov-report=term-missing --cov-fail-under=80"

With --cov-fail-under=80, pytest exits with a non-zero code if coverage drops below 80%. CI catches this automatically.

The right target for DS library code is 80% to 85%. 100% is often counterproductive: it forces testing trivial property accessors and error messages that add no value. What matters is covering every code path that can produce a wrong answer silently.

Pro Tip: Focus coverage on the paths that matter

A missing line on a trivial return self._name property is not worth a test. A missing line on the normalization branch of a preprocessing function is. Read the “missing” column as a checklist of untested logic paths, not a number to maximize for its own sake.

7. Organizing a DS Test Suite

The standard structure for a DS project follows the test pyramid: many fast unit tests at the base, fewer integration tests in the middle, and a small number of end-to-end tests at the top.

flowchart TD
    E2E["End-to-End\nfull pipeline: CSV in, report out\nfew, slow, run in CI only"]
    INT["Integration\nmodule boundaries: pipeline + io\nsome, seconds, run pre-push"]
    UNIT["Unit\nsingle functions: compute_grade, flag_at_risk\nmany, milliseconds, run on every save"]

    UNIT -->|"build confidence"| INT
    INT -->|"build confidence"| E2E

    style E2E fill:#FEF2F2,stroke:#DC2626,color:#991B1B
    style INT fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E
    style UNIT fill:#EBF5F0,stroke:#059669,color:#065F46

tests/
├── conftest.py            # shared fixtures: sample_df, university_df
├── unit/
│   ├── test_core.py       # individual function tests
│   └── test_config.py     # settings and environment variable loading
└── integration/
    └── test_pipeline.py   # end-to-end: load CSV, transform, compute grades

Unit tests test one function in isolation with synthetic data. They run in milliseconds and should always run. Integration tests test the full pipeline with real data from disk. They run in seconds and should run in CI and before releases.

Run only unit tests locally during development, all tests in CI:

uv run pytest tests/unit/ -v --override-ini=addopts=   # fast, local: skip coverage and its threshold
uv run pytest tests/ -v                                 # full suite, CI: addopts from pyproject.toml apply

8. Mocking External Dependencies

Some code cannot be tested with real data: a function that calls an API, reads from a database, or writes to a file. Mocking replaces that dependency with a controlled substitute for the duration of the test.

unittest.mock.patch is the standard tool. It temporarily replaces an object at a given import path with a MagicMock that you control:

# grade_predictor/io.py
import requests

def fetch_course_list(api_url: str) -> list[dict]:
    """Fetch the current course list from the university API."""
    response = requests.get(api_url)
    response.raise_for_status()
    return response.json()

Testing this without hitting the network:

# tests/unit/test_io.py
from unittest.mock import patch, MagicMock
from grade_predictor.io import fetch_course_list

def test_fetch_course_list_returns_parsed_json():
    mock_response = MagicMock()
    mock_response.json.return_value = [{"course_id": "C01", "name": "Stats"}]
    mock_response.raise_for_status.return_value = None

    with patch("grade_predictor.io.requests.get", return_value=mock_response) as mock_get:
        result = fetch_course_list("https://api.example.com/courses")

    mock_get.assert_called_once_with("https://api.example.com/courses")
    assert result == [{"course_id": "C01", "name": "Stats"}]

Key Concept: Patch at the import location, not the definition location

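patch replaces a name in the namespace where the code under test looks it up. Because io.py does import requests and calls requests.get(...) at call time, "grade_predictor.io.requests.get" and "requests.get" resolve to the same attribute, and either target works here. The rule bites when a module does from requests import get: the module then holds its own reference to the original function, patching "requests.get" has no effect on it, and the target must be "grade_predictor.io.get". Writing the target through the module under test, as above, keeps the habit consistent. This is the single most common mock mistake.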
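The lookup rule can be demonstrated without any network library. The sketch below builds three throwaway in-memory modules; fake_requests, io_attr_style, and io_from_style are hypothetical stand-ins for requests and for two ways grade_predictor.io could import it:

```python
import sys
import types
from unittest.mock import patch


def make_module(name: str, source: str) -> types.ModuleType:
    """Create an in-memory module and register it so patch() can import it."""
    module = types.ModuleType(name)
    sys.modules[name] = module
    exec(source, module.__dict__)
    return module


make_module("fake_requests", "def get(url):\n    return 'real response'\n")
io_attr = make_module(
    "io_attr_style",
    "import fake_requests\n"
    "def fetch(url):\n    return fake_requests.get(url)\n",
)
io_from = make_module(
    "io_from_style",
    "from fake_requests import get\n"
    "def fetch(url):\n    return get(url)\n",
)

with patch("fake_requests.get", return_value="mocked"):
    # `import fake_requests` style: the attribute is looked up at call time,
    # so the patch on the library attribute is seen.
    attr_result = io_attr.fetch("https://api.example.com/courses")
    # `from fake_requests import get` style: the module holds its own
    # reference, so the same patch has no effect on it.
    from_result_wrong_target = io_from.fetch("https://api.example.com/courses")

# Patching the name inside the module under test works for the from-import style.
with patch("io_from_style.get", return_value="mocked"):
    from_result_right_target = io_from.fetch("https://api.example.com/courses")

print(attr_result, from_result_wrong_target, from_result_right_target)
```

The middle result is the failure mode to watch for: the test runs, the mock is never called, and the real function executes.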

For pandas-based code, mock the file read so tests do not depend on a real CSV being present:

from unittest.mock import patch
import pandas as pd

def test_load_students_returns_dataframe():
    fake_csv = pd.DataFrame({
        "student_id": ["S0001", "S0002"],
        "final_score": [85.0, 72.0],
    })

    with patch("grade_predictor.io.pd.read_csv", return_value=fake_csv):
        from grade_predictor.io import load_students
        result = load_students("any/path.csv")

    assert len(result) == 2
    assert "final_score" in result.columns

pytest-mock is a thin wrapper that makes the same pattern slightly cleaner as a fixture:

# pyproject.toml
[project.optional-dependencies]
test = ["pytest", "pytest-cov", "pytest-mock"]

# Same test with pytest-mock's mocker fixture
from grade_predictor.io import fetch_course_list

def test_fetch_course_list_with_mocker(mocker):
    mock_get = mocker.patch("grade_predictor.io.requests.get")
    mock_get.return_value.json.return_value = [{"course_id": "C01"}]
    mock_get.return_value.raise_for_status.return_value = None

    result = fetch_course_list("https://api.example.com/courses")
    assert result[0]["course_id"] == "C01"

Common Mistake: Over-mocking

If a test mocks every function the code under test calls, it is no longer testing the code: it is testing the mock. A unit test should run the real logic and mock only the I/O boundary: network calls, file reads, database queries. A function that computes a grade from numbers does not need any mocking. A function that reads grades from an API does. Mock the API; run the computation logic for real.

Activity 5 - Mock a CSV Load

Goal: Write a load_grades(path: str) -> pd.DataFrame function in grade_predictor/io.py that calls pd.read_csv(path). Write a test that mocks pd.read_csv to return a two-row DataFrame, calls load_grades("any/path.csv"), and asserts the result has two rows without reading a real file.

from unittest.mock import patch
import pandas as pd
from grade_predictor.io import load_grades

def test_load_grades_mocked():
    fake_df = pd.DataFrame({"student_id": ["S0001", "S0002"], "score": [80.0, 70.0]})
    with patch("grade_predictor.io.pd.read_csv", return_value=fake_df):
        result = load_grades("any/path.csv")
    assert len(result) == 2

9. Modern Built-in Fixtures

pytest ships a set of built-in fixtures; the three covered here handle a large class of I/O and environment problems without any extra package. They require no import in your test file.

tmp_path: isolated temporary directories

Any test that writes to disk should write to a temporary directory that is isolated from other tests and from the project tree. tmp_path provides a pathlib.Path object pointing to a unique directory created for each test:

def test_pipeline_writes_output(tmp_path):
    from grade_predictor.pipeline import run_pipeline
    output_file = tmp_path / "results.csv"
    run_pipeline(output_path=output_file)
    assert output_file.exists()
    assert output_file.stat().st_size > 0

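No manual cleanup is needed: pytest manages these directories itself. By default it keeps the directories from the three most recent runs, which is useful for inspecting output after a failure, and removes older ones.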

monkeypatch: patching environment variables and objects

monkeypatch is pytest’s built-in alternative to unittest.mock.patch for patching attributes, environment variables, and dictionary entries. Changes are automatically undone after each test:

def test_pipeline_config_from_env(monkeypatch):
    monkeypatch.setenv("MODEL_THRESHOLD", "0.75")
    monkeypatch.setenv("DATA_PATH", "tests/fixtures/sample.csv")
    from grade_predictor.config import PipelineConfig
    cfg = PipelineConfig()
    assert cfg.model_threshold == 0.75

Use monkeypatch.setattr to patch any object attribute: monkeypatch.setattr(module, "function_name", mock_fn).
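For monkeypatch.setenv to have any effect, the code under test has to read the environment after the test has set it. A module-level constant such as THRESHOLD = float(os.environ[...]) is evaluated at import time, which is usually too early. A sketch of a config class that reads the variable at instantiation; this PipelineConfig and its 0.5 default are assumptions, not the project's actual class:

```python
import os
from dataclasses import dataclass, field


def _threshold_from_env() -> float:
    # Called each time a PipelineConfig is created, not at import time.
    return float(os.environ.get("MODEL_THRESHOLD", "0.5"))


@dataclass
class PipelineConfig:
    """Hypothetical config; the 0.5 default is an assumption for this sketch."""

    model_threshold: float = field(default_factory=_threshold_from_env)


def test_threshold_from_env(monkeypatch):
    monkeypatch.setenv("MODEL_THRESHOLD", "0.75")
    assert PipelineConfig().model_threshold == 0.75


def test_threshold_default_when_unset(monkeypatch):
    # raising=False: do not fail if the variable is already absent.
    monkeypatch.delenv("MODEL_THRESHOLD", raising=False)
    assert PipelineConfig().model_threshold == 0.5
```

The second test matters as much as the first: without delenv, its result would depend on whatever happens to be set in the developer's shell or the CI runner.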

capfd: capturing stdout and stderr

capfd captures file-descriptor-level output, including output from C extensions and subprocesses:

def test_report_prints_summary(capfd):
    from grade_predictor.report import print_summary
    print_summary(pass_rate=0.82, total=120)
    out, err = capfd.readouterr()
    assert "82%" in out
    assert err == ""

Pro Tip: Run tests in parallel with pytest-xdist

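Install pytest-xdist and run with -n auto to distribute tests across the available CPU cores. The speedup depends on core count and how evenly the tests split; as an illustration, a suite that takes 45 seconds in a single process might finish in about 12 seconds. Tests must not share mutable state, whether in globals or in files on disk, and session-scoped fixtures run once per worker process rather than once overall.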

uv add --dev pytest-xdist
uv run pytest tests/ -n auto

Activity 6 - Built-in Fixtures

Goal: Write three tests, one per fixture. (1) Use tmp_path to write a small CSV with pd.DataFrame.to_csv and assert the file exists. (2) Use monkeypatch.setenv to set MODEL_THRESHOLD=0.9 and assert your PipelineConfig reads it correctly. (3) Use capfd to capture the output of a function that prints a summary line and assert the expected string appears in out.

Capstone - A Working Test Suite

Write a complete test suite for grade-predictor.

Capstone - 80% Coverage

Write a test suite that achieves at least 80% coverage. It must include:

One parametrized test covering all five grade letter boundaries
A session-scoped fixture that loads a sample from university_analytics.csv
One assert_frame_equal test on a DataFrame transform
One pytest.raises test on compute_grade with invalid weights
Coverage configured in pyproject.toml with --cov-fail-under=80

uv run pytest tests/ --cov=grade_predictor --cov-report=term-missing

Resource	Why it matters
BetterStack, pytest guide	Comprehensive walkthrough of parametrize, fixtures, and plugins
pydevtools, pytest + uv	The exact uv + pytest integration workflow
pandas testing utilities	`assert_frame_equal` and `assert_series_equal` reference
pytest-cov documentation	Coverage measurement and HTML report generation
pandera documentation	Schema-level DataFrame validation: the next step after dtype assertions

Concept	Key rule
Test function naming	`test_<what_it_does>`. Describes behavior, not implementation.
`parametrize`	One test body, many inputs. Boundary cases go in the table, not in separate functions.
Fixtures	`conftest.py` for shared data. Session scope for large CSV files.
`assert_frame_equal`	Never `df1.equals(df2)` for floats. Numeric tolerance is built in.
Coverage	80% target. Focus on silent-failure paths, not trivial lines.
`patch("module.name")`	Patch at the import location, not the definition location.
Mock only I/O boundaries	Network, file reads, databases. Real logic runs for real.
`pytest-mock` mocker fixture	Cleaner syntax for `patch` as a fixture; install as a test dependency.