Data Schema Validation with Pandera

DS-MLOps Dev Tools

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook assumes you have completed Part 19: Data Validation with Pydantic. Pandera extends the same “validate at the boundary” principle to DataFrames: where Pydantic validates individual records, Pandera validates the schema of an entire table.

The grade-predictor project continues here: a Pandera schema replaces the implicit assumptions about university_analytics.csv with explicit, testable contracts.

Callout markers used throughout this notebook are explained on the book cover page.

Learning Objectives

By the end of Part 20 you will be able to:

#	Skill	Covered in
1	Explain the difference between row-level (Pydantic) and schema-level (Pandera) validation	Sec. 1
2	Define a Pandera `DataFrameSchema` with column types and constraints	Sec. 2
3	Use the class-based API with `pa.DataFrameModel` for typed schemas	Sec. 3
4	Write custom element-wise and series-level checks	Sec. 4
5	Validate DataFrames in a pipeline and collect errors without stopping	Sec. 5
6	Use Pandera schemas as pytest fixtures to document data contracts	Sec. 6

0. Pydantic Validated the Row. Who Validates the Table?

You have a StudentRecord Pydantic model. It validates that midterm_score is between 0 and 100, that student_id matches S\d{4}, and that program is a non-empty string. Pydantic runs when a single record enters the system.

Now you load university_analytics.csv into a DataFrame. There are 2,400 rows. You could loop through them with StudentRecord.model_validate, but that tells you nothing about the table: whether student_id is unique across all rows, whether the distribution of program values matches what you expect, whether the proportion of missing values in has_internet is within an acceptable bound. A row validator answers “is this row correct?”. A schema validator answers “is this table correct?”.
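
To make the distinction concrete, here is a minimal sketch in plain pandas. It uses a hypothetical four-row stand-in for the CSV (the values are invented for illustration). None of the three quantities below can be computed from a single row, which is why a per-record validator cannot check them:

```python
import pandas as pd

# Hypothetical four-row stand-in for university_analytics.csv
df = pd.DataFrame(
    {
        "student_id": ["S0001", "S0002", "S0002", "S0004"],
        "program": ["CS", "EE", "CS", "Math"],
        "has_internet": [True, None, False, True],
    }
)

# Table-level questions: each one needs to look across rows
ids_unique = df["student_id"].is_unique            # is any id repeated?
missing_share = df["has_internet"].isna().mean()   # share of missing values
program_counts = df["program"].value_counts().to_dict()

print("student_id unique:", ids_unique)
print("missing has_internet:", missing_share)
print("program counts:", program_counts)
```

A Pandera schema lets you state expectations like these declaratively instead of scattering ad hoc pandas checks through the pipeline.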

Pandera (pandera.readthedocs.io) is a statistical data validation library for DataFrames. You define a schema (column types, constraints, uniqueness, allowed values, statistical properties), and Pandera checks the whole DataFrame against it in one call. It supports pandas and Polars, integrates with pytest, and can generate synthetic data for testing.

Install

uv add pandera          # or: pip install pandera

1. Row Validation vs Schema Validation

flowchart LR
    CSV["university_analytics.csv"] --> DF["pandas DataFrame\n2,400 rows × 10 cols"]
    DF --> PA["Pandera schema\nColumn types, constraints,\nuniqueness, statistics"]
    PA -->|"schema violation"| ERR["SchemaError\ncell location + rule"]
    PA -->|"all checks pass"| OK["validated DataFrame\nsafe to use downstream"]

    style OK fill:#EBF5F0,stroke:#059669,color:#065F46
    style ERR fill:#FEF2F2,stroke:#DC2626,color:#991B1B
    style PA fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E

Pydantic and Pandera answer different questions:

	Pydantic `BaseModel`	Pandera `DataFrameSchema`
Unit	One record (row)	Whole DataFrame
Checks	Type coercion, field constraints, cross-field	Column dtype, nullability, uniqueness, value ranges, statistical bounds
Returns	Validated model instance	Validated DataFrame
Error info	Field path + message	Row index + column + failed check
Best for	API inputs, config objects	CSVs, pipeline data, feature tables

Key Concept: Use both, not one

Pydantic and Pandera are not alternatives; they validate at different levels. A production pipeline uses Pydantic to validate records as they arrive (JSON from an API, rows from a queue) and Pandera to validate the assembled DataFrame before it enters a model or a write operation. Together they form a complete contract on the data.

2. Defining a `DataFrameSchema`

The functional API creates a schema by describing each column with pa.Column:

import numpy as np
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "student_id": pa.Column(str, pa.Check.str_matches(r"^S\d{4}$")),
        "midterm_score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(100.0)]),
        "final_score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(100.0)]),
        "project_score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(100.0)]),
        "program": pa.Column(str, pa.Check.isin(["CS", "EE", "Math", "Physics", "Biology"])),
        "has_internet": pa.Column(bool),
        "school_id": pa.Column(str, nullable=True),
        "teacher_count": pa.Column(int, pa.Check.ge(1)),
        "school_size": pa.Column(str, pa.Check.isin(["Small", "Medium", "Large"])),
        "pass_threshold": pa.Column(float, pa.Check.between(50.0, 80.0), nullable=True),
    },
    checks=[
        pa.Check(lambda df: df["student_id"].is_unique, error="student_id must be unique"),
    ],
    coerce=True,
)

print("Schema created with", len(schema.columns), "columns")

Schema created with 10 columns

coerce=True tells Pandera to attempt type coercion before validation: "78.5" becomes 78.5 for a float column, mirroring Pydantic’s behaviour at the row level.

Validate a DataFrame:

# Build a small valid sample
sample_df = pd.DataFrame(
    {
        "student_id": ["S0001", "S0002", "S0003"],
        "midterm_score": [78.5, 65.0, 90.0],
        "final_score": [82.0, 70.0, 88.0],
        "project_score": [91.0, 75.0, 85.0],
        "program": ["CS", "EE", "Math"],
        "has_internet": [True, False, True],
        "school_id": ["SCH01", "SCH01", "SCH02"],
        "teacher_count": [12, 8, 15],
        "school_size": ["Large", "Medium", "Large"],
        "pass_threshold": [60.0, 60.0, 65.0],
    }
)

validated = schema.validate(sample_df)
print("Validation passed:", validated.shape)

Validation passed: (3, 10)

# Introduce an invalid row (midterm_score > 100)
bad_df = sample_df.copy()
bad_df.loc[0, "midterm_score"] = 150.0

try:
    schema.validate(bad_df)
except pa.errors.SchemaError as e:
    print(type(e).__name__)
    print(e)

SchemaError
Column 'midterm_score' failed element-wise validator number 1: less_than_or_equal_to(100.0) failure cases: 150.0

Activity 1 - Duplicate student_id

Goal: Create a DataFrame where two rows share the same student_id. Run schema.validate(df) and confirm it raises a SchemaError with a message about uniqueness.

dup_df = sample_df.copy()
dup_df.loc[1, "student_id"] = "S0001"  # duplicate
schema.validate(dup_df)  # should raise

3. `DataFrameModel`: The Class-Based API

The functional DataFrameSchema API is explicit but verbose. Pandera’s class-based API, pa.DataFrameModel, mirrors Pydantic’s BaseModel: define columns as class-level fields with type annotations, and use pa.Field for constraints:

import pandera as pa
from pandera.typing import Series


class StudentDataSchema(pa.DataFrameModel):
    student_id: Series[str] = pa.Field(str_matches=r"^S\d{4}$")
    midterm_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    final_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    project_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    program: Series[str] = pa.Field(isin=["CS", "EE", "Math", "Physics", "Biology"])
    has_internet: Series[bool]
    school_id: Series[str] | None = pa.Field(nullable=True)
    teacher_count: Series[int] = pa.Field(ge=1)
    school_size: Series[str] = pa.Field(isin=["Small", "Medium", "Large"])

    class Config:
        coerce = True

    @pa.check("student_id")
    def student_id_unique(cls, series: Series[str]) -> bool:  # noqa: N805
        return series.is_unique


validated = StudentDataSchema.validate(sample_df)
print("Class-based validation passed:", validated.shape)

Class-based validation passed: (3, 10)

The @pa.check decorator attaches a column-level check as a classmethod. The check receives the entire Series and returns either a single boolean (one verdict for the whole column) or a boolean Series (one verdict per row, which lets Pandera report the rows that failed).

Pro Tip: Use DataFrameModel for reuse, DataFrameSchema for quick scripts

DataFrameModel is easier to subclass, document, and test: it reads like a dataclass and fits naturally alongside Pydantic models. DataFrameSchema is useful when you want to build a schema programmatically at runtime, e.g., from a config file or database metadata.

Activity 2 - Extend the Schema

Goal: Subclass StudentDataSchema and add a semester column constrained to ["Fall", "Spring", "Summer"]. Validate a DataFrame that includes the column with valid values, then one that has an invalid value. Confirm only the second raises a SchemaError.

class ExtendedSchema(StudentDataSchema):
    semester: Series[str] = pa.Field(isin=["Fall", "Spring", "Summer"])

4. Custom Checks

Built-in checks cover ranges, string patterns, and allowed values, while nullability and uniqueness are set per column. Custom checks handle business rules that require more logic:

import pandas as pd
import pandera as pa
from pandera.typing import Series


class GradeSchema(pa.DataFrameModel):
    midterm_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    final_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    project_score: Series[float] = pa.Field(ge=0.0, le=100.0)

    @pa.check("midterm_score", name="not_all_perfect")
    def not_all_perfect_scores(cls, series: Series[float]) -> bool:  # noqa: N805
        return (series == 100.0).mean() < 0.5  # less than 50% perfect scores

    @pa.dataframe_check
    def weighted_average_reasonable(cls, df: pd.DataFrame) -> bool:  # noqa: N805
        avg = df["midterm_score"] * 0.3 + df["final_score"] * 0.45 + df["project_score"] * 0.25
        return (avg >= 0.0).all() and (avg <= 100.0).all()


# Validate the sample
result = GradeSchema.validate(sample_df[["midterm_score", "final_score", "project_score"]])
print("Custom checks passed:", result.shape)

Custom checks passed: (3, 3)

@pa.check("column_name") adds a column-level check: return a single boolean (the whole series passes or fails) or a boolean Series (one result per element). Pass error="..." to the decorator to customise the failure message.

@pa.dataframe_check gets the whole DataFrame: use it for cross-column constraints.

Activity 3 - Cross-Column Check

Goal: Add a @pa.dataframe_check to GradeSchema that verifies midterm_score and final_score are not both 0 for the same student (a student with both scores at 0 is almost certainly an error, not a result). Confirm it passes on valid data and fails when you introduce a row with both at 0.

@pa.dataframe_check
def not_both_zero(cls, df: pd.DataFrame) -> pd.Series:
    return ~((df["midterm_score"] == 0) & (df["final_score"] == 0))

5. Validation in a Pipeline

In a pipeline, you want to validate without crashing on the first error, collect all failures, and then decide what to do with them:

import pandas as pd
import pandera as pa


def load_and_validate(
    path: str,
    schema: type[pa.DataFrameModel],
    *,
    lazy: bool = True,
) -> tuple[pd.DataFrame, list[dict]]:
    df = pd.read_csv(path)

    if not lazy:
        return schema.validate(df, lazy=False), []

    try:
        return schema.validate(df, lazy=True), []
    except pa.errors.SchemaErrors as e:
        error_df = e.failure_cases
        errors = error_df.to_dict(orient="records")
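        # Drop every row that has at least one row-level failure. Failures with
        # no row index (e.g. a missing column) cannot be fixed by dropping rows.
        bad_idx = set(error_df["index"].dropna().astype(int).tolist())
        # clean_df is built from the raw frame, so it has not been coerced.
        clean_df = df.drop(index=list(bad_idx)).reset_index(drop=True)
        return clean_df, errors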


# Use it
try:
    clean, errors = load_and_validate(
        "tutorials/data/university_analytics.csv",
        StudentDataSchema,
    )
    print(f"Clean rows: {len(clean)}, Errors: {len(errors)}")
    if errors:
        print("First error:", errors[0])
except FileNotFoundError:
    print("CSV not found in this context; run from repo root")

CSV not found in this context; run from repo root

Key Concept: lazy=True collects all errors; lazy=False stops on the first

schema.validate(df, lazy=False) (the default) raises a SchemaError on the first failure: fast and clear for development. schema.validate(df, lazy=True) collects every failure and raises a SchemaErrors (note the plural) at the end, better for production, where you want a full error report rather than a partial run.

Activity 4 - Full Error Report

Goal: Introduce three invalid rows into sample_df: one with midterm_score=150.0, one with an invalid program, and one with a duplicate student_id. Call StudentDataSchema.validate(bad_df, lazy=True). Catch the SchemaErrors exception and print the failure_cases DataFrame showing all three failures at once.

6. Schemas as Data Contracts in Tests

A Pandera schema is documentation that runs. Put it in a pytest fixture and every test that touches a DataFrame automatically validates the contract:

# tests/test_data_contracts.py  (save and run: uv run pytest tests/)

# ── paste into a test file and run with pytest ──
import pandas as pd
import pandera as pa
from pandera.typing import Series
import pytest


class StudentDataSchema(pa.DataFrameModel):
    student_id: Series[str] = pa.Field(str_matches=r"^S\d{4}$")
    midterm_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    final_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    project_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    program: Series[str] = pa.Field(isin=["CS", "EE", "Math", "Physics", "Biology"])
    has_internet: Series[bool]
    teacher_count: Series[int] = pa.Field(ge=1)
    school_size: Series[str] = pa.Field(isin=["Small", "Medium", "Large"])

    class Config:
        coerce = True


@pytest.fixture
def valid_df():
    return pd.DataFrame(
        {
            "student_id": ["S0001", "S0002"],
            "midterm_score": [78.5, 65.0],
            "final_score": [82.0, 70.0],
            "project_score": [91.0, 75.0],
            "program": ["CS", "EE"],
            "has_internet": [True, False],
            "teacher_count": [12, 8],
            "school_size": ["Large", "Medium"],
        }
    )


def test_valid_data_passes_schema(valid_df):
    validated = StudentDataSchema.validate(valid_df)
    assert len(validated) == 2  # noqa: S101


def test_invalid_score_raises(valid_df):
    bad = valid_df.copy()
    bad.loc[0, "midterm_score"] = 150.0
    with pytest.raises(pa.errors.SchemaError):
        StudentDataSchema.validate(bad)


def test_invalid_program_raises(valid_df):
    bad = valid_df.copy()
    bad.loc[0, "program"] = "Underwater Basket Weaving"
    with pytest.raises(pa.errors.SchemaError):
        StudentDataSchema.validate(bad)


# Demonstration (not a real pytest run)
print("Tests defined. Run with: uv run pytest tests/test_data_contracts.py -v")

Pro Tip: Use pa.DataFrameModel.example() to generate test data automatically

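Pandera can synthesise valid DataFrames from a schema: StudentDataSchema.example(size=50) generates 50 rows that satisfy the schema's built-in constraints. Data synthesis relies on the hypothesis library, so install the strategies extra first (uv add "pandera[strategies]"). Custom checks have no generation strategy by default, so schemas that depend on them may generate slowly or fail to produce data. For schemas built from built-in checks, this removes the need to hand-craft test fixtures for every new schema.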

Activity 5 - Schema-Driven Test Data

Goal: Call StudentDataSchema.example(size=20) to generate 20 synthetic rows. Confirm that the generated DataFrame passes StudentDataSchema.validate() without errors. Then confirm that if you corrupt one cell (set a score to 200), validation fails.

synthetic = StudentDataSchema.example(size=20)
StudentDataSchema.validate(synthetic)  # should pass

Capstone: Data Contract for grade-predictor

Bring Pandera into the grade-predictor pipeline as a first-class data contract.

Capstone - End-to-End Validated Pipeline

Define StudentDataSchema in grade_predictor/schemas.py covering all columns of university_analytics.csv
Update load_and_validate (from Part 19) to run Pandera schema validation after row-level Pydantic validation
Add a @pa.dataframe_check that verifies the computed weighted average (using the weights from PipelineConfig) falls in [0, 100] for every row
Write three tests: one that the real CSV passes the schema, one that a DataFrame with a bad score raises SchemaError, and one that uses example() to generate synthetic data and confirms it passes
Run uv run pytest -v and confirm all three pass

7. Why Pandera? Comparing Schema Validation Tools

Pandera is now a common choice for DataFrame validation in Python, but several other tools tackle the same problem in different ways. Understanding the landscape helps you choose the right tool for your context.

Key Concept: Schema validation is not the same as data cleaning

Schema validation answers a yes/no question: does this DataFrame conform to the contract? It runs fast and fails loudly. Data cleaning transforms and repairs data. The two serve different purposes: validate at the pipeline boundary, clean before you get there.

Tool	Best for	Limitation
Pandera	DataFrame contracts in Python pipelines	pandas/polars-specific; not for arbitrary dicts
Great Expectations	Enterprise data quality suites, HTML reports, data docs	Heavy setup, complex configuration, overkill for most ML work
Pydantic	Row-level validation, API payloads, config models	Not designed for tabular data; looping over rows is slow
Frictionless Data	YAML-defined schemas, cross-language contracts	Less Python-native; validation logic lives in config files
Cerberus / marshmallow	Dict and object validation	No DataFrame support; designed for records not tables

The practical split comes down to scope. Pydantic is the right choice when you are validating individual records at a system boundary: an API payload, a config file, a single row entering a pipeline. Pandera is the right choice when you are validating the structure and statistics of an entire DataFrame. They are complementary, not competing: Part 19 showed you Pydantic for the row, this notebook shows you Pandera for the table.

Great Expectations is powerful but designed for data engineering teams who need data documentation, alerting, and audit trails across multiple data sources. For most ML projects, Pandera delivers most of that value with a fraction of the setup.

Pro Tip: Use both Pydantic and Pandera in the same pipeline

Validate incoming JSON records with Pydantic as they arrive, then validate the assembled DataFrame with Pandera before it enters any model or transformation. The two checks run at different granularities and catch different classes of bugs: Pydantic catches a bad individual record, Pandera catches drift in the distribution across hundreds of records.

Resource	Why it matters
Pandera documentation	Full reference: DataFrameSchema, DataFrameModel, check types, Polars support
Pandera + Polars	Same API works on Polars DataFrames
Pandera pytest integration	`@pa.check_types` decorator for function-level schema enforcement
Pydantic + Pandera together	Use a Pandera schema as a Pydantic field type
Great Expectations	Enterprise-grade data quality platform; Pandera for code-first, GX for teams with data catalogs

Summary

Concept	Key rule
`DataFrameSchema`	Functional API: describe columns, constraints, table-level checks as a dict
`DataFrameModel`	Class-based API: columns as annotated fields, checks as classmethods
`pa.Field(ge=, le=)`	Same constraint vocabulary as Pydantic’s `Field`
`pa.Check.isin([...])`	Restrict a column to an explicit set of values
`@pa.check("col")`	Column-level custom check: receives the Series, returns bool or bool Series
`@pa.dataframe_check`	Cross-column check: receives the full DataFrame
`lazy=True`	Collect all failures; raises `SchemaErrors` (plural)
`lazy=False`	Stop on first failure; raises `SchemaError`
`schema.example(size=n)`	Generate n rows of valid synthetic data for testing
Pydantic + Pandera	Validate rows with Pydantic, validate the assembled table with Pandera

Next: Part 21: Classical ML: Scikit-learn pipelines applied to the fully validated, typed grade-predictor dataset.