Data Schema Validation with Pandera

DS-MLOps Dev Tools

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook assumes you have completed Part 19: Data Validation with Pydantic. Pandera extends the same “validate at the boundary” principle to DataFrames: where Pydantic validates individual records, Pandera validates the schema of an entire table.

The grade-predictor project continues here: a Pandera schema replaces the implicit assumptions about university_analytics.csv with explicit, testable contracts.

Callout markers used throughout this notebook are explained on the book cover page.

By the end of Part 20 you will be able to:

# | Skill | Covered in
1 | Explain the difference between row-level (Pydantic) and schema-level (Pandera) validation | Sec. 1
2 | Define a Pandera DataFrameSchema with column types and constraints | Sec. 2
3 | Use the class-based API with pa.DataFrameModel for typed schemas | Sec. 3
4 | Write custom element-wise and series-level checks | Sec. 4
5 | Validate DataFrames in a pipeline and collect errors without stopping | Sec. 5
6 | Use Pandera schemas as pytest fixtures to document data contracts | Sec. 6

0. Pydantic Validated the Row. Who Validates the Table?

You have a StudentRecord Pydantic model. It validates midterm_score is between 0 and 100, that student_id matches S\d{4}, and that program is a non-empty string. Pydantic runs when a single record enters the system.

Now you load university_analytics.csv into a DataFrame. There are 2,400 rows. You could loop through them with StudentRecord.model_validate, but that tells you nothing about the table: whether student_id is unique across all rows, whether the distribution of program values matches what you expect, whether the proportion of missing values in has_internet is within an acceptable bound. A row validator answers “is this row correct?”. A schema validator answers “is this table correct?”.
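The gap between the two questions can be sketched in plain Python, with no Pydantic or Pandera involved. The helper names `row_is_valid` and `duplicated_ids` are hypothetical and exist only for this illustration; the rules mirror the StudentRecord constraints described above.

```python
import re
from collections import Counter

ID_PATTERN = re.compile(r"^S\d{4}$")


def row_is_valid(row: dict) -> bool:
    """Row-level question: is this single record well formed?"""
    return (
        bool(ID_PATTERN.match(row["student_id"]))
        and 0.0 <= row["midterm_score"] <= 100.0
        and bool(row["program"])
    )


def duplicated_ids(rows: list[dict]) -> list[str]:
    """Table-level question: which student_id values occur more than once?"""
    counts = Counter(row["student_id"] for row in rows)
    return sorted(sid for sid, n in counts.items() if n > 1)


rows = [
    {"student_id": "S0001", "midterm_score": 78.5, "program": "CS"},
    {"student_id": "S0002", "midterm_score": 65.0, "program": "EE"},
    {"student_id": "S0001", "midterm_score": 90.0, "program": "Math"},  # repeated id
]

print(all(row_is_valid(r) for r in rows))  # True: every row passes on its own
print(duplicated_ids(rows))                # ['S0001']: the table is still wrong
```

Every record is individually valid, yet the table violates the uniqueness rule. No amount of per-row checking can detect that, because the defect only exists across rows.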

Pandera (pandera.readthedocs.io) is a statistical data validation library for DataFrames. You define a schema (column types, constraints, uniqueness, allowed values, statistical properties) and Pandera checks the whole DataFrame against it in one call. It supports pandas and Polars, integrates with pytest, and can generate synthetic data for testing.

Install

uv add pandera          # or: pip install pandera

1. Row Validation vs Schema Validation

flowchart LR
    CSV["university_analytics.csv"] --> DF["pandas DataFrame\n2,400 rows × 10 cols"]
    DF --> PA["Pandera schema\nColumn types, constraints,\nuniqueness, statistics"]
    PA -->|"schema violation"| ERR["SchemaError\ncell location + rule"]
    PA -->|"all checks pass"| OK["validated DataFrame\nsafe to use downstream"]

    style OK fill:#EBF5F0,stroke:#059669,color:#065F46
    style ERR fill:#FEF2F2,stroke:#DC2626,color:#991B1B
    style PA fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E

Pydantic and Pandera answer different questions:

Aspect | Pydantic BaseModel | Pandera DataFrameSchema
Unit | One record (row) | Whole DataFrame
Checks | Type coercion, field constraints, cross-field | Column dtype, nullability, uniqueness, value ranges, statistical bounds
Returns | Validated model instance | Validated DataFrame
Error info | Field path + message | Row index + column + failed check
Best for | API inputs, config objects | CSVs, pipeline data, feature tables

Key Concept: Use both, not one

Pydantic and Pandera are not alternatives; they validate at different levels. A production pipeline uses Pydantic to validate records as they arrive (JSON from an API, rows from a queue) and Pandera to validate the assembled DataFrame before it enters a model or a write operation. Together they form a complete contract on the data.

2. Defining a DataFrameSchema

The functional API creates a schema by describing each column with pa.Column:

import numpy as np
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "student_id": pa.Column(str, pa.Check.str_matches(r"^S\d{4}$")),
        "midterm_score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(100.0)]),
        "final_score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(100.0)]),
        "project_score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(100.0)]),
        "program": pa.Column(str, pa.Check.isin(["CS", "EE", "Math", "Physics", "Biology"])),
        "has_internet": pa.Column(bool),
        "school_id": pa.Column(str, nullable=True),
        "teacher_count": pa.Column(int, pa.Check.ge(1)),
        "school_size": pa.Column(str, pa.Check.isin(["Small", "Medium", "Large"])),
        "pass_threshold": pa.Column(float, pa.Check.between(50.0, 80.0), nullable=True),
    },
    checks=[
        pa.Check(lambda df: df["student_id"].is_unique, error="student_id must be unique"),
    ],
    coerce=True,
)

print("Schema created with", len(schema.columns), "columns")
Schema created with 10 columns

coerce=True tells Pandera to attempt type coercion before validation: "78.5" becomes 78.5 for a float column, mirroring Pydantic’s behaviour at the row level.

Validate a DataFrame:

# Build a small valid sample
sample_df = pd.DataFrame(
    {
        "student_id": ["S0001", "S0002", "S0003"],
        "midterm_score": [78.5, 65.0, 90.0],
        "final_score": [82.0, 70.0, 88.0],
        "project_score": [91.0, 75.0, 85.0],
        "program": ["CS", "EE", "Math"],
        "has_internet": [True, False, True],
        "school_id": ["SCH01", "SCH01", "SCH02"],
        "teacher_count": [12, 8, 15],
        "school_size": ["Large", "Medium", "Large"],
        "pass_threshold": [60.0, 60.0, 65.0],
    }
)

validated = schema.validate(sample_df)
print("Validation passed:", validated.shape)
Validation passed: (3, 10)

# Introduce an invalid row (midterm_score > 100)
bad_df = sample_df.copy()
bad_df.loc[0, "midterm_score"] = 150.0

try:
    schema.validate(bad_df)
except pa.errors.SchemaError as e:
    print(type(e).__name__)
    print(e)
SchemaError
Column 'midterm_score' failed element-wise validator number 1: less_than_or_equal_to(100.0) failure cases: 150.0

Activity 1 - Duplicate student_id

Goal: Create a DataFrame where two rows share the same student_id. Run schema.validate(df) and confirm it raises a SchemaError with a message about uniqueness.

dup_df = sample_df.copy()
dup_df.loc[1, "student_id"] = "S0001"  # duplicate
schema.validate(dup_df)  # should raise

3. DataFrameModel: The Class-Based API

The functional DataFrameSchema API is explicit but verbose. Pandera’s class-based API, pa.DataFrameModel, mirrors Pydantic’s BaseModel: define columns as class-level fields with type annotations, and use pa.Field for constraints:

from typing import Optional

import pandera as pa
from pandera.typing import Series


class StudentDataSchema(pa.DataFrameModel):
    student_id: Series[str] = pa.Field(str_matches=r"^S\d{4}$")
    midterm_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    final_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    project_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    program: Series[str] = pa.Field(isin=["CS", "EE", "Math", "Physics", "Biology"])
    has_internet: Series[bool]
    # "| None" makes the column optional (it may be absent); nullable=True allows missing values.
    school_id: Series[str] | None = pa.Field(nullable=True)
    teacher_count: Series[int] = pa.Field(ge=1)
    school_size: Series[str] = pa.Field(isin=["Small", "Medium", "Large"])

    class Config:
        coerce = True

    @pa.check("student_id")
    def student_id_unique(cls, series: Series[str]) -> bool:  # noqa: N805
        return series.is_unique


validated = StudentDataSchema.validate(sample_df)
print("Class-based validation passed:", validated.shape)
Class-based validation passed: (3, 10)

The @pa.check decorator attaches a column-level check as a classmethod. The check receives the entire Series and must return either a single boolean for the whole column or a boolean Series with one result per value.

Pro Tip: Use DataFrameModel for reuse, DataFrameSchema for quick scripts

DataFrameModel is easier to subclass, document, and test: it reads like a dataclass and fits naturally alongside Pydantic models. DataFrameSchema is useful when you want to build a schema programmatically at runtime, e.g., from a config file or database metadata.

Activity 2 - Extend the Schema

Goal: Subclass StudentDataSchema and add a semester column constrained to ["Fall", "Spring", "Summer"]. Validate a DataFrame that includes the column with valid values, then one that has an invalid value. Confirm only the second raises a SchemaError.

class ExtendedSchema(StudentDataSchema):
    semester: Series[str] = pa.Field(isin=["Fall", "Spring", "Summer"])

4. Custom Checks

Built-in checks cover ranges, null counts, string patterns, and allowed values. Custom checks handle business rules that require more logic:

import pandera as pa
from pandera.typing import Series


class GradeSchema(pa.DataFrameModel):
    midterm_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    final_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    project_score: Series[float] = pa.Field(ge=0.0, le=100.0)

    @pa.check("midterm_score", name="not_all_perfect")
    def not_all_perfect_scores(cls, series: Series[float]) -> bool:  # noqa: N805
        return (series == 100.0).mean() < 0.5  # less than 50% perfect scores

    @pa.dataframe_check
    def weighted_average_reasonable(cls, df: pd.DataFrame) -> bool:  # noqa: N805
        avg = df["midterm_score"] * 0.3 + df["final_score"] * 0.45 + df["project_score"] * 0.25
        return (avg >= 0.0).all() and (avg <= 100.0).all()


# Validate the sample
result = GradeSchema.validate(sample_df[["midterm_score", "final_score", "project_score"]])
print("Custom checks passed:", result.shape)
Custom checks passed: (3, 3)

@pa.check("column_name") adds a column-level check: return True (the whole series passes), a boolean Series (element-wise result), or raise with a message.

@pa.dataframe_check gets the whole DataFrame: use it for cross-column constraints.

Activity 3 - Cross-Column Check

Goal: Add a @pa.dataframe_check to GradeSchema that verifies midterm_score and final_score are not both 0 for the same student (a student with both scores at 0 is almost certainly an error, not a result). Confirm it passes on valid data and fails when you introduce a row with both at 0.

# Add inside the GradeSchema class body
@pa.dataframe_check
def not_both_zero(cls, df: pd.DataFrame) -> pd.Series:  # noqa: N805
    return ~((df["midterm_score"] == 0) & (df["final_score"] == 0))

5. Validation in a Pipeline

In a pipeline, you want to validate without crashing on the first error, collect all failures, and then decide what to do with them:

import pandas as pd
import pandera as pa


def load_and_validate(
    path: str,
    schema: type[pa.DataFrameModel],
    *,
    lazy: bool = True,
) -> tuple[pd.DataFrame, list[dict]]:
    df = pd.read_csv(path)

    if not lazy:
        return schema.validate(df, lazy=False), []

    try:
        return schema.validate(df, lazy=True), []
    except pa.errors.SchemaErrors as e:
        error_df = e.failure_cases
        errors = error_df.to_dict(orient="records")
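        # Keep only rows with no row-level failure. Failures without a row
        # index (e.g. a missing column or a whole-column check) cannot be
        # dropped this way, so inspect `errors` before trusting clean_df.
        bad_idx = set(error_df["index"].dropna().astype(int).tolist())
        clean_df = df.drop(index=list(bad_idx)).reset_index(drop=True)
        return clean_df, errors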


# Use it
try:
    clean, errors = load_and_validate(
        "tutorials/data/university_analytics.csv",
        StudentDataSchema,
    )
    print(f"Clean rows: {len(clean)}, Errors: {len(errors)}")
    if errors:
        print("First error:", errors[0])
except FileNotFoundError:
    print("CSV not found in this context; run from repo root")
CSV not found in this context; run from repo root

Key Concept: lazy=True collects all errors; lazy=False stops on the first

schema.validate(df, lazy=False) (the default) raises a SchemaError on the first failure: fast and clear for development. schema.validate(df, lazy=True) collects every failure and raises a SchemaErrors (note the plural) at the end, better for production, where you want a full error report rather than a partial run.

Activity 4 - Full Error Report

Goal: Introduce three invalid rows into sample_df: one with midterm_score=150.0, one with an invalid program, and one with a duplicate student_id. Call StudentDataSchema.validate(bad_df, lazy=True). Catch the SchemaErrors exception and print the failure_cases DataFrame showing all three failures at once.

6. Schemas as Data Contracts in Tests

A Pandera schema is documentation that runs. Put it in a pytest fixture and every test that touches a DataFrame automatically validates the contract:

# tests/test_data_contracts.py  (save and run: uv run pytest tests/)

# ── paste into a test file and run with pytest ──
import pandas as pd
import pandera as pa
from pandera.typing import Series
import pytest


class StudentDataSchema(pa.DataFrameModel):
    student_id: Series[str] = pa.Field(str_matches=r"^S\d{4}$")
    midterm_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    final_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    project_score: Series[float] = pa.Field(ge=0.0, le=100.0)
    program: Series[str] = pa.Field(isin=["CS", "EE", "Math", "Physics", "Biology"])
    has_internet: Series[bool]
    teacher_count: Series[int] = pa.Field(ge=1)
    school_size: Series[str] = pa.Field(isin=["Small", "Medium", "Large"])

    class Config:
        coerce = True


@pytest.fixture
def valid_df():
    return pd.DataFrame(
        {
            "student_id": ["S0001", "S0002"],
            "midterm_score": [78.5, 65.0],
            "final_score": [82.0, 70.0],
            "project_score": [91.0, 75.0],
            "program": ["CS", "EE"],
            "has_internet": [True, False],
            "teacher_count": [12, 8],
            "school_size": ["Large", "Medium"],
        }
    )


def test_valid_data_passes_schema(valid_df):
    validated = StudentDataSchema.validate(valid_df)
    assert len(validated) == 2  # noqa: S101


def test_invalid_score_raises(valid_df):
    bad = valid_df.copy()
    bad.loc[0, "midterm_score"] = 150.0
    with pytest.raises(pa.errors.SchemaError):
        StudentDataSchema.validate(bad)


def test_invalid_program_raises(valid_df):
    bad = valid_df.copy()
    bad.loc[0, "program"] = "Underwater Basket Weaving"
    with pytest.raises(pa.errors.SchemaError):
        StudentDataSchema.validate(bad)


# Demonstration (not a real pytest run)
print("Tests defined. Run with: uv run pytest tests/test_data_contracts.py -v")

Pro Tip: Use pa.DataFrameModel.example() to generate test data automatically

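Pandera can synthesise valid DataFrames from a schema: StudentDataSchema.example(size=50) generates 50 valid rows matching all constraints. This removes the need to hand-craft test fixtures for every new schema. Data synthesis is built on Hypothesis, which is installed with the strategies extra: uv add "pandera[strategies]".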

Activity 5 - Schema-Driven Test Data

Goal: Call StudentDataSchema.example(size=20) to generate 20 synthetic rows. Confirm that the generated DataFrame passes StudentDataSchema.validate() without errors. Then confirm that if you corrupt one cell (set a score to 200), validation fails.

synthetic = StudentDataSchema.example(size=20)
StudentDataSchema.validate(synthetic)  # should pass

Capstone: Data Contract for grade-predictor

Bring Pandera into the grade-predictor pipeline as a first-class data contract.

Capstone - End-to-End Validated Pipeline

  1. Define StudentDataSchema in grade_predictor/schemas.py covering all columns of university_analytics.csv
  2. Update load_and_validate (from Part 19) to run Pandera schema validation after row-level Pydantic validation
  3. Add a @pa.dataframe_check that verifies the computed weighted average (using the weights from PipelineConfig) falls in [0, 100] for every row
  4. Write three tests: one that the real CSV passes the schema, one that a DataFrame with a bad score raises SchemaError, and one that uses example() to generate synthetic data and confirms it passes
  5. Run uv run pytest -v and confirm all three pass

7. Why Pandera? Comparing Schema Validation Tools

Before Pandera became the community standard for DataFrame validation, several tools tackled the same problem in different ways. Understanding the landscape helps you choose the right tool for your context.

Key Concept: Schema validation is not the same as data cleaning

Schema validation answers a yes/no question: does this DataFrame conform to the contract? It runs fast and fails loudly. Data cleaning transforms and repairs data. The two serve different purposes: validate at the pipeline boundary, clean before you get there.

Tool | Best for | Limitation
Pandera | DataFrame contracts in Python pipelines | pandas/polars-specific; not for arbitrary dicts
Great Expectations | Enterprise data quality suites, HTML reports, data docs | Heavy setup, complex configuration, overkill for most ML work
Pydantic | Row-level validation, API payloads, config models | Not designed for tabular data; looping over rows is slow
Frictionless Data | YAML-defined schemas, cross-language contracts | Less Python-native; validation logic lives in config files
Cerberus / marshmallow | Dict and object validation | No DataFrame support; designed for records not tables

The practical split comes down to scope. Pydantic is the right choice when you are validating individual records at a system boundary: an API payload, a config file, a single row entering a pipeline. Pandera is the right choice when you are validating the structure and statistics of an entire DataFrame. They are complementary, not competing: Part 19 showed you Pydantic for the row, this notebook shows you Pandera for the table.

Great Expectations is powerful but designed for data engineering teams who need data documentation, alerting, and audit trails across multiple data sources. For most ML projects, Pandera gives you 90% of the value with 10% of the setup.

Pro Tip: Use both Pydantic and Pandera in the same pipeline

Validate incoming JSON records with Pydantic as they arrive, then validate the assembled DataFrame with Pandera before it enters any model or transformation. The two checks run at different granularities and catch different classes of bugs: Pydantic catches a bad individual record, Pandera catches drift in the distribution across hundreds of records.

Further Reading

Resource | Why it matters
Pandera documentation | Full reference: DataFrameSchema, DataFrameModel, check types, Polars support
Pandera + Polars | Same API works on Polars DataFrames
Pandera pytest integration | @pa.check_types decorator for function-level schema enforcement
Pydantic + Pandera together | Use a Pandera schema as a Pydantic field type
Great Expectations | Enterprise-grade data quality platform; Pandera for code-first, GX for teams with data catalogs

Summary

Concept | Key rule
DataFrameSchema | Functional API: describe columns, constraints, table-level checks as a dict
DataFrameModel | Class-based API: columns as annotated fields, checks as classmethods
pa.Field(ge=, le=) | Same constraint vocabulary as Pydantic’s Field
pa.Check.isin([...]) | Restrict a column to an explicit set of values
@pa.check("col") | Column-level custom check: receives the Series, returns bool or bool Series
@pa.dataframe_check | Cross-column check: receives the full DataFrame
lazy=True | Collect all failures; raises SchemaErrors (plural)
lazy=False | Stop on first failure; raises SchemaError
schema.example(size=n) | Generate n rows of valid synthetic data for testing
Pydantic + Pandera | Validate rows with Pydantic, validate the assembled table with Pandera

Next: Part 21: Classical ML: Scikit-learn pipelines applied to the fully validated, typed grade-predictor dataset.