Part 11: Polars & the Expression API

Open In Colab Download Notebook

DS-MLOps Data Analysis

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook continues from Part 10 (10-combining-reshaping.ipynb). You should be comfortable with pandas DataFrames, groupby, merging, and time-indexed data before starting here.

Polars is a modern DataFrame library written in Rust. It uses an expression-based API and a lazy evaluation model that lets it optimize queries before running them. The same two datasets from Parts 8-10 are used here so you can compare pandas and Polars syntax on familiar ground.

Topic Why it matters
DataFrame construction Polars and pandas share the concept; the API differs in key ways
Expression API pl.col() expressions replace bracket indexing and make queries composable
GroupBy and aggregation Same split-apply-combine logic, cleaner syntax
LazyFrame Defer execution and let Polars optimize the query plan before running
Date/time Polars parses and handles dates natively, no .dt accessor for most operations
Migration patterns Side-by-side pandas-to-Polars translations for the operations you already know
over() and streaming Group-aware window stats without reducing rows; process files larger than RAM in chunks

Callout markers used throughout this notebook are explained on the book cover page.

By the end of Part 11 (Polars) you will be able to:

# Skill Covered in
1 Build and inspect Polars DataFrames and compare them to pandas Sec. 1
2 Use pl.col() expressions to select, filter, and derive columns Sec. 2
3 Aggregate data with group_by and the expression API Sec. 3
4 Understand eager vs lazy evaluation and use LazyFrame Sec. 4
5 Parse and manipulate dates with Polars’s built-in datetime support Sec. 5
6 Translate common pandas operations to Polars equivalents Sec. 6
7 Compute per-group window stats with over() and process large files with scan_csv Sec. 7
import numpy as np
import polars as pl

# Polars reads CSV directly; no separate parse step needed for standard columns
df = pl.read_csv("data/university_analytics.csv")
print(f"shape : {df.shape}")
df.head()
shape : (2400, 18)
shape: (5, 18)
student_id cohort program gender region guardian has_internet course_id course semester enrollment_date study_hours attendance_pct midterm_score final_score project_score final_grade passed
str i64 str str str str bool str str str str f64 f64 f64 f64 f64 str bool
"S0001" 2023 "Information Technology" "F" "South" "Father" true "C01" "Python Programming" "Fall 2023" "2023-09-04" 10.8 88.6 46.1 54.4 57.8 "D" true
"S0001" 2023 "Information Technology" "F" "South" "Father" true "C02" "Statistics" "Spring 2024" "2024-01-15" 16.1 71.5 49.9 64.9 67.0 "C" true
"S0001" 2023 "Information Technology" "F" "South" "Father" true "C03" "Data Structures" "Fall 2023" "2023-09-04" 23.6 71.5 57.3 64.1 83.0 "C" true
"S0001" 2023 "Information Technology" "F" "South" "Father" true "C04" "Linear Algebra" "Spring 2024" "2024-01-15" 4.7 78.3 75.6 59.5 64.3 "C" true
"S0001" 2023 "Information Technology" "F" "South" "Father" true "C05" "Machine Learning" "Fall 2023" "2023-09-04" 16.2 63.2 53.6 59.4 57.0 "C" true

0. When pandas Is Not Enough

Your pandas pipeline works perfectly at 50,000 rows. The groupby runs in a second, the merge finishes before you blink, and everything fits in memory with room to spare. Then someone drops a 10-million-row export on your desk. The same pipeline now takes three minutes, swaps to disk halfway through, and your laptop fan starts spinning like it is trying to take off.

This is not a pandas bug. It is a design trade-off. pandas processes one column at a time on a single CPU thread, using a Python object model that was designed for flexibility rather than throughput. For most interactive data science that trade-off is exactly right. But when data gets large enough to feel slow, you need a different tool.

Polars (pola.rs) was written from scratch in Rust by Ritchie Vink in 2020. It stores data in Apache Arrow format, executes operations across all CPU cores in parallel, and evaluates expressions lazily — building a query plan and optimising it before touching a single row. The same operations that take minutes in pandas can take seconds in Polars, without changing what the code looks like.

How it compares

pandas Polars
Core language Python / C Rust
Execution Single-threaded Multi-threaded
Memory model NumPy / Python objects Apache Arrow
Evaluation Eager (runs immediately) Lazy by default (scan_*.collect())
API style Index-based, many methods Expression-based, composable
Ecosystem maturity Very mature (2008) Fast-growing (2020)
When to use Up to ~5M rows; default choice Large files, performance-critical pipelines

You do not have to choose one forever. Many production pipelines read and clean with Polars for speed, then convert to pandas for the parts of the ML ecosystem that expect it.

Already in your environment

uv add polars          # for a standalone project

Official docs and user guide: docs.pola.rs

1. DataFrame Construction

Building a Polars DataFrame from a Python dict looks almost identical to pandas. The difference shows up in the schema: Polars always knows the dtype of every column, and it tells you explicitly:

import polars as pl

sample = pl.DataFrame(
    {
        "student_id": ["s001", "s002", "s003"],
        "midterm_score": [0.62, 0.78, 0.91],
        "gender": ["F", "M", "F"],
    }
)
print(sample.schema)
sample
Schema({'student_id': String, 'midterm_score': Float64, 'gender': String})
shape: (3, 3)
student_id midterm_score gender
str f64 str
"s001" 0.62 "F"
"s002" 0.78 "M"
"s003" 0.91 "F"

Key Concept: Polars enforces strict dtypes; pandas 3 infers them

Every Polars column has a fixed dtype from creation: pl.Utf8 for strings, pl.Float64 for floats, pl.Int64 for integers. Polars raises an error if you try to mix types within a column. Pandas 3’s str dtype is conceptually similar but more permissive: it silently allows None values in a string column. Polars is stricter, which makes bugs surface earlier.

The first checks after loading data are the same in Polars as in pandas, but the method names differ slightly:

print(f"shape      : {df.shape}")
print(f"dtypes     :\n{df.dtypes}")
df.describe()
shape      : (2400, 18)
dtypes     :
[String, Int64, String, String, String, String, Boolean, String, String, String, String, Float64, Float64, Float64, Float64, Float64, String, Boolean]
shape: (9, 19)
statistic student_id cohort program gender region guardian has_internet course_id course semester enrollment_date study_hours attendance_pct midterm_score final_score project_score final_grade passed
str str f64 str str str str f64 str str str str f64 f64 f64 f64 f64 str f64
"count" "2400" 2400.0 "2400" "2400" "2400" "2400" 2400.0 "2400" "2400" "2400" "2400" 2400.0 2327.0 2330.0 2400.0 2400.0 "2400" 2400.0
"null_count" "0" 0.0 "0" "0" "0" "0" 0.0 "0" "0" "0" "0" 0.0 73.0 70.0 0.0 0.0 "0" 0.0
"mean" null 2023.0075 null null null null 0.745 null null null null 18.029333 75.057542 60.846481 59.8195 65.280292 null 0.961667
"std" null 0.798567 null null null null null null null null null 6.88189 14.115225 13.019003 15.076345 10.518972 null null
"min" "S0001" 2022.0 "Computer Science" "F" "Central" "Father" 0.0 "C01" "Data Structures" "Fall 2022" "2022-09-05" 2.0 28.2 17.0 10.0 15.0 "A" 0.0
"25%" null 2022.0 null null null null null null null null null 13.3 64.8 51.7 49.8 58.3 null null
"50%" null 2023.0 null null null null null null null null null 17.8 75.0 60.9 60.2 65.4 null null
"75%" null 2024.0 null null null null null null null null null 22.6 85.2 69.6 70.0 72.5 null null
"max" "S0400" 2024.0 "Information Technology" "Other" "West" "Sibling" 1.0 "C06" "Statistics" "Spring 2025" "2025-01-13" 45.0 100.0 100.0 100.0 100.0 "F" 1.0

Pro Tip: Use df.schema to see column names and dtypes in one dict-like view

df.schema returns a Schema object, a mapping from column name to Polars dtype. It is the fastest way to confirm that read_csv inferred the right types before any analysis starts.

Activity 1 - First Look at the Polars DataFrame

Goal: Print the schema of df, then use df.describe() to get summary statistics. Compare the dtype labels to the pandas output from Part 8.
print(df.schema)
df.describe()
# TODO: print schema, then call df.describe()
...

2. The Expression API: select, filter, with_columns

The biggest difference from pandas is that Polars does not use bracket indexing for column selection or boolean filtering. Instead, every operation goes through pl.col() expressions, which describe what to compute without computing it immediately.

# Select specific columns -- equivalent to df[["col_a", "col_b"]] in pandas
df.select(["student_id", "midterm_score", "final_score"]).head(3)
shape: (3, 3)
student_id midterm_score final_score
str f64 f64
"S0001" 46.1 54.4
"S0001" 49.9 64.9
"S0001" 57.3 64.1

Key Concept: pl.col() is an expression, not a value

pl.col(“midterm_score”) describes a column reference, not the column itself. Expressions compose: pl.col(“a”) + pl.col(“b”) describes an addition without performing it. Polars evaluates all expressions in a .select() or .with_columns() call together, which is what lets it parallelize across columns automatically.

filter is the Polars equivalent of boolean indexing:

# filter -- equivalent to df[df["midterm_score"] > 90] in pandas
high_scorers = df.filter(pl.col("midterm_score") > 90)
print(f"high scorers: {len(high_scorers)} of {len(df)} students")
high_scorers.head()
high scorers: 31 of 2400 students
shape: (5, 18)
student_id cohort program gender region guardian has_internet course_id course semester enrollment_date study_hours attendance_pct midterm_score final_score project_score final_grade passed
str i64 str str str str bool str str str str f64 f64 f64 f64 f64 str bool
"S0012" 2024 "Engineering" "M" "North" "Mother" false "C05" "Machine Learning" "Fall 2024" "2024-09-02" 23.2 87.3 90.3 55.3 56.4 "C" true
"S0022" 2022 "Data Science" "F" "North" "Mother" true "C05" "Machine Learning" "Fall 2022" "2022-09-05" 22.4 63.2 98.7 48.1 56.2 "C" true
"S0026" 2024 "Computer Science" "M" "Central" "Father" true "C01" "Python Programming" "Fall 2024" "2024-09-02" 26.7 85.3 90.3 47.6 65.9 "C" true
"S0027" 2024 "Data Science" "F" "East" "Other" true "C04" "Linear Algebra" "Spring 2025" "2025-01-13" 26.2 78.0 95.7 70.9 66.6 "B" true
"S0043" 2024 "Information Technology" "M" "East" "Mother" true "C04" "Linear Algebra" "Spring 2025" "2025-01-13" 12.8 85.5 94.7 53.8 64.8 "C" true

Combine conditions with & and |, same as pandas, but directly inside filter with pl.col() expressions:

# combine conditions: failed AND no internet access
failed_no_internet = df.filter(
    (pl.col("final_grade") == "F") & (pl.col("has_internet") == False)  # noqa: E712
)
print(f"failed with no internet: {len(failed_no_internet)}")
failed with no internet: 60

with_columns adds or replaces columns. The result is always a new DataFrame; the original is never modified:

# Add average_marks -- equivalent to df["average_marks"] = ... in pandas
df = df.with_columns(average_marks=((pl.col("midterm_score") + pl.col("final_score") + pl.col("project_score")) / 3))
df.select(["student_id", "average_marks"]).head()
shape: (5, 2)
student_id average_marks
str f64
"S0001" 52.766667
"S0001" 60.6
"S0001" 68.133333
"S0001" 66.466667
"S0001" 56.666667

Common Mistake: Using bracket assignment instead of with_columns

df[“average_marks”] = … raises an error in Polars, unlike pandas. Polars DataFrames are immutable by design: the only way to add or modify a column is to call .with_columns() and reassign the result. This enforces the copy-on-write discipline explicitly, rather than relying on the runtime to enforce it as pandas 3 does.

Activity 2 - Filter and Derive

Goal: Add a passed column that is True when average_marks >= 0.5, using .with_columns() and pl.col(“average_marks”) >= 0.5. Then filter to passing students and print how many there are.
df = df.with_columns(
    passed=(pl.col("average_marks") >= 0.5)
)
df.filter(pl.col("passed")).shape[0]
# TODO: add "passed" column, filter to passing students, print count
...

Pro Tip: Use polars.selectors to select columns by type, not by name

polars.selectors (aliased as cs) provides type-based column selectors, so you can write expressions like “all numeric columns” or “all string columns” without listing names:

import polars.selectors as cs

# Normalise every numeric column in one expression
df.with_columns(
    (cs.numeric() - cs.numeric().mean()) / cs.numeric().std()
)

# Select only string columns
df.select(cs.string())

Common selectors: cs.numeric(), cs.string(), cs.boolean(), cs.temporal(), cs.by_dtype(pl.Float64). Combine them with | (union) or & (intersection): cs.numeric() | cs.boolean(). This is especially useful in feature engineering pipelines where the set of columns may change between runs.

3. GroupBy and Aggregation

Polars group_by uses the same split-apply-combine concept as pandas, but the aggregation is specified as a list of expressions inside .agg(), not as a method chained after .mean() or .sum():

df.group_by("program").agg(
    pl.col("average_marks").mean().alias("mean_marks"),
    pl.col("average_marks").std().alias("std_marks"),
    pl.len().alias("count"),
).sort("program")
shape: (4, 4)
program mean_marks std_marks count
str f64 f64 u32
"Computer Science" 62.001468 8.08436 726
"Data Science" 63.685264 8.457177 828
"Engineering" 60.603548 7.907489 318
"Information Technology" 59.920338 8.394286 528

Key Concept: .agg() takes a list of expressions, each producing one output column

In pandas, groupby(“program”)[“average_marks”].agg([“mean”, “std”, “count”]) returns a DataFrame with the stats as columns. In Polars the same result comes from passing named expressions to .agg(): pl.col(“average_marks”).mean().alias(“mean_marks”). The explicit .alias() is required when the default auto-generated name would be ambiguous.

Grouping by more than one column works the same way:

df.group_by(["program", "gender"]).agg(
    pl.col("average_marks").mean().alias("mean_marks"),
).sort(["program", "gender"])
shape: (11, 3)
program gender mean_marks
str str f64
"Computer Science" "F" 61.961067
"Computer Science" "M" 61.872101
"Computer Science" "Other" 63.088652
"Data Science" "F" 63.518499
"Data Science" "M" 63.971014
"Engineering" "F" 61.998272
"Engineering" "M" 59.527619
"Information Technology" "F" 59.831481
"Information Technology" "M" 60.243373
"Information Technology" "Other" 55.083333

Pro Tip: Sort the result explicitly: group_by order is not guaranteed

Polars group_by does not preserve or sort the output by the grouping key. Always chain .sort(“key_col”) after .agg() if the output order matters. The pandas equivalent sorted by default in older versions but no longer guarantees it either.

Activity 3 - Pass Rate by Program

Goal: Group df by program and compute, for each group, the fraction of students whose continue_drop is False. In Polars, (pl.col(“continue_drop”) == False).mean() gives the fraction directly.
df.group_by("program").agg(
    drop_fraction=(pl.col("continue_drop") == False).mean()
).sort("program")
# TODO: group by program, compute pass fraction per program, sort by program
...

4. LazyFrame and Lazy Evaluation

Every Polars operation above ran eagerly: it computed the result immediately and returned a new DataFrame. df.lazy() switches to lazy mode. In lazy mode, operations build a query plan instead of executing immediately. .collect() runs the optimized plan.

Lazy evaluation: plan builds up, data reads only on .collect()

flowchart LR
    A["scan_csv('file.csv')
no data read yet"] --> B["filter(col > 60)
adds predicate"]
    B --> C["group_by('program')
adds aggregation"]
    C --> D["select(['program','mean'])
projects columns"]
    D -->|".collect()"| E["query optimised:
1. predicate pushdown
2. projection pruning
3. execute in chunks"]
    E --> F["DataFrame
only now in memory"]

    style A fill:#F5F3FF,stroke:#7C3AED,color:#3B0764
    style E fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E
    style F fill:#EBF5F0,stroke:#059669,color:#065F46

# Build a query plan -- nothing executes yet
lazy_query = (
    df.lazy()
    .filter(pl.col("has_internet") == 1)
    .group_by("program")
    .agg(pl.col("average_marks").mean().alias("mean_marks"))
    .sort("mean_marks", descending=True)
)

# Inspect the query plan before running
print(lazy_query.explain())
SORT BY [descending: [true]] [col("mean_marks")]
  AGGREGATE[maintain_order: false]
    [col("average_marks").mean().alias("mean_marks")] BY [col("program")]
    FROM
    simple π 2/2 ["program", "average_marks"]
      FILTER [(col("has_internet").cast(Int32)) == (1)]
      FROM
        DF ["student_id", "cohort", "program", "gender", ...]; PROJECT["program", "average_marks", "has_internet"] 3/19 COLUMNS
# Execute the plan and get the result
lazy_query.collect()
shape: (4, 2)
program mean_marks
str f64
"Data Science" 64.831512
"Computer Science" 63.752968
"Engineering" 62.546487
"Information Technology" 61.896686

Key Concept: Lazy evaluation lets Polars optimize the query before running it

When you call .collect(), Polars analyses the whole query plan: it pushes filters as early as possible (predicate pushdown), drops columns that are not needed (projection pushdown), and parallelizes independent operations. The same query in eager mode runs each step sequentially with no cross-step optimization. For large datasets the difference is significant; for small ones it is negligible. The API is identical apart from the .lazy() / .collect() bookends.

Common Mistake: Forgetting .collect() and treating a LazyFrame as a result

df.lazy().filter(…).group_by(…).agg(…) returns a LazyFrame, not a DataFrame. Printing it shows the query plan, not the data. Call .collect() to get a real DataFrame. If you need the shape, columns, or to iterate over rows, you must collect first.

Activity 4 - Lazy Top Schools

Goal: Rewrite the top-schools capstone from Part 8 using a lazy query: filter to students with internet == 1, group by school_id, take the mean average_marks, sort descending, and .collect() the top 5.
top_schools = (
    df.lazy()
    .filter(pl.col("has_internet") == 1)
    .group_by("school_id")
    .agg(pl.col("average_marks").mean().alias("mean_marks"))
    .sort("mean_marks", descending=True)
    .limit(5)
    .collect()
)
top_schools
# TODO: lazy top-schools query
...

4.5 SQL Interface

Polars 1.x ships a built-in SQL interface. pl.SQLContext registers one or more DataFrames under names and lets you query them with standard SQL. This is useful for teams comfortable with SQL, for migrating existing SQL queries to Polars, and for ad-hoc exploration without learning the full expression API:

Key Concept: SQL queries run on lazy DataFrames and benefit from the same query optimization

pl.SQLContext compiles SQL into Polars’ lazy query plan. The optimizer applies the same predicate pushdown and projection pushdown as a hand-written .lazy().filter().select() chain. You get SQL’s readability and Polars’ performance for free.

import polars as pl

# Register DataFrames under SQL table names
ctx = pl.SQLContext(students=df, eager=True)

# Query 1: top-performing programs — pure SQL
result = ctx.execute("""
    SELECT
        program,
        AVG(final_score)    AS mean_score,
        COUNT(*)            AS n_students
    FROM students
    GROUP BY program
    ORDER BY mean_score DESC
""")
print(result)

# Query 2: filter + select — same as df.filter().select() but SQL syntax
high_scorers_sql = ctx.execute("""
    SELECT student_id, final_score
    FROM students
    WHERE final_score > 0.9
    ORDER BY final_score DESC
    LIMIT 5
""")
print(high_scorers_sql)
shape: (4, 3)
┌────────────────────────┬────────────┬────────────┐
│ program                ┆ mean_score ┆ n_students │
│ ---                    ┆ ---        ┆ ---        │
│ str                    ┆ f64        ┆ u32        │
╞════════════════════════╪════════════╪════════════╡
│ Data Science           ┆ 62.023551  ┆ 828        │
│ Computer Science       ┆ 59.627824  ┆ 726        │
│ Engineering            ┆ 58.146226  ┆ 318        │
│ Information Technology ┆ 57.63447   ┆ 528        │
└────────────────────────┴────────────┴────────────┘
shape: (5, 2)
┌────────────┬─────────────┐
│ student_id ┆ final_score │
│ ---        ┆ ---         │
│ str        ┆ f64         │
╞════════════╪═════════════╡
│ S0335      ┆ 100.0       │
│ S0010      ┆ 100.0       │
│ S0017      ┆ 100.0       │
│ S0021      ┆ 100.0       │
│ S0125      ┆ 100.0       │
└────────────┴─────────────┘

5. Date/Time in Polars

Polars can parse dates directly at read time with try_parse_dates=True. There is no separate to_datetime step for standard ISO-format date strings:

attendance = pl.read_csv("data/daily_attendance.csv", try_parse_dates=True)
print(attendance.dtypes)
attendance.head()
[Date, Int64, Float64]
shape: (5, 3)
date school_id attendance_rate
date i64 f64
2025-01-06 300 0.874
2025-01-07 300 0.928
2025-01-08 300 0.933
2025-01-09 300 0.847
2025-01-10 300 0.866

Key Concept: Polars parses dates at load time; pandas requires an explicit conversion step

In pandas, date columns read as str dtype until you call pd.to_datetime(). Polars with try_parse_dates=True infers and parses ISO-format dates automatically during read_csv, producing a pl.Date or pl.Datetime column. This means filtering and arithmetic work on dates immediately, without a conversion step.

Date-component extraction uses the .dt namespace, similar to pandas:

# Extract year, month, and weekday from the date column
attendance.select(
    [
        "date",
        pl.col("date").dt.year().alias("year"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.weekday().alias("weekday"),  # Mon=1 ... Sun=7
    ]
).head()
shape: (5, 4)
date year month weekday
date i32 i8 i8
2025-01-06 2025 1 1
2025-01-07 2025 1 2
2025-01-08 2025 1 3
2025-01-09 2025 1 4
2025-01-10 2025 1 5

Resampling to weekly or monthly means uses group_by_dynamic instead of pandas’ resample:

# Monthly mean attendance rate per school using group_by_dynamic
school_300 = attendance.filter(pl.col("school_id") == 300)

school_300.sort("date").group_by_dynamic("date", every="1mo", group_by="school_id").agg(
    pl.col("attendance_rate").mean().alias("monthly_mean")
)
shape: (3, 3)
school_id date monthly_mean
i64 date f64
300 2025-01-01 0.90345
300 2025-02-01 0.90985
300 2025-03-01 0.8651

Pro Tip: group_by_dynamic replaces resample in Polars

series.resample(“M”).mean() in pandas becomes df.sort(“date”).group_by_dynamic(“date”, every=“1mo”).agg(…) in Polars. The sort before group_by_dynamic is required: Polars raises an error if the time column is not sorted.

Activity 5 - Monthly Attendance by School

Goal: Use group_by_dynamic to compute the mean attendance_rate per school per month across all schools in the attendance table. Sort by date before grouping.
attendance.sort("date").group_by_dynamic(
    "date", every="1mo", group_by="school_id"
).agg(
    pl.col("attendance_rate").mean().alias("monthly_mean")
).sort(["school_id", "date"])
# TODO: monthly mean attendance per school using group_by_dynamic
...

6. Migration Patterns: pandas to Polars

The table below maps the most common pandas operations from Parts 8-10 to their Polars equivalents. The concepts are the same; the API surface is different:

Task pandas Polars
Load CSV pd.read_csv(path) pl.read_csv(path, try_parse_dates=True)
Select columns df[["a", "b"]] df.select(["a", "b"])
Filter rows df[df["col"] > 0.5] df.filter(pl.col("col") > 0.5)
Add column df["c"] = df["a"] + df["b"] df.with_columns(c=pl.col("a") + pl.col("b"))
Rename column df.rename(columns={"old": "new"}) df.rename({"old": "new"})
GroupBy + agg df.groupby("k")["v"].mean() df.group_by("k").agg(pl.col("v").mean())
Sort df.sort_values("col", ascending=False) df.sort("col", descending=True)
Missing values df.isna().sum() df.null_count()
Fill missing df["col"].fillna(val) df.with_columns(pl.col("col").fill_null(val))
One-hot encode pd.get_dummies(df["col"]) df.to_dummies("col")
Value counts df["col"].value_counts() df["col"].value_counts()
Lazy query (eager only) df.lazy().collect()
Resample series.resample("M").mean() df.sort("date").group_by_dynamic("date", every="1mo").agg(...)

Key Concept: Polars is strict; pandas is permissive

Polars raises errors early: wrong dtype in a column, missing .collect() on a LazyFrame, unsorted data before group_by_dynamic. Pandas absorbs most of these quietly, sometimes silently producing a wrong result. The Polars errors feel strict at first but save time in larger pipelines where a wrong-but-valid pandas result would surface much later.

7. Window Functions with over() and Streaming Large Files

Two Polars strengths that have no direct pandas equivalent: over() for grouped window calculations without reducing rows, and scan_csv/scan_parquet for processing files that do not fit in memory.

7a. The over() Expression

In pandas, adding a per-group statistic back to the original rows requires groupby().transform(). In Polars, pl.col("x").mean().over("group") does the same thing inside a single with_columns() call. It computes the aggregation per group and broadcasts the result back to every row, aligned to the original index automatically.

# Add the per-program mean and std to every student row, in one expression
df = df.with_columns(
    program_mean=pl.col("average_marks").mean().over("program"),
    program_std=pl.col("average_marks").std().over("program"),
)

# Z-score: how many standard deviations from the program mean?
df = df.with_columns(marks_zscore=(pl.col("average_marks") - pl.col("program_mean")) / pl.col("program_std"))

df.select(["student_id", "program", "average_marks", "program_mean", "marks_zscore"]).head(6)
shape: (6, 5)
student_id program average_marks program_mean marks_zscore
str str f64 f64 f64
"S0001" "Information Technology" 52.766667 59.920338 -0.852207
"S0001" "Information Technology" 60.6 59.920338 0.080967
"S0001" "Information Technology" 68.133333 59.920338 0.978403
"S0001" "Information Technology" 66.466667 59.920338 0.779855
"S0001" "Information Technology" 56.666667 59.920338 -0.387605
"S0001" "Information Technology" 69.466667 59.920338 1.137241

Key Concept: over() computes group stats without reducing rows

group_by().agg() produces one row per group: the result has as many rows as there are unique group values. .over(“group”) inside with_columns() keeps all original rows and attaches the group stat to each one. This is the right tool for feature engineering: every student in the “Engineering” program gets the engineering mean attached to their row, ready for modelling without a join.

# Rolling mean per school using over() -- group-aware rolling window
import polars as pl

attendance = pl.read_csv("data/daily_attendance.csv", try_parse_dates=True)

attendance_with_rolling = attendance.sort(["school_id", "date"]).with_columns(
    rolling_avg=pl.col("attendance_rate").rolling_mean(window_size=5, min_periods=3).over("school_id")
)

attendance_with_rolling.filter(pl.col("school_id") == 300).head(8)
shape: (8, 4)
date school_id attendance_rate rolling_avg
date i64 f64 f64
2025-01-06 300 0.874 null
2025-01-07 300 0.928 null
2025-01-08 300 0.933 0.911667
2025-01-09 300 0.847 0.8955
2025-01-10 300 0.866 0.8896
2025-01-13 300 0.909 0.8966
2025-01-14 300 0.896 0.8902
2025-01-15 300 0.905 0.8846

Activity 6 - Per-Program Z-Score

Goal: Using over(), add a column gender_mean_marks that is the mean average_marks within each gender group. Then add gender_zscore = (average_marks - gender_mean_marks) / gender_std_marks. Print the top 3 students by z-score in each gender.
df_z = df.with_columns(
    gender_mean = pl.col("average_marks").mean().over("gender"),
    gender_std  = pl.col("average_marks").std().over("gender"),
).with_columns(
    gender_zscore = (pl.col("average_marks") - pl.col("gender_mean")) / pl.col("gender_std")
)
df_z.sort("gender_zscore", descending=True).select(["student_id", "gender", "average_marks", "gender_zscore"]).head(6)
# TODO: per-gender z-score using over()
...

7b. Streaming: Processing Files Larger Than Memory

pl.scan_csv() and pl.scan_parquet() read a file lazily: no data loads until .collect() is called. Combined with Polars’ streaming engine, filters and projections execute in chunks, so the working memory never exceeds what the query actually needs, even if the source file is larger than RAM.

The API is identical to eager reads; the difference is scan_ instead of read_.

# scan_csv: no data loads here, only a query plan is built
lazy_scan = (
    pl.scan_csv("data/university_analytics.csv")
    .filter(pl.col("final_score") > 80)
    .select(["student_id", "program", "final_score"])
    .sort("final_score", descending=True)
)

# .collect() executes with streaming (chunks through the file)
result = lazy_scan.collect(streaming=True)
print(f"Rows matching filter: {len(result)}")
result.head()
Rows matching filter: 197
shape: (5, 3)
student_id program final_score
str str f64
"S0010" "Data Science" 100.0
"S0017" "Data Science" 100.0
"S0021" "Information Technology" 100.0
"S0125" "Computer Science" 100.0
"S0148" "Data Science" 100.0
Pro Tip: Use Parquet instead of CSV for large files

CSV is plain text: Polars must read every byte to find the columns you asked for. Parquet is columnar: scan_parquet() reads only the columns referenced in your query, skipping all others. A file with 50 columns where you only need 3 loads roughly 3/50th of the data. For production pipelines with large files, converting a CSV to Parquet once and scanning Parquet from then on is a common and effective optimisation.
# Convert once
df.write_parquet("data/university_analytics.parquet")

# All future reads: columnar, faster, smaller
pl.scan_parquet("data/university_analytics.parquet")   .select(["student_id", "final_score"])   .filter(pl.col("final_score") > 80)   .collect(streaming=True)

Common Mistake: Not sorting before a rolling window in Polars

pl.col(“rate”).rolling_mean(5).over(“school_id”) operates on whatever order the rows are in. Without .sort([“school_id”, “date”]) first, the 5-row window picks up rows in an arbitrary order and produces meaningless values. Polars does not sort automatically; it raises an error only for group_by_dynamic. For rolling_mean().over(), the wrong-order mistake is silent.

Capstone: Reproduce the pandas Capstone Exercises in Polars

Capstone Exercise - Top Schools, Risk Report, and Regional Summary

Goal: Reproduce three capstones from earlier parts using Polars:
  1. Top schools (Part 8): Filter to students with internet == 1, group by school_id, compute mean average_marks, sort descending, show top 5.
  2. Risk report (Part 9): Add a grade column based on average_marks using pl.when().then().otherwise(). Print grade distribution as proportions.
  3. Regional summary (Part 10): Create a schools DataFrame mapping school_id to region, join it onto df, and compute mean marks per region.
# 1. Top schools
top_schools = (
    df.lazy()
    .filter(pl.col("has_internet") == 1)
    .group_by("school_id")
    .agg(pl.col("average_marks").mean().alias("mean_marks"))
    .sort("mean_marks", descending=True)
    .limit(5)
    .collect()
)

# 2. Risk report -- pl.when().then().otherwise() replaces df.apply(letter_grade)
df_graded = df.with_columns(
    grade=pl.when(pl.col("average_marks") >= 0.8).then(pl.lit("A"))
           .when(pl.col("average_marks") >= 0.6).then(pl.lit("B"))
           .when(pl.col("average_marks") >= 0.4).then(pl.lit("C"))
           .otherwise(pl.lit("D"))
)
grade_distribution = df_graded["grade"].value_counts(normalize=True).sort("grade")

# 3. Regional summary
import polars as pl
school_ids = df["school_id"].unique().sort()
regions = ["North", "South", "East", "West"]
schools = pl.DataFrame({
    "school_id": school_ids,
    "region": [regions[i % 4] for i in range(len(school_ids))],
})
merged = df.join(schools, on="school_id", how="left")
regional_summary = merged.group_by("region").agg(
    pl.col("average_marks").mean().alias("mean_marks")
).sort("region")
# TODO: reproduce top schools, risk report, and regional summary in Polars
...

Further Reading

Resource Why it matters
Polars documentation: Getting started The primary reference; the user guide covers lazy evaluation, the expression API, and performance tips
Polars migration guide from pandas A direct comparison of common pandas idioms rewritten in Polars; read this alongside this notebook
Vink, R. (2023). Modern Polars Free online book that mirrors the official pandas docs structure, useful for side-by-side comparison
Polars expression cheat sheet over(), rolling_*, selectors, and string ops in one place
Ritchie Vink, creator of Polars, explains the design in the Polars blog The posts on the expression API and lazy evaluation explain why certain pandas patterns do not translate

Summary

Concept Key rule
pl.read_csv(try_parse_dates=True) Load data with automatic date inference
df.schema Dict-like view of column names and Polars dtypes
df.select([...]) Select columns; equivalent to df[["a", "b"]] in pandas
pl.col("name") Column expression; compose with arithmetic, comparisons, and string ops
df.filter(expr) Keep rows where the expression is True
df.with_columns(alias=expr) Add or overwrite columns; returns a new DataFrame
df.group_by(key).agg([exprs]) Split-apply-combine with explicit expression list
.alias("name") Rename an expression’s output column
df.lazy().collect() Defer execution so Polars can optimize the query plan
pl.when().then().otherwise() Conditional column derivation; replaces apply(lambda ...)
df.join(other, on=key, how=...) SQL-style join; same how values as pandas merge
.dt.year(), .dt.month() Date-component extraction in the .dt namespace
group_by_dynamic("date", every="1mo") Time-window grouping; requires sorted input
df.null_count() Count missing values per column; replaces df.isna().sum()
.over("group") Broadcast a group aggregation back to every row; no join needed
pl.col("x").rolling_mean(n).over("g") Per-group rolling window; sort by group and time first
pl.scan_csv() / pl.scan_parquet() Lazy file scan; push-down filters execute without loading full file
.collect(streaming=True) Execute in chunks; keeps peak memory low for files larger than RAM