Chapter 9: Pandas operations

The CSV is loaded and clean. Now the questions get specific: which students went from failing to passing between semesters? Which course names have a trailing space or a typo? What is the 10-student rolling average for each program? None of these answers live in a raw column: you have to derive them.

Chapter 8 covered loading, selecting, filtering, and fixing missing values: the shape and structure of a DataFrame. The operations that turn raw columns into derived ones are covered here: apply your own functions row-by-row, count categories, clean text, summarise groups, and compute window statistics. Chapter 10 (10-combining-reshaping.ipynb) builds on this with groupby, merge, pivot, and time-series reshaping.


import numpy as np
import pandas as pd

df = pd.read_csv("data/university_analytics.csv")
df["average_marks"] = (df["midterm_score"] + df["final_score"] + df["project_score"]) / 3
df.head()

	student_id	cohort	program	gender	region	guardian	has_internet	course_id	course	semester	enrollment_date	study_hours	attendance_pct	midterm_score	final_score	project_score	final_grade	passed	average_marks
0	S0001	2023	Information Technology	F	South	Father	True	C01	Python Programming	Fall 2023	2023-09-04	10.8	88.6	49.9	58.2	61.6	C	True	56.566667
1	S0001	2023	Information Technology	F	South	Father	True	C02	Statistics	Spring 2024	2024-01-15	16.1	71.5	48.7	63.7	65.8	C	True	59.400000
2	S0001	2023	Information Technology	F	South	Father	True	C03	Data Structures	Fall 2023	2023-09-04	23.6	71.5	54.1	60.9	79.8	C	True	64.933333
3	S0001	2023	Information Technology	F	South	Father	True	C04	Linear Algebra	Spring 2024	2024-01-15	4.7	78.3	71.4	55.3	60.1	C	True	62.266667
4	S0001	2023	Information Technology	F	South	Father	True	C05	Machine Learning	Fall 2023	2023-09-04	16.2	63.2	48.4	54.2	51.8	D	True	51.466667

What’s new in Pandas 3

This series is written against pandas 3, released in 2025, and two of its changes are worth understanding before going further, because they change what you see in everyday output.

Run df.dtypes and look at the text columns. In pandas 2 they showed up as object, the catch-all dtype for anything that isn’t a fixed-width number. Pandas 3 gives strings a dedicated str dtype instead, backed by PyArrow when it’s installed:

df.dtypes

student_id             str
cohort               int64
program                str
gender                 str
region                 str
guardian               str
has_internet          bool
course_id              str
course                 str
semester               str
enrollment_date        str
study_hours        float64
attendance_pct     float64
midterm_score      float64
final_score        float64
project_score      float64
final_grade            str
passed                bool
average_marks      float64
dtype: object

Key Concept: A dedicated str dtype, not object

object dtype could hold strings, but it could just as easily hold a mix of strings, numbers, and Python objects, with no guarantee which. The new str dtype can only hold strings or missing values, so a column typed str is a stronger guarantee than object ever was, and the PyArrow backing makes string operations faster. This is the same dtype the .str accessor in Sec. 3 operates on.
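To make the weaker guarantee of object concrete, here is a minimal sketch using a deliberately mixed column. The values are made up for illustration and are not from the course dataset. Mixed content still falls back to object, so this should behave the same on pandas 2 and pandas 3.

```python
import pandas as pd

# A made-up column that mixes a string, an integer, and a missing value.
mixed = pd.Series(["S0001", 42, None])

# Mixed content cannot be typed more precisely, so pandas uses object.
print(mixed.dtype)

# With object dtype, the only way to know what is inside is to check each element.
all_strings = mixed.dropna().map(lambda v: isinstance(v, str)).all()
print(f"every non-missing value is a string: {all_strings}")
```

A column that reports a dedicated string dtype needs no such element-by-element check.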

The second change doesn’t show up in any output; it changes how assignment behaves. Pandas 3 makes Copy-on-Write the only behaviour: selecting a subset of a DataFrame never lets you accidentally modify the original through it.

subset = df[["student_id", "average_marks"]]
subset.loc[0, "average_marks"] = 0.0
print(f"original unaffected: {df.loc[0, 'average_marks']}")
print(f"subset changed     : {subset.loc[0, 'average_marks']}")

original unaffected: 56.56666666666666
subset changed     : 0.0

Key Concept: Copy-on-Write removes the SettingWithCopyWarning guessing game

Before pandas 3, whether subset above was a view into df or an independent copy depended on details of how subset was created, the source of the infamous SettingWithCopyWarning. Pandas 3 makes every such selection behave as an independent copy: modifying subset never touches df. The warning is gone because the ambiguity it warned about is gone.
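If the same code also has to run on an older pandas, one conservative habit is to request the copy explicitly. The sketch below uses a made-up two-row frame (the name scores is hypothetical) and relies only on .copy(), which gives an independent object on every version.

```python
import pandas as pd

# Made-up frame standing in for the course data.
scores = pd.DataFrame({"student_id": ["S0001", "S0002"], "average_marks": [56.5, 71.0]})

# An explicit copy is independent of the original on every pandas version.
subset = scores[["student_id", "average_marks"]].copy()
subset.loc[0, "average_marks"] = 0.0

print(scores.loc[0, "average_marks"])  # original value is still there
print(subset.loc[0, "average_marks"])  # only the copy changed
```

Under pandas 3 the .copy() call is redundant but harmless, so the snippet reads the same either way.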

Pro Tip: pd.col() previews the expression style Polars uses

Pandas 3 adds pd.col("name") as an alternative to a lambda inside .assign(): df.assign(c=pd.col("a") + pd.col("b")) instead of df.assign(c=lambda d: d["a"] + d["b"]). It reads closer to the column-expression style used throughout Polars, covered in Chapter 8 of this series, and is worth recognising even though this notebook keeps using apply and lambda, the more widely used style today.

Learning objectives

By the end of this chapter you will be able to:

#	Skill	Covered in
1	Recognise pandas 3’s `str` dtype and Copy-on-Write behaviour	Sec. 0
2	Apply your own function to a Series or DataFrame with `map` and `apply`	Sec. 1
3	Count, filter, and encode categorical columns	Sec. 2
4	Clean and query text columns with the `.str` accessor	Sec. 3
5	Summarize a dataset with descriptive statistics in one call	Sec. 4
6	Compute rolling and expanding window statistics	Sec. 5a
7	Write readable multi-condition filters with `df.query()`	Sec. 5b

1. Function Application with `map` and `apply`

NumPy’s vectorized operations cover arithmetic and comparisons, but sometimes the transformation you need is your own logic: a lookup, a multi-branch rule, a calculation that combines several columns per row. map and apply are how pandas runs that logic.

Series.map() is for simple one-to-one substitution: pass it a dict (or a function) and it replaces every value with its mapped equivalent. Recoding the gender column’s short codes into full labels is a map problem, not an apply one:

gender_labels = {"M": "Male", "F": "Female"}
df["gender_label"] = df["gender"].map(gender_labels)
df[["gender", "gender_label"]].head(3)

	gender	gender_label
0	F	Female
1	F	Female
2	F	Female
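One detail worth knowing about dict-based map: a value with no entry in the dict becomes missing rather than raising an error. The sketch below uses made-up codes, including a hypothetical "X" that the lookup table does not cover.

```python
import pandas as pd

# Made-up codes, including one ("X") that is missing from the lookup table.
codes = pd.Series(["M", "F", "X"])
labels = {"M": "Male", "F": "Female"}

mapped = codes.map(labels)
print(mapped.isna().tolist())

# One way to keep unmapped values visible instead of silently missing.
mapped_filled = mapped.fillna("Unknown")
print(mapped_filled.tolist())
```

Checking mapped.isna().sum() right after a recode is a cheap way to catch a lookup table that is missing a category.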

Series.map() transforms one column. For element-wise operations across all columns at once, use DataFrame.map(). It was introduced in pandas 2.1 as the replacement for applymap(), which was deprecated in the same release: same signature, same semantics, new name.

# Round every numeric cell to 1 decimal place
numeric_cols = df.select_dtypes("number").columns.tolist()
rounded = df[numeric_cols].map(lambda x: round(x, 1) if pd.notna(x) else x)
rounded.head(3)

	cohort	study_hours	attendance_pct	midterm_score	final_score	project_score	average_marks
0	2023	10.8	88.6	49.9	58.2	61.6	56.6
1	2023	16.1	71.5	48.7	63.7	65.8	59.4
2	2023	23.6	71.5	54.1	60.9	79.8	64.9

Common Mistake: Using applymap() from an older tutorial

df.applymap() was removed in pandas 3.0. Any code or tutorial older than 2024 that uses applymap will raise AttributeError. The fix is a one-word rename: df.applymap(fn) becomes df.map(fn).

Key Concept: map is for substitution, apply is for logic

map expects a dict or a simple function and only ever looks at one value at a time. apply accepts any function, including one with branching logic, and can run over a Series (one value at a time) or a DataFrame (one row or one column at a time, depending on axis). Reach for map first; reach for apply when the rule doesn’t fit in a lookup table.

Converting a mark (0 to 100) into a letter grade needs a multi-branch rule, a job for apply with a plain Python function:

def letter_grade(mark: float) -> str:
    if mark >= 80:
        return "A"
    elif mark >= 60:
        return "B"
    elif mark >= 40:
        return "C"
    else:
        return "D"


df["grade"] = df["average_marks"].apply(letter_grade)
df[["student_id", "average_marks", "grade"]].head()

	student_id	average_marks	grade
0	S0001	56.566667	C
1	S0001	59.400000	C
2	S0001	64.933333	B
3	S0001	62.266667	B
4	S0001	51.466667	C
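When the branches are plain numeric ranges, pd.cut is one vectorized alternative to a Python function. This is a sketch, not the notebook's method: it reuses the 40/60/80 thresholds from letter_grade on a handful of made-up marks chosen to sit on the boundaries.

```python
import numpy as np
import pandas as pd

# Made-up marks on a 0 to 100 scale, chosen to sit on the boundaries.
marks = pd.Series([39.9, 40.0, 59.9, 60.0, 80.0, 100.0])

# right=False makes each bin include its lower edge: [40, 60) is "C", and so on.
grades = pd.cut(
    marks,
    bins=[-np.inf, 40, 60, 80, np.inf],
    labels=["D", "C", "B", "A"],
    right=False,
)
print(grades.tolist())
```

The result is a categorical column, which suits a small fixed set of grades.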

np.where handles one condition. For multiple ordered conditions with different outcomes, Series.case_when() (pandas 2.2+) reads like a SQL CASE statement: check cases top-to-bottom and return the value for the first match. Rows that match no case keep their original value, so the caselist below ends with a catch-all condition.

df["performance"] = df["midterm_score"].case_when(
    caselist=[
        (df["midterm_score"] >= 85, "High"),
        (df["midterm_score"] >= 70, "Medium"),
        (df["midterm_score"] >= 55, "Low"),
        (df["midterm_score"] >= 0, "At Risk"),  # catch-all default
    ],
)
df[["midterm_score", "performance"]].head(8)

	midterm_score	performance
0	49.9	At Risk
1	48.7	At Risk
2	54.1	At Risk
3	71.4	Medium
4	48.4	At Risk
5	67.0	Low
6	93.6	High
7	63.1	Low

Pro Tip: case_when() for three or more branches instead of chained np.where()

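case_when() evaluates conditions in order and stops at the first match, like if/elif/else. It has no default argument: a row where no condition matched keeps its original value, which is why the example ends with a catch-all condition. A missing midterm_score fails every comparison, including the catch-all, so it stays missing. Use case_when() instead of chained np.where() calls whenever you have three or more branches.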
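If an explicit default is wanted, NumPy's np.select is an alternative that takes one as a keyword argument. The sketch below applies the same thresholds to a few made-up scores, one of them missing. It illustrates the first-match-wins idea and is not a drop-in replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Made-up midterm scores, including one missing value.
midterm = pd.Series([49.9, 71.4, 67.0, 93.6, np.nan])

conditions = [midterm >= 85, midterm >= 70, midterm >= 55]
choices = ["High", "Medium", "Low"]

# The first matching condition wins; rows that match nothing get the default.
performance = pd.Series(
    np.select(conditions, choices, default="At Risk"), index=midterm.index
)
print(performance.tolist())
```

Note that the missing score fails every comparison and therefore receives the default label, which may or may not be what a report should show.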

DataFrame.apply(..., axis=1) runs a function once per row, with the whole row available as a Series. Use it when the rule needs more than one column at a time, for example flagging students who are both low-scoring and without internet access:

def at_risk(row: pd.Series) -> bool:
    return row["average_marks"] < 40 and not row["has_internet"]


df["at_risk"] = df.apply(at_risk, axis=1)
df["at_risk"].sum()

np.int64(22)

Common Mistake: Reaching for apply(axis=1) when a vectorized operation already does the job

df.apply(at_risk, axis=1) calls a Python function once per row, 2,400 times here, which is far slower than an equivalent boolean mask: (df["average_marks"] < 40) & ~df["has_internet"] computes the same result with NumPy operating on whole columns at once. Use apply when the rule can’t be written with column-wise operations and comparisons; reach for masking first.
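A quick way to gain confidence in a vectorized rewrite is to run both versions on a tiny frame and compare. The sketch below uses made-up rows that cover every combination of low or high marks and internet access, with the same rule as at_risk.

```python
import pandas as pd

# Made-up rows covering each combination of low/high marks and internet access.
toy = pd.DataFrame(
    {
        "average_marks": [35.0, 35.0, 72.0, 72.0],
        "has_internet": [False, True, False, True],
    }
)

# Row-wise version: one Python function call per row.
row_wise = toy.apply(
    lambda row: row["average_marks"] < 40 and not row["has_internet"], axis=1
)

# Vectorized version: two column-wise operations combined with &.
vectorized = (toy["average_marks"] < 40) & ~toy["has_internet"]

print(row_wise.tolist())
print(vectorized.tolist())
```

Only the first row is both low-scoring and without internet, and both versions agree on that.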

Activity 1 - Grade Distribution

Write a function that returns True for marks of 60 or higher and False otherwise, apply it to average_marks to create a new column passed_60, then print how many students passed. The dataset already ships a passed column, so the new column gets its own name rather than overwriting it.

def passed_threshold(mark: float) -> bool:
    ...


df["passed_60"] = df["average_marks"].apply(passed_threshold)
df["passed_60"].sum()

# TODO: write passed_threshold, apply it, and print the count of students who passed
...

2. Unique Values, Value Counts, and Membership

.unique(), .value_counts(), and .isin() are usually the first three calls worth making on any categorical column, before any filtering or grouping.

df["program"].unique()

<ArrowStringArray>
['Information Technology', 'Data Science', 'Engineering', 'Computer Science']
Length: 4, dtype: str

.value_counts() counts how many rows fall into each category, sorted from most to least common. Passing normalize=True turns the counts into proportions, which is what you want for a question like “what fraction of students passed?”:

Example: Pass rate from value_counts

df["passed"].value_counts(normalize=True) answers the pass-rate question in a single line, no manual division required.

df["passed"].value_counts(normalize=True)

passed
True     0.951667
False    0.048333
Name: proportion, dtype: float64
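By default value_counts leaves missing values out, so the proportions describe only the rows that have a value. The sketch below, on a made-up grade column with one missing entry, shows how dropna=False changes the denominators.

```python
import pandas as pd

# Made-up grades with one missing entry.
grades = pd.Series(["A", "A", "B", None])

# Default: the missing value is left out, so shares are of non-missing rows.
known_only = grades.value_counts(normalize=True)

# dropna=False: the missing value gets its own row and its own share.
with_missing = grades.value_counts(normalize=True, dropna=False)

print(known_only)
print(with_missing)
```

Which version is right depends on the question: a rate among known outcomes, or a share of all rows.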

.isin() filters rows whose value is in a given list, the categorical equivalent of a boolean comparison. Filtering to the two largest programs combines .value_counts() and .isin() directly:

top_programs = df["program"].value_counts().head(2).index
df_top_programs = df[df["program"].isin(top_programs)]
df_top_programs["program"].value_counts()

program
Data Science        828
Computer Science    726
Name: count, dtype: int64

Pro Tip: .nunique() before .unique() on a column you have not seen yet

.nunique() returns just the count of distinct values. Running it before .unique() tells you whether printing every unique value is even a reasonable idea, useful for a column that turns out to have 4 categories versus one that turns out to have 4,000.

A column with a small, fixed set of values is a candidate for the category dtype: pandas stores each value once and keeps a compact integer code per row instead of repeating the full string. That describes gender, program, and guardian here:

df["program"] = df["program"].astype("category")
df["program"].dtype

CategoricalDtype(categories=['Computer Science', 'Data Science', 'Engineering',
                  'Information Technology'],
, ordered=False, categories_dtype=str)

Most ML models need numbers, not category labels. pd.get_dummies() one-hot encodes a categorical column into one binary column per category, the standard first step before fitting a model on top of this data:

program_dummies = pd.get_dummies(df["program"], prefix="program")
program_dummies.head()

	program_Computer Science	program_Data Science	program_Engineering	program_Information Technology
0	False	False	False	True
1	False	False	False	True
2	False	False	False	True
3	False	False	False	True
4	False	False	False	True

Key Concept: category dtype for memory, one-hot encoding for models

.astype("category") is about storage and speed: it doesn’t change what a column means, only how compactly pandas stores it. pd.get_dummies() is about preparing data for a model that expects numeric input: it turns one categorical column into several binary columns. The two are often used together, category dtype while exploring, dummy columns right before training.
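Both halves of that idea can be checked on a synthetic column. The sketch below repeats the four program names 2,500 times (made-up data), compares memory_usage(deep=True) before and after the conversion, and one-hot encodes with dtype=int to get 0/1 columns rather than booleans.

```python
import pandas as pd

# Made-up column: four program names repeated many times.
programs = pd.Series(
    ["Information Technology", "Data Science", "Engineering", "Computer Science"] * 2500
)

as_text = programs.memory_usage(deep=True)
as_category = programs.astype("category").memory_usage(deep=True)
print(f"text: {as_text:,} bytes, category: {as_category:,} bytes")

# One-hot encode; dtype=int gives 0/1 columns instead of True/False.
dummies = pd.get_dummies(programs, prefix="program", dtype=int)
print(dummies.head(4))
```

The exact byte counts depend on the pandas version and string backend, so treat the printed sizes as indicative only.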

Activity 2 - Guardian Breakdown

Print the value counts for the guardian column, then filter the DataFrame to students whose guardian is “Mother” or “Father” using .isin(), and print how many rows remain. The match is case-sensitive, and the data stores these values capitalized (the first rows show Father).

df["guardian"].value_counts()

parents = df[df["guardian"].isin(["Mother", "Father"])]
len(parents)

# TODO: print value counts for guardian, then filter to Mother/Father and print row count
...

3. Working with Text Columns: the `.str` Accessor

student_id is a string column, and pandas keeps every string method behind a .str accessor rather than directly on the Series. The accessor exists because a Series can hold any dtype: .str is what tells pandas you specifically want the string-handling behaviour.

Key Concept: .str is a namespace: call string methods on the whole column at once

df['col'].str.upper() applies upper() to every element without a Python loop. The .str prefix is required: df['col'].upper() raises an AttributeError because upper isn’t a Series method. Common methods: .str.strip() for whitespace, .str.contains(pattern) for search, .str.extract(r'(pattern)') for regex capture groups.

Common Mistake: Calling a string method directly on a Series

df["student_id"].upper() raises an AttributeError: Series has no upper method. The string methods live on df["student_id"].str, not on the Series itself, because the Series is a general-purpose container that happens to hold strings here.

df["student_id"].str.upper().head(3)

0    S0001
1    S0001
2    S0001
Name: student_id, dtype: str

.str.len() gives the length of every string at once, useful for spotting malformed values before they cause problems downstream:

id_lengths = df["student_id"].str.len()
id_lengths.value_counts()

student_id
5    2400
Name: count, dtype: int64
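The same idea catches stray whitespace, the kind of problem raised at the start of the chapter. The sketch below uses made-up course names (not the dataset's values) and flags any entry that changes when .str.strip() is applied.

```python
import pandas as pd

# Made-up course names; two of them carry stray whitespace.
courses = pd.Series(["Statistics", "Statistics ", " Linear Algebra", "Machine Learning"])

# A value is dirty if stripping whitespace changes it.
cleaned = courses.str.strip()
dirty = courses[courses != cleaned]
print(dirty.tolist())

# After cleaning, the two spellings of "Statistics" collapse into one category.
print(cleaned.nunique())
```

Comparing nunique() before and after cleaning is a quick signal of how many phantom categories the whitespace had created.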

.str.extract() pulls a regex capture group out of every value, the cleanest way to turn a structured string column into a usable numeric one. Every student_id here is the letter S followed by four digits (S0001–S0400), so extracting just the digits gives a numeric ID. First confirm that every ID really follows that pattern:

df["student_id"].str.match(r"^S\d{4}$").all()

np.True_

Then extract the digits and verify the transformation on a sample of rows:

numeric_id = df["student_id"].str.extract(r"S(\d+)")[0].astype("Int64")
df["numeric_id"] = numeric_id
df[["student_id", "numeric_id"]].head(3)

	student_id	numeric_id
0	S0001	1
1	S0001	1
2	S0001	1
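On messier data some IDs will not match the pattern, and .str.extract() returns a missing value for those rows. The sketch below uses made-up IDs with one deliberately malformed entry, and goes through pd.to_numeric before the nullable Int64 dtype, which is one conservative way to convert and not the only one.

```python
import pandas as pd

# Made-up IDs; the last one does not follow the S + four digits pattern.
ids = pd.Series(["S0001", "S0042", "X-17"])

# expand=False returns a Series when there is a single capture group.
digits = ids.str.extract(r"^S(\d{4})$", expand=False)

# Non-matching rows come back missing; Int64 (capital I) can hold them as <NA>.
numeric = pd.to_numeric(digits, errors="coerce").astype("Int64")
print(numeric.tolist())
```

Anchoring the pattern with ^ and $ keeps partially matching strings from slipping through as valid IDs.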

Activity 3 - Validate and Filter by ID

Use .str.startswith("S") to confirm every student_id starts with the letter “S”, then use .str.contains() to count how many contain the digit “0” anywhere in the ID.

df["student_id"].str.startswith("S").all()
df["student_id"].str.contains("0").sum()

# TODO: confirm every student_id starts with "S", then count IDs containing "0"
...

4. Descriptive Statistics and Summarization

.describe() from Chapter 1 summarizes every numeric column with one call. .agg() goes a step further: it computes a chosen list of statistics for a chosen set of columns, in whatever combination you ask for.

Key Concept: .describe() is a first look, not a conclusion

.describe() shows count, mean, std, min, 25th/50th/75th percentile, and max. It skips non-numeric columns by default. .agg(['mean', 'std', 'min', 'max']) gives you the same numbers for only the columns and statistics you care about, in a table you can compare at a glance. .corr() is Pearson correlation by default: it measures linear relationship, not causation.

df[["midterm_score", "final_score", "project_score"]].describe()

	midterm_score	final_score	project_score
count	2330.000000	2400.000000	2400.000000
mean	60.518541	59.496958	64.963833
std	13.455787	15.348690	11.096192
min	15.900000	10.000000	18.600000
25%	51.200000	48.900000	57.650000
50%	60.500000	60.000000	65.050000
75%	70.000000	70.000000	72.600000
max	100.000000	100.000000	100.000000

.agg() accepts a list of function names and applies every one of them to every selected column, returning a single small table instead of one Series per statistic:

df[["midterm_score", "final_score", "project_score"]].agg(["mean", "std", "min", "max"])

	midterm_score	final_score	project_score
mean	60.518541	59.496958	64.963833
std	13.455787	15.348690	11.096192
min	15.900000	10.000000	18.600000
max	100.000000	100.000000	100.000000
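.agg() also accepts a dict when different columns need different statistics. The sketch below uses a made-up four-row frame; cells for statistics that were not requested for a column come back as NaN.

```python
import pandas as pd

# Made-up scores for four students.
toy = pd.DataFrame(
    {
        "midterm_score": [50.0, 60.0, 70.0, 80.0],
        "final_score": [55.0, 65.0, 75.0, 85.0],
    }
)

# A dict asks for different statistics per column.
summary = toy.agg({"midterm_score": ["mean", "max"], "final_score": ["min"]})
print(summary)
```

The list form is usually easier to scan; the dict form is handy when a statistic only makes sense for some columns.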

.corr() computes the pairwise correlation between numeric columns, a quick way to see whether strong performance in one subject tends to come with strong performance in another:

df[["midterm_score", "final_score", "project_score"]].corr()

	midterm_score	final_score	project_score
midterm_score	1.000000	0.205291	0.132373
final_score	0.205291	1.000000	0.196623
project_score	0.132373	0.196623	1.000000

Prefer one .agg([...]) call over separate .mean(), .std(), .min(), and .max() calls. The aggregation table is easier to scan and keeps related numbers in one output.

Pro Tip: Use groupby().transform() to add a group statistic back to the original rows

groupby().agg() reduces a DataFrame to one row per group. groupby().transform() does the opposite: it computes the same group statistic but returns a Series the exact same length as the original DataFrame, aligned to the original index, so you can assign it as a new column without a merge:

# agg → one row per program
df.groupby("program")["average_marks"].agg("mean")  # shape (4,)

# transform → one row per student, aligned to original index
df["program_mean_marks"] = df.groupby("program")["average_marks"].transform("mean")
# every student's row now has their program's mean, ready for feature engineering

Common use: normalising within groups. (df["average_marks"] - df["program_mean_marks"]) / df.groupby("program")["average_marks"].transform("std") z-scores each student relative to their program, not across the whole dataset, in two lines without any merge.
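The within-group z-score can be verified by hand on a tiny frame. The sketch below uses made-up marks for two programs whose averages differ by 30 points; the column name z_in_program is hypothetical.

```python
import pandas as pd

# Made-up marks for two programs with very different averages.
toy = pd.DataFrame(
    {
        "program": ["Engineering"] * 3 + ["Data Science"] * 3,
        "average_marks": [40.0, 50.0, 60.0, 70.0, 80.0, 90.0],
    }
)

grouped = toy.groupby("program")["average_marks"]

# transform returns one value per original row, so plain column arithmetic works.
toy["z_in_program"] = (toy["average_marks"] - grouped.transform("mean")) / grouped.transform("std")
print(toy)
```

Both programs end up with z-scores of -1, 0, and 1, even though their raw marks do not overlap. That is the point of normalising within the group.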

Activity 4 - Subject Spread

Use .agg() to compute the mean and standard deviation of midterm_score, final_score, and project_score for students with has_internet == True only.

with_internet = df[df["has_internet"]]
with_internet[["midterm_score", "final_score", "project_score"]].agg(["mean", "std"])

# TODO: agg mean and std for the three mark columns, restricted to has_internet == True
...

5. Window Operations and Expressive Queries

Two operations that come up constantly in data science pipelines but don’t fit neatly under “string methods” or “statistics”: rolling windows for time-aware aggregation, and df.query() for readable multi-condition filters.

5a. Rolling and Expanding Windows

rolling(n) groups each row with the n-1 rows before it and applies an aggregation. It is the standard tool for smoothing noisy signals, computing moving averages, or creating features that capture recent trend for a time-ordered dataset.

expanding() is the cumulative version: each row aggregates everything from the start of the series up to that point. The first row is its own mean; the second is the mean of rows 1-2; and so on.

Figure: Rolling window of size 3 sliding over five days of scores. Window 1 covers days 1-3 (mean 80), window 2 days 2-4 (mean 76.7), and window 3 days 3-5 (mean 75). The first two results are NaN because fewer than three values are available.

# Sort by average_marks to give the rolling window a meaningful order here
df_sorted = df.sort_values("average_marks").reset_index(drop=True)

# 5-row rolling mean: smooth out noise
df_sorted["marks_rolling5"] = df_sorted["average_marks"].rolling(window=5).mean()

# expanding mean: cumulative average as we move through sorted students
df_sorted["marks_cumulative"] = df_sorted["average_marks"].expanding().mean()

df_sorted[["student_id", "average_marks", "marks_rolling5", "marks_cumulative"]].head(8)

	student_id	average_marks	marks_rolling5	marks_cumulative
0	S0047	31.100000	NaN	31.100000
1	S0293	32.300000	NaN	31.700000
2	S0291	33.033333	NaN	32.144444
3	S0248	34.800000	NaN	32.808333
4	S0161	36.233333	33.493333	33.493333
5	S0332	36.533333	34.580000	34.000000
6	S0314	36.733333	35.466667	34.390476
7	S0089	36.800000	36.220000	34.691667

Key Concept: Rolling windows require ordered data to be meaningful

rolling(5) looks at each row together with the 4 rows immediately before it, in whatever order they sit in the DataFrame. On unsorted data the window picks up arbitrary rows and produces meaningless averages. Always sort by the relevant axis (time, score, sequence number) before calling rolling(). The NaN values in the first n-1 rows are correct: there isn’t enough history to fill those windows yet.
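The leading NaN values can be traded for partial windows with min_periods. The sketch below compares a strict window, a lenient one, and expanding() on five made-up daily scores; the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Five made-up daily scores, already in time order.
scores = pd.Series([80.0, 90.0, 70.0, 70.0, 85.0])

strict = scores.rolling(window=3).mean()                  # NaN until 3 values exist
lenient = scores.rolling(window=3, min_periods=1).mean()  # uses whatever is available
cumulative = scores.expanding().mean()                    # everything up to this row

print(pd.DataFrame({"strict": strict, "lenient": lenient, "cumulative": cumulative}))
```

The lenient version has no gaps, but its first values average fewer points than later ones, so they are noisier than they look.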

In time-series work, the pattern is almost always: sort by date, group by entity, then compute a rolling stat per group. The groupby().transform() tip from Section 4 applies here too:

# Per-program 10-student rolling mean of average_marks
# transform keeps the result aligned to the original index
df["marks_prog_rolling10"] = (
    df.sort_values("average_marks")
    .groupby("program")["average_marks"]
    .transform(lambda s: s.rolling(10, min_periods=3).mean())
)
df[["program", "average_marks", "marks_prog_rolling10"]].dropna().head(6)

	program	average_marks	marks_prog_rolling10
0	Information Technology	56.566667	56.320000
1	Information Technology	59.400000	59.156667
2	Information Technology	64.933333	64.813333
3	Information Technology	62.266667	61.966667
4	Information Technology	51.466667	51.230000
5	Information Technology	71.266667	70.826667
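For the date-ordered case described above, the same pattern applies with a date column as the sort key. The sketch below uses a made-up quiz log for two students, deliberately stored out of order; the column names date, score, and score_roll2 are hypothetical.

```python
import pandas as pd

# Made-up quiz scores for two students, deliberately out of date order.
quiz = pd.DataFrame(
    {
        "student_id": ["S0001", "S0002", "S0001", "S0002", "S0001"],
        "date": pd.to_datetime(
            ["2024-01-22", "2024-01-15", "2024-01-15", "2024-01-22", "2024-01-29"]
        ),
        "score": [70.0, 50.0, 60.0, 90.0, 80.0],
    }
)

# Sort by date, group by student, then roll within each student.
quiz["score_roll2"] = (
    quiz.sort_values("date")
    .groupby("student_id")["score"]
    .transform(lambda s: s.rolling(2).mean())
)
print(quiz.sort_values(["student_id", "date"]))
```

Because the assignment aligns on the index, the rows of quiz keep their original order while each value reflects that student's date-sorted history.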

5b. Expressive Filtering with `df.query()`

Boolean masks work well for one or two conditions. For complex multi-condition filters, df.query() expresses the same logic as a readable string, closer to how you would say it out loud.

Reference an external variable inside a query string with the @ prefix:

# Boolean mask version: correct but harder to read
mask = (df["midterm_score"] > 70) & (df["final_score"] > 70) & (df["program"].isin(["Engineering", "Sciences"]))
print(f"mask result  : {mask.sum()} rows")

# Same filter as a query string: reads like a sentence
threshold = 70
result = df.query("midterm_score > @threshold and final_score > @threshold and program in ['Engineering', 'Sciences']")
print(f"query result : {len(result)} rows")

mask result  : 10 rows
query result : 10 rows

Pro Tip: Use query() for readability, masks for dynamic conditions

df.query() is evaluated with numexpr when the library is installed, which can make it faster than the equivalent boolean mask on large DataFrames. The downside: the query string is harder to build programmatically. Use query() when you’re writing a fixed filter that a reader should understand at a glance; use a mask when the conditions are assembled at runtime from user input or a config.
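As a sketch of the runtime-assembled case, the snippet below builds one mask per rule from a dict of made-up minimum scores (standing in for a config file) and combines them with functools.reduce. The fixed query string alongside it is only there to show that both routes select the same rows.

```python
from functools import reduce
import operator

import pandas as pd

# Made-up rows and made-up thresholds, as if read from a config file.
toy = pd.DataFrame(
    {
        "midterm_score": [65.0, 75.0, 80.0, 90.0],
        "final_score": [80.0, 60.0, 85.0, 95.0],
    }
)
minimums = {"midterm_score": 70, "final_score": 70}

# Build one boolean mask per rule, then AND them all together.
masks = [toy[column] > minimum for column, minimum in minimums.items()]
combined = reduce(operator.and_, masks)

by_mask = toy[combined]
by_query = toy.query("midterm_score > 70 and final_score > 70")
print(len(by_mask), len(by_query))
```

Adding a rule to the dict extends the filter without touching the filtering code, which is awkward to do safely with string concatenation.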

Activity 5 - Smoothed Grade Distribution

Sort the full DataFrame by average_marks ascending. Compute a 20-student rolling mean of average_marks. Then use df.query() to find students with a rolling mean above 75 who are in the “Data Science” program. Print how many rows match.

df_q = df.sort_values("average_marks").reset_index(drop=True)
df_q["rolling_mean"] = df_q["average_marks"].rolling(20).mean()
result = df_q.query("rolling_mean > 75 and program == 'Data Science'")
print(len(result))

# TODO: rolling mean, then query-filter to Data Science students above rolling threshold
...

Capstone: Risk and Performance Report

Combine everything from this notebook into one short report: a derived grade column, a categorical breakdown, a text-column check, and a summary statistic, the same operations used in any first pass over a new dataset.

Capstone Exercise - Risk and Performance Report

Confirm every student_id matches the pattern S followed by 4 digits, using .str.match() (Sec. 3)
Print the final_grade distribution as proportions with .value_counts(normalize=True) (Sec. 1, Sec. 2)
Use .agg() to compute the mean average_marks for passed vs failed students separately (Sec. 4)

valid_ids = df["student_id"].str.match(r"^S\d{4}$").all()

grade_distribution = df["final_grade"].value_counts(normalize=True)

passing = df.loc[df["passed"], "average_marks"].agg(["mean"])
failing = df.loc[~df["passed"], "average_marks"].agg(["mean"])

# TODO: build the risk and performance report described above
...

Resource	Why it matters
McKinney, W. (2022). Python for Data Analysis, 3rd ed. O’Reilly.	Chapter 7 (data cleaning) and Chapter 10 (aggregation) are the canonical references for the operations in this notebook
pandas documentation: Indexing and selecting data	Covers `.loc`, `.iloc`, boolean indexing, and `MultiIndex` in exhaustive detail
pandas documentation: Group by: split-apply-combine	Official guide to `groupby`, `transform`, and `apply`; the examples use the same column types as this notebook
Wickham, H. (2014). Tidy data. Journal of Statistical Software 59(10).	The conceptual framework behind `.melt()` and `.pivot_table()` introduced later in Chapter 10

Summary

Concept	Key rule
`str` dtype	Pandas 3’s default for text columns, backed by PyArrow, replacing `object`
Copy-on-Write	Selecting a subset always behaves as an independent copy; the original is never modified through it
`Series.map()`	One-to-one substitution with a dict or simple function
`Series.apply()`	Run any function, including multi-branch logic, one value at a time
`DataFrame.apply(axis=1)`	Run a function once per row, with the whole row available
Vectorized ops vs `apply`	Prefer a boolean mask or arithmetic when one exists; `apply` is the fallback, not the default
`.value_counts(normalize=True)`	Category proportions in one call
`.isin()`	Filter rows whose value is in a given list
`.astype("category")`	Compact storage for a column with a small, fixed set of values
`pd.get_dummies()`	One-hot encode a categorical column before fitting a model
`.str` accessor	Required for any string method on a Series; calling the method directly raises `AttributeError`
`.str.extract()`	Pull a regex capture group into a new column
`.agg([...])`	Compute several statistics for several columns in one call, instead of chaining single-statistic calls
`rolling(n)`	Compute a statistic over a sliding window of n rows; sort first so the window is meaningful
`expanding()`	Cumulative statistic from the start of the series to the current row
`groupby().transform(lambda s: s.rolling(...))`	Per-group rolling stat, result aligned to original index
`df.query("expr")`	Readable multi-condition filter; use `@var` to reference external variables

Next: 10-combining-reshaping.ipynb, covering concatenation, merging, groupby, pivot tables, and time series.
