Part 9: Pandas Operations

Open In Colab Download Notebook

DS-MLOps Data Analysis

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook continues directly from Part 8 (08-pandas-core.ipynb). It assumes you can read a CSV, select rows and columns with .loc/.iloc, and filter with a boolean mask. We reload the same dataset from scratch below, so you can run this notebook on its own.

Part 8 covered the dataset’s shape and structure. Part 9 covers the operations that turn raw columns into derived ones: applying your own functions row-by-row, counting and filtering categorical values, cleaning text columns, summarizing a dataset, and computing window statistics.

Part 10 (10-combining-reshaping.ipynb) continues with combining datasets, groupby, pivoting, and time series.

Topic Why it matters
Pandas 3: str dtype and Copy-on-Write Two behaviour changes that affect every example in this notebook and the ones before it
map and apply Run your own function over a Series or DataFrame when there is no built-in vectorized operation
Value counts, membership, and encoding Understand a categorical column, filter on a set of values, and prepare it for a model
The .str accessor String methods on a Series need .str; calling them directly is the most common error here
Descriptive statistics .describe() and .agg() summarize a dataset in one call instead of many
Rolling and expanding windows Compute moving averages and cumulative statistics; the foundation of time-series feature engineering
df.query() Express multi-condition filters as a readable string; faster with numexpr on large DataFrames

Callout markers used throughout this notebook are explained on the book cover page.

By the end of Part 9 you will be able to:

# Skill Covered in
1 Recognise pandas 3’s str dtype and Copy-on-Write behaviour Sec. 0
2 Apply your own function to a Series or DataFrame with map and apply Sec. 1
3 Count, filter, and encode categorical columns Sec. 2
4 Clean and query text columns with the .str accessor Sec. 3
5 Summarize a dataset with descriptive statistics in one call Sec. 4
6 Compute rolling and expanding window statistics Sec. 5a
7 Write readable multi-condition filters with df.query() Sec. 5b
import numpy as np
import pandas as pd

df = pd.read_csv("data/university_analytics.csv")
df["average_marks"] = (df["midterm_score"] + df["final_score"] + df["project_score"]) / 3
df.head()
student_id cohort program gender region guardian has_internet course_id course semester enrollment_date study_hours attendance_pct midterm_score final_score project_score final_grade passed average_marks
0 S0001 2023 Information Technology F South Father True C01 Python Programming Fall 2023 2023-09-04 10.8 88.6 46.1 54.4 57.8 D True 52.766667
1 S0001 2023 Information Technology F South Father True C02 Statistics Spring 2024 2024-01-15 16.1 71.5 49.9 64.9 67.0 C True 60.600000
2 S0001 2023 Information Technology F South Father True C03 Data Structures Fall 2023 2023-09-04 23.6 71.5 57.3 64.1 83.0 C True 68.133333
3 S0001 2023 Information Technology F South Father True C04 Linear Algebra Spring 2024 2024-01-15 4.7 78.3 75.6 59.5 64.3 C True 66.466667
4 S0001 2023 Information Technology F South Father True C05 Machine Learning Fall 2023 2023-09-04 16.2 63.2 53.6 59.4 57.0 C True 56.666667

0. What’s New in Pandas 3

This series is written against pandas 3, released in 2025, and two of its changes are worth understanding before going further, because they change what you see in everyday output.

Run df.dtypes and look at the text columns. In pandas 2 they showed up as object, the catch-all dtype for anything that is not a fixed-width number. Pandas 3 gives strings a dedicated str dtype instead, backed by PyArrow when it is installed:

df.dtypes
student_id             str
cohort               int64
program                str
gender                 str
region                 str
guardian               str
has_internet          bool
course_id              str
course                 str
semester               str
enrollment_date        str
study_hours        float64
attendance_pct     float64
midterm_score      float64
final_score        float64
project_score      float64
final_grade            str
passed                bool
average_marks      float64
dtype: object

Key Concept: A dedicated str dtype, not object

object dtype could hold strings, but it could just as easily hold a mix of strings, numbers, and Python objects, with no guarantee which. The new str dtype can only hold strings or missing values, so a column typed str is a stronger guarantee than object ever was, and the PyArrow backing makes string operations faster. This is the same dtype the .str accessor in Sec. 3 operates on.

The second change does not show up in any output, it changes how assignment behaves. Pandas 3 makes Copy-on-Write the only behaviour: selecting a subset of a DataFrame never lets you accidentally modify the original through it.

subset = df[["student_id", "average_marks"]]
subset.loc[0, "average_marks"] = 0.0
print(f"original unaffected: {df.loc[0, 'average_marks']}")
print(f"subset changed     : {subset.loc[0, 'average_marks']}")
original unaffected: 52.76666666666667
subset changed     : 0.0

Key Concept: Copy-on-Write removes the SettingWithCopyWarning guessing game

Before pandas 3, whether subset above was a view into df or an independent copy depended on details of how subset was created, the source of the infamous SettingWithCopyWarning. Pandas 3 makes every such selection behave as an independent copy: modifying subset never touches df. The warning is gone because the ambiguity it warned about is gone.

Pro Tip: pd.col() previews the expression style Polars uses

Pandas 3 adds pd.col(“name”) as an alternative to a lambda inside .assign(): df.assign(c=pd.col(“a”) + pd.col(“b”)) instead of df.assign(c=lambda d: d[“a”] + d[“b”]). It reads closer to the column-expression style used throughout Polars, covered in Part 11 of this series, and is worth recognising even though this notebook keeps using apply and lambda, the more widely used style today.

1. Function Application with map and apply

NumPy’s vectorized operations cover arithmetic and comparisons, but sometimes the transformation you need is your own logic: a lookup, a multi-branch rule, a calculation that combines several columns per row. map and apply are how pandas runs that logic.

Series.map() is for simple one-to-one substitution: pass it a dict (or a function) and it replaces every value with its mapped equivalent. Recoding the gender column’s short codes into full labels is a map problem, not an apply one:

gender_labels = {"M": "Male", "F": "Female"}
df["gender_label"] = df["gender"].map(gender_labels)
df[["gender", "gender_label"]].head(3)
gender gender_label
0 F Female
1 F Female
2 F Female

Key Concept: map is for substitution, apply is for logic

map expects a dict or a simple function and only ever looks at one value at a time. apply accepts any function, including one with branching logic, and can run over a Series (one value at a time) or a DataFrame (one row or one column at a time, depending on axis). Reach for map first; reach for apply when the rule does not fit in a lookup table.

Converting a normalized mark (0 to 1) into a letter grade needs a multi-branch rule, a job for apply with a plain Python function:

def letter_grade(mark: float) -> str:
    if mark >= 80:
        return "A"
    elif mark >= 60:
        return "B"
    elif mark >= 40:
        return "C"
    else:
        return "D"


df["grade"] = df["average_marks"].apply(letter_grade)
df[["student_id", "average_marks", "grade"]].head()
student_id average_marks grade
0 S0001 52.766667 C
1 S0001 60.600000 B
2 S0001 68.133333 B
3 S0001 66.466667 B
4 S0001 56.666667 C

DataFrame.apply(..., axis=1) runs a function once per row, with the whole row available as a Series. Use it when the rule needs more than one column at a time, for example flagging students who are both low-scoring and without internet access:

def at_risk(row: pd.Series) -> bool:
    return row["average_marks"] < 40 and not row["has_internet"]


df["at_risk"] = df.apply(at_risk, axis=1)
df["at_risk"].sum()
np.int64(12)

Common Mistake: Reaching for apply(axis=1) when a vectorized operation already does the job

df.apply(at_risk, axis=1) calls a Python function once per row, 17,190 times here, which is far slower than an equivalent boolean mask: (df[“average_marks”] < 0.4) & (df[“has_internet”] == 0) computes the same result with NumPy operating on whole columns at once. Use apply when the rule cannot be written with column-wise operations and comparisons; reach for masking first.

Activity 1 - Grade Distribution

Goal: Write a function that returns True for marks of 0.6 or higher and False otherwise, apply it to average_marks to create a new column passed, then print how many students passed.
def passed_threshold(mark: float) -> bool:
    ...


df["passed"] = df["average_marks"].apply(passed_threshold)
df["passed"].sum()
# -> 6011
# TODO: write passed_threshold, apply it, and print the count of students who passed
...

2. Unique Values, Value Counts, and Membership

.unique(), .value_counts(), and .isin() are usually the first three calls worth making on any categorical column, before any filtering or grouping.

df["program"].unique()
<ArrowStringArray>
['Information Technology', 'Data Science', 'Engineering', 'Computer Science']
Length: 4, dtype: str

.value_counts() counts how many rows fall into each category, sorted from most to least common. Passing normalize=True turns the counts into proportions, which is what you want for a question like “what fraction of students dropped out?”:

Example: Pass rate from value_counts

df[“passed”].value_counts(normalize=True) answers the pass-rate question in a single line, no manual division required.

df["passed"].value_counts(normalize=True)
passed
True     0.961667
False    0.038333
Name: proportion, dtype: float64

.isin() filters rows whose value is in a given list, the categorical equivalent of a boolean comparison. Filtering to the two largest programs combines .value_counts() and .isin() directly:

top_programs = df["program"].value_counts().head(2).index
df_top_programs = df[df["program"].isin(top_programs)]
df_top_programs["program"].value_counts()
program
Data Science        828
Computer Science    726
Name: count, dtype: int64

Pro Tip: .nunique() before .unique() on a column you have not seen yet

.nunique() returns just the count of distinct values. Running it before .unique() tells you whether printing every unique value is even a reasonable idea, useful for a column that turns out to have 4 categories versus one that turns out to have 4,000.

A column with a small, fixed set of values is a candidate for the category dtype: pandas stores each value once and keeps a compact integer code per row instead of repeating the full string, which is most of gender, program, and guardian here:

df["program"] = df["program"].astype("category")
df["program"].dtype
CategoricalDtype(categories=['Computer Science', 'Data Science', 'Engineering',
                  'Information Technology'],
, ordered=False, categories_dtype=str)

Most ML models need numbers, not category labels. pd.get_dummies() one-hot encodes a categorical column into one binary column per category, the standard first step before fitting a model on top of this data:

program_dummies = pd.get_dummies(df["program"], prefix="program")
program_dummies.head()
program_Computer Science program_Data Science program_Engineering program_Information Technology
0 False False False True
1 False False False True
2 False False False True
3 False False False True
4 False False False True

Key Concept: category dtype for memory, one-hot encoding for models

.astype(“category”) is about storage and speed: it does not change what a column means, only how compactly pandas stores it. pd.get_dummies() is about preparing data for a model that expects numeric input: it turns one categorical column into several binary columns. The two are often used together, category dtype while exploring, dummy columns right before training.

Activity 2 - Guardian Breakdown

Goal: Print the value counts for the guardian column, then filter the DataFrame to students whose guardian is “mother” or “father” using .isin(), and print how many rows remain.
df["guardian"].value_counts()

parents = df[df["guardian"].isin(["mother", "father"])]
len(parents)
# TODO: print value counts for guardian, then filter to mother/father and print row count
...

3. Working with Text Columns: the .str Accessor

student_id is a string column, and pandas keeps every string method behind a .str accessor rather than directly on the Series. The accessor exists because a Series can hold any dtype, .str is what tells pandas you specifically want the string-handling behaviour.

Common Mistake: Calling a string method directly on a Series

df[“student_id”].upper() raises an AttributeError: Series has no upper method. The string methods live on df[“student_id”].str, not on the Series itself, because the Series itself is a general-purpose container that happens to hold strings here.

df["student_id"].str.upper().head(3)
0    S0001
1    S0001
2    S0001
Name: student_id, dtype: str

.str.len() gives the length of every string at once, useful for spotting malformed values before they cause problems downstream:

id_lengths = df["student_id"].str.len()
id_lengths.value_counts()
student_id
5    2400
Name: count, dtype: int64

.str.extract() pulls a regex capture group out of every value, the cleanest way to turn a structured string column into a usable numeric one. Every student_id here is the letter S followed by four digits (S0001S0400), so extracting just the digits gives a numeric ID:

df["student_id"].str.match(r"^S\d{4}$").all()
np.True_
numeric_id = df["student_id"].str.extract(r"S(\d+)")[0].astype("Int64")
df["numeric_id"] = numeric_id
df[["student_id", "numeric_id"]].head(3)
student_id numeric_id
0 S0001 1
1 S0001 1
2 S0001 1

Activity 3 - Validate and Filter by ID

Goal: Use .str.startswith(“S”) to confirm every student_id starts with the letter “S”, then use .str.contains() to count how many contain the digit “0” anywhere in the ID.
df["student_id"].str.startswith("S").all()
df["student_id"].str.contains("0").sum()
# TODO: confirm every student_id starts with "s", then count IDs containing "0"
...

4. Descriptive Statistics and Summarization

.describe() from Part 1 summarizes every numeric column with one call. .agg() goes a step further: it computes a chosen list of statistics for a chosen set of columns, in whatever combination you ask for.

df[["midterm_score", "final_score", "project_score"]].describe()
midterm_score final_score project_score
count 2330.000000 2400.000000 2400.000000
mean 60.846481 59.819500 65.280292
std 13.019003 15.076345 10.518972
min 17.000000 10.000000 15.000000
25% 51.725000 49.800000 58.300000
50% 60.850000 60.150000 65.400000
75% 69.600000 70.000000 72.500000
max 100.000000 100.000000 100.000000

.agg() accepts a list of function names and applies every one of them to every selected column, returning a single small table instead of one Series per statistic:

df[["midterm_score", "final_score", "project_score"]].agg(["mean", "std", "min", "max"])
midterm_score final_score project_score
mean 60.846481 59.819500 65.280292
std 13.019003 15.076345 10.518972
min 17.000000 10.000000 15.000000
max 100.000000 100.000000 100.000000

.corr() computes the pairwise correlation between numeric columns, a quick way to see whether strong performance in one subject tends to come with strong performance in another:

df[["midterm_score", "final_score", "project_score"]].corr()
midterm_score final_score project_score
midterm_score 1.000000 0.166264 0.058005
final_score 0.166264 1.000000 0.143973
project_score 0.058005 0.143973 1.000000

Pro Tip: .agg() over chaining several single-statistic calls

Five separate lines like df[“midterm_score”].mean(), .std(), .min() read fine in isolation but scatter related numbers across separate outputs. One .agg([…]) call keeps them in a single table that is easier to read and to compare across columns.

Pro Tip: Use groupby().transform() to add a group statistic back to the original rows

groupby().agg() reduces a DataFrame to one row per group. groupby().transform() does the opposite: it computes the same group statistic but returns a Series the exact same length as the original DataFrame, aligned to the original index, so you can assign it as a new column without a merge:

# agg → one row per program
df.groupby("program")["average_marks"].agg("mean")  # shape (4,)

# transform → one row per student, aligned to original index
df["caste_mean_marks"] = df.groupby("program")["average_marks"].transform("mean")
# every student's row now has their caste's mean, ready for feature engineering

Common use: normalising within groups. (df[“average_marks”] - df[“caste_mean_marks”]) / df.groupby(“program”)[“average_marks”].transform(“std”) z-scores each student relative to their caste group, not across the whole dataset, in two lines without any merge.

Activity 4 - Subject Spread

Goal: Use .agg() to compute the mean and standard deviation of midterm_score, final_score, and project_score for students with has_internet == True only.
with_internet = df[df["has_internet"]]
with_internet[["midterm_score", "final_score", "project_score"]].agg(["mean", "std"])
# TODO: agg mean and std for the three mark columns, restricted to internet == 1
...

5. Window Operations and Expressive Queries

Two operations that come up constantly in DS pipelines but do not fit neatly under “string methods” or “statistics”: rolling windows for time-aware aggregation, and df.query() for readable multi-condition filters.

5a. Rolling and Expanding Windows

rolling(n) groups each row with the n-1 rows before it and applies an aggregation. It is the standard tool for smoothing noisy signals, computing moving averages, or creating features that capture recent trend for a time-ordered dataset.

expanding() is the cumulative version: each row aggregates everything from the start of the series up to that point. The first row is its own mean; the second is the mean of rows 1-2; and so on.

Rolling window (size=3): slides one step at a time

flowchart LR
    subgraph W1 ["window 1 (size=3)"]
        A1["day 1: 70"] --> A2["day 2: 80"] --> A3["day 3: 90"]
    end
    subgraph W2 ["window 2 (size=3)"]
        B1["day 2: 80"] --> B2["day 3: 90"] --> B3["day 4: 60"]
    end
    subgraph W3 ["window 3 (size=3)"]
        C1["day 3: 90"] --> C2["day 4: 60"] --> C3["day 5: 75"]
    end
    W1 -->|mean = 80.0| R1["result[2]"]
    W2 -->|mean = 76.7| R2["result[3]"]
    W3 -->|mean = 75.0| R3["result[4]"]

    style R1 fill:#EBF5F0,stroke:#059669,color:#065F46
    style R2 fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E
    style R3 fill:#F5F3FF,stroke:#7C3AED,color:#3B0764

# Sort by average_marks to give the rolling window a meaningful order here
df_sorted = df.sort_values("average_marks").reset_index(drop=True)

# 5-row rolling mean: smooth out noise
df_sorted["marks_rolling5"] = df_sorted["average_marks"].rolling(window=5).mean()

# expanding mean: cumulative average as we move through sorted students
df_sorted["marks_cumulative"] = df_sorted["average_marks"].expanding().mean()

df_sorted[["student_id", "average_marks", "marks_rolling5", "marks_cumulative"]].head(8)
student_id average_marks marks_rolling5 marks_cumulative
0 S0047 34.300000 NaN 34.300000
1 S0291 35.233333 NaN 34.766667
2 S0013 35.533333 NaN 35.022222
3 S0012 37.366667 NaN 35.608333
4 S0293 37.500000 35.986667 35.986667
5 S0354 37.600000 36.646667 36.255556
6 S0102 38.433333 37.286667 36.566667
7 S0382 38.466667 37.873333 36.804167

Key Concept: Rolling windows require ordered data to be meaningful

rolling(5) looks at the 5 rows immediately before each row in whatever order they sit in the DataFrame. On unsorted data the window picks up random rows and produces meaningless averages. Always sort by the relevant axis (time, score, sequence number) before calling rolling(). The NaN values in the first n-1 rows are correct: there is not enough history to fill those windows yet.

In time-series work, the pattern is almost always: sort by date, group by entity, then compute a rolling stat per group. The groupby().transform() tip from Section 4 applies here too:

# Per-program 10-student rolling mean of average_marks
# transform keeps the result aligned to the original index
df["marks_prog_rolling10"] = (
    df.sort_values("average_marks")
    .groupby("program")["average_marks"]
    .transform(lambda s: s.rolling(10, min_periods=3).mean())
)
df[["program", "average_marks", "marks_prog_rolling10"]].dropna().head(6)
program average_marks marks_prog_rolling10
0 Information Technology 52.766667 52.410000
1 Information Technology 60.600000 60.450000
2 Information Technology 68.133333 67.816667
3 Information Technology 66.466667 66.356667
4 Information Technology 56.666667 56.316667
5 Information Technology 69.466667 68.976667

5b. Expressive Filtering with df.query()

Boolean masks work well for one or two conditions. For complex multi-condition filters, df.query() expresses the same logic as a readable string, closer to how you would say it out loud.

Reference an external variable inside a query string with the @ prefix:

# Boolean mask version -- correct but harder to read
mask = (df["midterm_score"] > 70) & (df["final_score"] > 70) & (df["program"].isin(["Engineering", "Sciences"]))
print(f"mask result  : {mask.sum()} rows")

# Same filter as a query string -- reads like a sentence
threshold = 70
result = df.query("midterm_score > @threshold and final_score > @threshold and program in ['Engineering', 'Sciences']")
print(f"query result : {len(result)} rows")
mask result  : 14 rows
query result : 14 rows

Pro Tip: Use query() for readability, masks for dynamic conditions

df.query() is evaluated with numexpr when the library is installed, which makes it faster than the equivalent boolean mask on large DataFrames. The downside: the query string is harder to build programmatically. Use query() when you are writing a fixed filter that a reader should understand at a glance; use a mask when the conditions are assembled at runtime from user input or a config.

Activity 5 - Smoothed Grade Distribution

Goal: Sort the full DataFrame by average_marks ascending. Compute a 20-student rolling mean of average_marks. Then use df.query() to find students with a rolling mean above 75 who are in the “Sciences” program. Print how many rows match.
df_q = df.sort_values("average_marks").reset_index(drop=True)
df_q["rolling_mean"] = df_q["average_marks"].rolling(20).mean()
result = df_q.query("rolling_mean > 75 and program == 'Sciences'")
print(len(result))
# TODO: rolling mean, then query-filter to Sciences students above rolling threshold
...

Capstone: Risk and Performance Report

Combine everything from this notebook into one short report: a derived grade column, a categorical breakdown, a text-column check, and a summary statistic, the same operations used in any first pass over a new dataset.

Capstone Exercise - Risk and Performance Report

Goal:
  1. Confirm every student_id matches the pattern S followed by 4 digits, using .str.match() (Sec. 3)
  2. Print the final_grade distribution as proportions with .value_counts(normalize=True) (Sec. 1, Sec. 2)
  3. Use .agg() to compute the mean average_marks for passed vs failed students separately (Sec. 4)
valid_ids = df["student_id"].str.match(r"^S\d{4}$").all()

grade_distribution = df["final_grade"].value_counts(normalize=True)

passing = df[df["passed"] == True]["average_marks"].agg(["mean"])  # noqa: E712
failing = df[df["passed"] == False]["average_marks"].agg(["mean"])  # noqa: E712
# TODO: build the risk and performance report described above
...

Further Reading

Resource Why it matters
McKinney, W. (2022). Python for Data Analysis, 3rd ed. O’Reilly. Chapter 7 (data cleaning) and Chapter 10 (aggregation) are the canonical references for the operations in this notebook
pandas documentation — Indexing and selecting data Covers .loc, .iloc, boolean indexing, and MultiIndex in exhaustive detail
pandas documentation — Group by: split-apply-combine Official guide to groupby, transform, and apply; the examples use the same column types as this notebook
Wickham, H. (2014). Tidy data. Journal of Statistical Software 59(10). The conceptual framework behind .melt() and .pivot_table() introduced later in Part 10

Summary

Concept Key rule
str dtype Pandas 3’s default for text columns, backed by PyArrow, replacing object
Copy-on-Write Selecting a subset always behaves as an independent copy; the original is never modified through it
Series.map() One-to-one substitution with a dict or simple function
Series.apply() Run any function, including multi-branch logic, one value at a time
DataFrame.apply(axis=1) Run a function once per row, with the whole row available
Vectorized ops vs apply Prefer a boolean mask or arithmetic when one exists; apply is the fallback, not the default
.value_counts(normalize=True) Category proportions in one call
.isin() Filter rows whose value is in a given list
.astype("category") Compact storage for a column with a small, fixed set of values
pd.get_dummies() One-hot encode a categorical column before fitting a model
.str accessor Required for any string method on a Series; calling the method directly raises AttributeError
.str.extract() Pull a regex capture group into a new column
.agg([...]) Compute several statistics for several columns in one call, instead of chaining single-statistic calls
rolling(n) Compute a statistic over a sliding window of n rows; sort first so the window is meaningful
expanding() Cumulative statistic from the start of the series to the current row
groupby().transform(lambda s: s.rolling(...)) Per-group rolling stat, result aligned to original index
df.query("expr") Readable multi-condition filter; use @var to reference external variables

Next: 10-combining-reshaping.ipynb, covering concatenation, merging, groupby, pivot tables, and time series.