Part 8: Pandas Core

Open In Colab Download Notebook

DS-MLOps Data Analysis

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook assumes you have completed Parts 1-7 (the Python Basics section), especially the NumPy notebook (Part 4): pandas builds directly on NumPy arrays, and a lot of what looks new here is really a NumPy idea wearing a label.

Every example in this part uses a real dataset: exam results for 2,400 students across several schools, with columns covering marks, teacher counts, school size, and whether each student continued or dropped out. Real data means real messiness, missing values included, which is exactly what makes it worth learning on.

Part 9 (09-pandas-operations.ipynb) continues with row-wise transformations, string methods, and descriptive statistics on this same dataset.

Topic Why it matters
Series and DataFrame The two objects every pandas operation works on
Reading data pd.read_csv and the first checks every dataset deserves
Selecting rows and columns loc vs iloc, and why mixing them up is the most common pandas bug
Boolean filtering The same masking idea from NumPy, applied to a labelled table
Missing data Real datasets have gaps; pandas gives you tools to see and handle them, not ignore them
Memory and categorical dtype Profile what a DataFrame costs in RAM; reduce it for columns with few distinct values

Callout markers used throughout this notebook are explained on the book cover page.

By the end of Part 8 you will be able to:

# Skill Covered in
1 Explain the relationship between a Series, a DataFrame, and a NumPy array Sec. 1
2 Load a CSV and run the first checks any new dataset deserves Sec. 2
3 Select columns and rows correctly with loc and iloc Sec. 3
4 Filter rows with boolean conditions Sec. 4
5 Add, modify, and drop columns without the inplace trap Sec. 5
6 Sort a DataFrame by one or more columns Sec. 6
7 Find, count, and handle missing values Sec. 7
8 Profile memory usage and reduce it with the categorical dtype Sec. 8

0. The Tool That Thinks in Tables

You have a CSV. It has 2,400 rows, 10 columns, and three questions from your manager: which schools have the highest pass rate? Is there a relationship between teacher count and average score? How did results change between semesters?

NumPy can store the numbers. A plain Python dict can hold the column names. But neither of them has a concept of a labelled row, a named column, or a group-by operation. You would spend more time writing plumbing than answering the questions.

pandas (pandas.pydata.org) was created at AQR Capital in 2008 by Wes McKinney to bring the R data frame into Python. The idea was to give Python scientists a two-dimensional, labelled data structure with built-in operations for filtering, grouping, joining, and reshaping — the things you actually do with tabular data, every day. It became the de-facto standard for data analysis in Python and the input format expected by every ML library in the ecosystem.

How it compares to alternatives

Library Strengths When to reach for it instead
pandas (pandas.pydata.org) Mature, enormous ecosystem, readable API Your default for anything up to a few million rows
Polars (pola.rs) Rust core, multi-threaded, lazy evaluation Datasets too large for pandas to fit comfortably in RAM; see Part 11
Dask (dask.org) Distributed pandas on a cluster Multi-machine computation, out-of-core data
Modin (modin.readthedocs.io) pandas API, parallel execution Parallelism without changing your pandas code
cuDF (docs.rapids.ai) GPU-accelerated pandas Large datasets on NVIDIA GPUs

For most data science and ML work pandas is the right starting point. You will meet Polars in Part 11, when the scenario specifically calls for speed at scale.

Already in your environment

pandas is in pyproject.toml. For a standalone project:

uv add pandas          # or: pip install pandas

Official docs: pandas.pydata.org/docs

1. Series and DataFrame

pandas gives you two data structures, and almost everything else in the library is a method on one of them. A Series is a one-dimensional, labelled array: a NumPy array with an index attached. A DataFrame is a two-dimensional table: a collection of Series that all share the same index, one Series per column.

import numpy as np
import pandas as pd

# A Series is a NumPy array plus a label for every position.
marks = pd.Series([62, 78, 91], index=["s001", "s002", "s003"])
print(marks)
print(f"underlying NumPy array: {marks.to_numpy()}")
s001    62
s002    78
s003    91
dtype: int64
underlying NumPy array: [62 78 91]

Relationship: Series are columns, DataFrame shares one Index

flowchart TB
    subgraph S ["Series (1-D)"]
        I1["Index: 0
1
2"] --- V1["Values: 85.0
72.0
91.0"]
        LB["name: midterm_score"]
    end
    subgraph DF ["DataFrame (2-D)"]
        direction LR
        COL1["Series: midterm_score"]
        COL2["Series: final_score"]
        COL3["Series: program"]
        IDX["shared Index: S0001, S0002, S0003"]
    end
    S -->|"pd.DataFrame({'mid': s1, 'fin': s2})"| DF

    style DF fill:#EAF3FA,stroke:#0369A1
    style S fill:#EBF5F0,stroke:#059669

A DataFrame is what you get when several Series that share an index sit next to each other as columns. Building one from a dict of equal-length arrays, the same pattern used to build the feature matrices in the NumPy notebook, makes this explicit:

data = pd.DataFrame(
    {
        "student_id": ["s001", "s002", "s003"],
        "midterm_score": [62, 78, 91],
        "gender": ["F", "M", "F"],
    }
)
data
student_id midterm_score gender
0 s001 62 F
1 s002 78 M
2 s003 91 F

Key Concept: A DataFrame is a dict of same-length Series

Every column of a DataFrame is a Series sharing the DataFrame’s index. data[“midterm_score”] returns that column as a standalone Series, with the same index, the same values, and every NumPy method still available on it through .to_numpy(). This is why the array skills from the NumPy notebook keep paying off here: a pandas column is a NumPy array with row labels attached, not a different kind of thing.

2. Reading and Inspecting Data

pd.read_csv is almost always the first line of a real analysis. The first thing to do with whatever it returns is never to start analysing it: it is to check what you actually got.

df = pd.read_csv("data/university_analytics.csv")
print(f"shape   : {df.shape}")
df.head()
shape   : (2400, 18)
student_id cohort program gender region guardian has_internet course_id course semester enrollment_date study_hours attendance_pct midterm_score final_score project_score final_grade passed
0 S0001 2023 Information Technology F South Father True C01 Python Programming Fall 2023 2023-09-04 10.8 88.6 46.1 54.4 57.8 D True
1 S0001 2023 Information Technology F South Father True C02 Statistics Spring 2024 2024-01-15 16.1 71.5 49.9 64.9 67.0 C True
2 S0001 2023 Information Technology F South Father True C03 Data Structures Fall 2023 2023-09-04 23.6 71.5 57.3 64.1 83.0 C True
3 S0001 2023 Information Technology F South Father True C04 Linear Algebra Spring 2024 2024-01-15 4.7 78.3 75.6 59.5 64.3 C True
4 S0001 2023 Information Technology F South Father True C05 Machine Learning Fall 2023 2023-09-04 16.2 63.2 53.6 59.4 57.0 C True

.dtypes shows what type pandas inferred for every column, .info() adds memory usage and non-null counts in one call, and .describe() summarises every numeric column at once:

df.dtypes
student_id             str
cohort               int64
program                str
gender                 str
region                 str
guardian               str
has_internet          bool
course_id              str
course                 str
semester               str
enrollment_date        str
study_hours        float64
attendance_pct     float64
midterm_score      float64
final_score        float64
project_score      float64
final_grade            str
passed                bool
dtype: object
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   student_id       2400 non-null   str    
 1   cohort           2400 non-null   int64  
 2   program          2400 non-null   str    
 3   gender           2400 non-null   str    
 4   region           2400 non-null   str    
 5   guardian         2400 non-null   str    
 6   has_internet     2400 non-null   bool   
 7   course_id        2400 non-null   str    
 8   course           2400 non-null   str    
 9   semester         2400 non-null   str    
 10  enrollment_date  2400 non-null   str    
 11  study_hours      2400 non-null   float64
 12  attendance_pct   2327 non-null   float64
 13  midterm_score    2330 non-null   float64
 14  final_score      2400 non-null   float64
 15  project_score    2400 non-null   float64
 16  final_grade      2400 non-null   str    
 17  passed           2400 non-null   bool   
dtypes: bool(2), float64(5), int64(1), str(10)
memory usage: 469.2 KB
df.describe().T
count mean std min 25% 50% 75% max
cohort 2400.0 2023.007500 0.798567 2022.0 2022.000 2023.00 2024.0 2024.0
study_hours 2400.0 18.029333 6.881890 2.0 13.300 17.80 22.6 45.0
attendance_pct 2327.0 75.057542 14.115225 28.2 64.800 75.00 85.2 100.0
midterm_score 2330.0 60.846481 13.019003 17.0 51.725 60.85 69.6 100.0
final_score 2400.0 59.819500 15.076345 10.0 49.800 60.15 70.0 100.0
project_score 2400.0 65.280292 10.518972 15.0 58.300 65.40 72.5 100.0

Pro Tip: Run these three checks on every new dataset, before anything else

.shape, .dtypes, and .describe() take seconds and catch most of the embarrassing mistakes early: a column pandas read as text when it should be numeric, a date stored as a string, a column that is almost entirely one value. Cheaper to catch here than three analysis steps later.

Activity 1 - First Look

Goal: Print the column names with df.columns, then print how many unique values the caste column has using .nunique().
df.columns
df["program"].nunique()
# -> 4
# TODO: print df.columns, then the number of unique values in "program"
...

3. Selecting Rows and Columns

df["col"] selects one column as a Series; df[["col1", "col2"]] selects several columns as a DataFrame, note the double brackets. Selecting rows by label or by position needs .loc or .iloc, and the two are not interchangeable.

one_column = df["midterm_score"]
several_columns = df[["student_id", "midterm_score", "final_score"]]
print(type(one_column), type(several_columns))
<class 'pandas.Series'> <class 'pandas.DataFrame'>

.loc selects by label: the value in the index, or a column name. .iloc selects by integer position, exactly like a NumPy array, regardless of what the labels actually are:

# .loc: by label. Row label 0, column label "midterm_score".
print(df.loc[0, "midterm_score"])

# .iloc: by position. First row, second column, whatever they are labelled.
print(df.iloc[0, 1])

# A label-based row slice with .loc is inclusive of both ends.
df.loc[0:2, ["student_id", "midterm_score"]]
46.1
2023
student_id midterm_score
0 S0001 46.1
1 S0001 49.9
2 S0001 57.3

Common Mistake: Treating .loc and .iloc as interchangeable

df.loc[0:2] and df.iloc[0:2] can return a different number of rows. .iloc slicing behaves exactly like a Python list: the end is excluded. .loc slicing is inclusive of the end label, because that label might not even be an integer; it could be a date or a name. The two only happen to look similar when the index is the default 0, 1, 2, … range, which is exactly when this mistake hides best.

Activity 2 - Select a Slice

Goal: Using .iloc, select rows 10 through 14 (5 rows) and only the student_id and project_score columns.
df.iloc[10:15, [?, ?]]
# Hint: find the integer position of each column with df.columns.get_loc(...)
# TODO: select rows 10-14 and the student_id/project_score columns using .iloc
...

4. Boolean Filtering

This is the same trick from the NumPy notebook’s boolean masking section, applied to a DataFrame instead of an array: a comparison produces a Series of True/False, and indexing with that Series keeps only the True rows.

high_scorers = df["midterm_score"] > 90
print(f"high scorers: {high_scorers.sum()} of {len(df)} students")

df[high_scorers].head()
high scorers: 31 of 2400 students
student_id cohort program gender region guardian has_internet course_id course semester enrollment_date study_hours attendance_pct midterm_score final_score project_score final_grade passed
70 S0012 2024 Engineering M North Mother False C05 Machine Learning Fall 2024 2024-09-02 23.2 87.3 90.3 55.3 56.4 C True
130 S0022 2022 Data Science F North Mother True C05 Machine Learning Fall 2022 2022-09-05 22.4 63.2 98.7 48.1 56.2 C True
150 S0026 2024 Computer Science M Central Father True C01 Python Programming Fall 2024 2024-09-02 26.7 85.3 90.3 47.6 65.9 C True
159 S0027 2024 Data Science F East Other True C04 Linear Algebra Spring 2025 2025-01-13 26.2 78.0 95.7 70.9 66.6 B True
255 S0043 2024 Information Technology M East Mother True C04 Linear Algebra Spring 2025 2025-01-13 12.8 85.5 94.7 53.8 64.8 C True

Combine conditions with & and |, the same operators and the same parentheses requirement as NumPy, not Python’s and/or:

struggling_without_internet = (df["final_grade"] == "F") & (~df["has_internet"])
print(f"failed students without internet: {struggling_without_internet.sum()}")

df.loc[struggling_without_internet, ["student_id", "final_grade", "has_internet"]].head()
failed students without internet: 60
student_id final_grade has_internet
38 S0007 F False
49 S0009 F False
66 S0012 F False
72 S0013 F False
94 S0016 F False

Activity 3 - Female High Scorers

Goal: Count how many female students (gender == “F”) scored above 80 in final_score.
condition = (df["gender"] == "F") & (df["final_score"] > 80)
condition.sum()
# TODO: count female students with final_score > 80
...

5. Adding, Modifying, and Dropping Columns

Assigning to a new column name creates it; assigning to an existing one overwrites it. Both work the same way and both operate on the whole column at once, no loop required:

df["average_marks"] = (df["midterm_score"] + df["final_score"] + df["project_score"]) / 3
df[["student_id", "average_marks"]].head()
student_id average_marks
0 S0001 52.766667
1 S0001 60.600000
2 S0001 68.133333
3 S0001 66.466667
4 S0001 56.666667

.drop() removes rows or columns. axis=1 means columns, axis=0 (the default) means rows, the same convention as NumPy’s axis parameter:

# axis=1: drop a column. Returns a NEW DataFrame; df itself is unchanged
# unless you reassign it.
without_internet_col = df.drop("has_internet", axis=1)
print(f"original columns : {len(df.columns)}")
print(f"after drop        : {len(without_internet_col.columns)}")
original columns : 19
after drop        : 18

Common Mistake: Reaching for inplace=True by default

df.drop(“has_internet”, axis=1, inplace=True) looks like it saves memory by not creating a copy. It does not, in most pandas versions it builds the new data and discards the old either way, and it makes code harder to debug: a function that mutates its input instead of returning a new one is a common source of “why did my DataFrame change” bugs. Prefer df = df.drop(…), explicit and easy to trace.

Key Concept: Copy-on-Write is the default in pandas 3

Pandas 3 enables Copy-on-Write (CoW) by default. Every operation on a subset of a DataFrame returns a new independent object instead of a view into the original. Two rules follow directly:

  1. Direct column assignment works fine: df[“average_marks”] = … modifies df through setitem, not through a copy, so it still does what you expect.
  2. Chained assignment is silently dropped: df[df[“gender”] == “F”][“passed”] = True looks like it updates df, but the inner df[mask] now returns a copy. The assignment goes to that temporary copy and is immediately lost. The fix is one step: df.loc[df[“gender”] == “F”, “passed”] = True.

CoW is why inplace=True (above) is being phased out — explicit reassignment (df = df.drop(…)) is both clearer and CoW-compatible. Part 9 shows the same rule applied to groupby and transform.

Activity 4 - Add a Pass/Fail Column

Goal: Add a new column passed that is True when average_marks is at least 0.5, using np.where the same way it was used in the NumPy notebook.
df["passed"] = np.where(df["average_marks"] >= 0.5, True, False)
df["passed"].value_counts()
# TODO: add the "passed" column
...

6. Sorting

.sort_values() sorts by one or more columns; .sort_index() sorts by the index instead. Neither changes df unless you reassign it, same rule as every method so far:

top_5 = df.sort_values("average_marks", ascending=False).head(5)
top_5[["student_id", "average_marks"]]
student_id average_marks
2090 S0349 87.433333
719 S0120 85.933333
918 S0154 85.866667
650 S0109 85.233333
1839 S0307 84.366667
# Sort by multiple columns: gender first, then average_marks descending within each
by_gender_then_marks = df.sort_values(["gender", "average_marks"], ascending=[True, False])
by_gender_then_marks[["student_id", "gender", "average_marks"]].head(6)
student_id gender average_marks
1703 S0284 F 83.800000
2193 S0366 F 81.966667
2247 S0375 F 81.866667
1950 S0326 F 81.766667
491 S0082 F 80.900000
638 S0107 F 80.866667

7. Handling Missing Data

total_toilets and establishment_year are missing for some schools in this dataset, encoded as NaN. .isna() finds them, .dropna() removes rows that have them, and .fillna() replaces them, three different decisions with three different consequences for an analysis.

missing_per_column = df.isna().sum()
print(missing_per_column[missing_per_column > 0])
attendance_pct    73
midterm_score     70
average_marks     70
dtype: int64

Dropping every row with any missing value is the simplest option and often the wrong one: it can throw away far more data than the missing values themselves justify:

print(f"rows before dropna : {len(df)}")
print(f"rows after dropna  : {len(df.dropna())}")
print(f"rows lost          : {len(df) - len(df.dropna())}")
rows before dropna : 2400
rows after dropna  : 2258
rows lost          : 142

Pro Tip: Decide what missing means before you decide how to handle it

A missing total_toilets value might mean the school never reported it, which is different from meaning zero. Filling with the column mean (df[“total_toilets”].fillna(df[“total_toilets”].mean())) assumes the missing schools look like the average school. Dropping those rows assumes they are not worth analysing at all. Both are real decisions with real consequences, not a technicality to get past quickly.

Activity 5 - Fill With the Median

Goal: Fill the missing total_toilets values with the column’s median instead of dropping those rows, and confirm there are no missing values left in that column afterward.
median_toilets = df["total_toilets"].median()
df["total_toilets"] = df["total_toilets"].fillna(median_toilets)
df["total_toilets"].isna().sum()  # -> 0
# TODO: fill missing total_toilets with the column median
...

8. Memory Usage and the Categorical Dtype

A 2,400-row dataset fits in memory with ease, but the patterns here scale to millions of rows where they matter. Understanding what pandas keeps in memory and how to reduce it is a practical skill, not an optimization curiosity.

df.memory_usage(deep=True) shows how many bytes each column occupies. deep=True is required for string columns: without it, pandas reports only the object pointer size, not the actual string bytes.

mem = df.memory_usage(deep=True)
mem_mb = (mem / 1024**2).round(3)
print(mem_mb.sort_values(ascending=False))
print(f"\nTotal: {mem_mb.sum():.3f} MB")
program            0.053
course             0.050
semester           0.041
enrollment_date    0.041
guardian           0.032
region             0.030
student_id         0.030
course_id          0.025
gender             0.021
final_grade        0.021
final_score        0.018
cohort             0.018
average_marks      0.018
study_hours        0.018
midterm_score      0.018
attendance_pct     0.018
project_score      0.018
has_internet       0.002
passed             0.002
Index              0.000
dtype: float64

Total: 0.474 MB

Key Concept: String columns dominate memory in most DataFrames

An object or str column stores a Python string object per row: the string plus Python object overhead. A numeric column stores a flat NumPy array: 8 bytes per row for float64, 4 for float32. A column like program with only 4 distinct values across 2,400 rows wastes most of that memory repeating the same strings. The categorical dtype solves this by storing each unique string once and keeping only a compact integer code per row.

# Compare memory: str dtype vs category dtype
before = df["program"].memory_usage(deep=True)
df["program"] = df["program"].astype("category")
after = df["program"].memory_usage(deep=True)

print(f"str dtype    : {before:,} bytes")
print(f"category     : {after:,} bytes")
print(f"reduction    : {(1 - after / before) * 100:.1f}%")
str dtype    : 55,998 bytes
category     : 2,626 bytes
reduction    : 95.3%
# Any low-cardinality string column is a candidate for category dtype
low_cardinality = [c for c in df.select_dtypes("object").columns if df[c].nunique() < 50]
print("Low-cardinality columns:", low_cardinality)

for col in low_cardinality:
    df[col] = df[col].astype("category")

print(f"Memory after converting all: {df.memory_usage(deep=True).sum() / 1024**2:.3f} MB")
Low-cardinality columns: ['gender', 'region', 'guardian', 'course_id', 'course', 'semester', 'enrollment_date', 'final_grade']
Memory after converting all: 0.184 MB

Pro Tip: Use category dtype early; undo it before one-hot encoding

Convert low-cardinality string columns to category dtype right after loading, before any analysis. It speeds up groupby and value_counts and reduces memory. The one time to undo it: pd.get_dummies() (Part 9) works on category columns but produces the same output, so it is not strictly required, just worth knowing the pattern.

Activity 6 - Measure and Reduce Memory

Goal: Load the dataset fresh, print its total memory usage in MB with memory_usage(deep=True).sum(), convert the gender and guardian columns to category dtype, and print the new total. What percentage of memory did you save?
df_fresh = pd.read_csv("data/university_analytics.csv")
before_mb = df_fresh.memory_usage(deep=True).sum() / 1024**2
df_fresh["gender"] = df_fresh["gender"].astype("category")
df_fresh["guardian"] = df_fresh["guardian"].astype("category")
after_mb = df_fresh.memory_usage(deep=True).sum() / 1024**2
print(f"Before: {before_mb:.3f} MB, After: {after_mb:.3f} MB")
print(f"Saving: {(1 - after_mb/before_mb)*100:.1f}%")
# TODO: measure and reduce memory as described above
...

Capstone: Top Schools Report

Combine everything from this notebook: select, filter, add a column, sort, and handle missing data, to answer one question. Which schools have the highest average student performance, among schools with reasonable internet access?

Capstone Exercise - Top Schools by Performance

Goal:
  1. Fill missing total_toilets values with the column median (Sec. 7)
  2. Filter to students with internet == 1
  3. Group by school_id and compute the mean average_marks per school (a preview of Part 3’s groupby, used here in its simplest one-line form)
  4. Sort the result descending and show the top 5 school IDs
with_internet = df[df["has_internet"] == 1]
school_means = with_internet.groupby("school_id")["average_marks"].mean()
school_means.sort_values(ascending=False).head(5)
# TODO: build the top-5-schools report described above
...

Further Reading

Resource Why it matters
McKinney, W. (2022). Python for Data Analysis, 3rd ed. O’Reilly. The book by pandas’ creator; free online at wesmckinney.com/book — Chapters 5 and 6 map directly to this notebook
10 minutes to pandas The official fast-start guide; read it after this notebook to see everything in one place
pandas 3.0 migration guide Covers Copy-on-Write and the new str dtype — the two biggest breaking changes from pandas 2
Harris, C.R. et al. (2020). Array programming with NumPy. Nature 585, 357–362. pandas DataFrames are built on NumPy arrays; understanding arrays makes every pandas index operation predictable

Summary

Concept Key rule
Series A NumPy array with a label for every position
DataFrame A collection of same-length Series sharing one index, one Series per column
pd.read_csv Always follow with .shape, .dtypes, .describe() before analysing anything
df["col"] vs df[["col"]] Single brackets return a Series, double brackets return a DataFrame
.loc Selects by label; slices are inclusive of the end label
.iloc Selects by integer position; slices exclude the end, like a Python list
Boolean filtering (cond1) & (cond2), parentheses required, same as NumPy
Adding columns df["new"] = ... operates on the whole column, no loop needed
inplace=True Avoid it; reassign with df = df.method(...) instead
.sort_values() Pass a list of columns to sort by more than one, with a matching list for ascending
Missing data .isna() to find it, .dropna()/.fillna() to handle it, decide what missing means first
memory_usage(deep=True) Shows true memory cost per column, including string bytes; use before optimising
.astype("category") Stores each unique value once + an integer code per row; cuts memory for low-cardinality columns
Low-cardinality threshold A rule of thumb: columns with fewer than 50 distinct values are worth converting

Next: 09-pandas-operations.ipynb, covering row-wise transformations with apply, string methods, and descriptive statistics on this same dataset.