This notebook assumes you have completed Parts 1-7 (the Python Basics section), especially the NumPy notebook (Part 4): pandas builds directly on NumPy arrays, and a lot of what looks new here is really a NumPy idea wearing a label.
Every example in this part uses a real dataset: exam results for 2,400 students across several schools, with columns covering marks, teacher counts, school size, and whether each student continued or dropped out. Real data means real messiness, missing values included, which is exactly what makes it worth learning on.
Part 9 (09-pandas-operations.ipynb) continues with row-wise transformations, string methods, and descriptive statistics on this same dataset.
NoteTopics covered
Topic
Why it matters
Series and DataFrame
The two objects every pandas operation works on
Reading data
pd.read_csv and the first checks every dataset deserves
Selecting rows and columns
loc vs iloc, and why mixing them up is the most common pandas bug
Boolean filtering
The same masking idea from NumPy, applied to a labelled table
Missing data
Real datasets have gaps; pandas gives you tools to see and handle them, not ignore them
Memory and categorical dtype
Profile what a DataFrame costs in RAM; reduce it for columns with few distinct values
Callout markers used throughout this notebook are explained on the book cover page.
NoteLearning Objectives
By the end of Part 8 you will be able to:
#
Skill
Covered in
1
Explain the relationship between a Series, a DataFrame, and a NumPy array
Sec. 1
2
Load a CSV and run the first checks any new dataset deserves
Sec. 2
3
Select columns and rows correctly with loc and iloc
Sec. 3
4
Filter rows with boolean conditions
Sec. 4
5
Add, modify, and drop columns without the inplace trap
Sec. 5
6
Sort a DataFrame by one or more columns
Sec. 6
7
Find, count, and handle missing values
Sec. 7
8
Profile memory usage and reduce it with the categorical dtype
Sec. 8
0. The Tool That Thinks in Tables
You have a CSV. It has 2,400 rows, 10 columns, and three questions from your manager: which schools have the highest pass rate? Is there a relationship between teacher count and average score? How did results change between semesters?
NumPy can store the numbers. A plain Python dict can hold the column names. But neither of them has a concept of a labelled row, a named column, or a group-by operation. You would spend more time writing plumbing than answering the questions.
pandas (pandas.pydata.org) was created at AQR Capital in 2008 by Wes McKinney to bring the R data frame into Python. The idea was to give Python scientists a two-dimensional, labelled data structure with built-in operations for filtering, grouping, joining, and reshaping — the things you actually do with tabular data, every day. It became the de-facto standard for data analysis in Python and the input format expected by every ML library in the ecosystem.
For most data science and ML work pandas is the right starting point. You will meet Polars in Part 11, when the scenario specifically calls for speed at scale.
Already in your environment
pandas is in pyproject.toml. For a standalone project:
pandas gives you two data structures, and almost everything else in the library is a method on one of them. A Series is a one-dimensional, labelled array: a NumPy array with an index attached. A DataFrame is a two-dimensional table: a collection of Series that all share the same index, one Series per column.
import numpy as npimport pandas as pd# A Series is a NumPy array plus a label for every position.marks = pd.Series([62, 78, 91], index=["s001", "s002", "s003"])print(marks)print(f"underlying NumPy array: {marks.to_numpy()}")
Relationship: Series are columns, DataFrame shares one Index
flowchart TB
subgraph S ["Series (1-D)"]
I1["Index: 0
1
2"] --- V1["Values: 85.0
72.0
91.0"]
LB["name: midterm_score"]
end
subgraph DF ["DataFrame (2-D)"]
direction LR
COL1["Series: midterm_score"]
COL2["Series: final_score"]
COL3["Series: program"]
IDX["shared Index: S0001, S0002, S0003"]
end
S -->|"pd.DataFrame({'mid': s1, 'fin': s2})"| DF
style DF fill:#EAF3FA,stroke:#0369A1
style S fill:#EBF5F0,stroke:#059669
A DataFrame is what you get when several Series that share an index sit next to each other as columns. Building one from a dict of equal-length arrays, the same pattern used to build the feature matrices in the NumPy notebook, makes this explicit:
Key Concept: A DataFrame is a dict of same-length Series
Every column of a DataFrame is a Series sharing the DataFrame’s index. data[“midterm_score”] returns that column as a standalone Series, with the same index, the same values, and every NumPy method still available on it through .to_numpy(). This is why the array skills from the NumPy notebook keep paying off here: a pandas column is a NumPy array with row labels attached, not a different kind of thing.
2. Reading and Inspecting Data
pd.read_csv is almost always the first line of a real analysis. The first thing to do with whatever it returns is never to start analysing it: it is to check what you actually got.
.dtypes shows what type pandas inferred for every column, .info() adds memory usage and non-null counts in one call, and .describe() summarises every numeric column at once:
Pro Tip: Run these three checks on every new dataset, before anything else
.shape, .dtypes, and .describe() take seconds and catch most of the embarrassing mistakes early: a column pandas read as text when it should be numeric, a date stored as a string, a column that is almost entirely one value. Cheaper to catch here than three analysis steps later.
Activity 1 - First Look
Goal: Print the column names with df.columns, then print how many unique values the caste column has using .nunique().
df.columns
df["program"].nunique()
# -> 4
# TODO: print df.columns, then the number of unique values in "program"...
3. Selecting Rows and Columns
df["col"] selects one column as a Series; df[["col1", "col2"]] selects several columns as a DataFrame, note the double brackets. Selecting rows by label or by position needs .loc or .iloc, and the two are not interchangeable.
.loc selects by label: the value in the index, or a column name. .iloc selects by integer position, exactly like a NumPy array, regardless of what the labels actually are:
# .loc: by label. Row label 0, column label "midterm_score".print(df.loc[0, "midterm_score"])# .iloc: by position. First row, second column, whatever they are labelled.print(df.iloc[0, 1])# A label-based row slice with .loc is inclusive of both ends.df.loc[0:2, ["student_id", "midterm_score"]]
46.1
2023
student_id
midterm_score
0
S0001
46.1
1
S0001
49.9
2
S0001
57.3
Common Mistake: Treating .loc and .iloc as interchangeable
df.loc[0:2] and df.iloc[0:2] can return a different number of rows. .iloc slicing behaves exactly like a Python list: the end is excluded. .loc slicing is inclusive of the end label, because that label might not even be an integer; it could be a date or a name. The two only happen to look similar when the index is the default 0, 1, 2, … range, which is exactly when this mistake hides best.
Activity 2 - Select a Slice
Goal: Using .iloc, select rows 10 through 14 (5 rows) and only the student_id and project_score columns.
df.iloc[10:15, [?, ?]]
# Hint: find the integer position of each column with df.columns.get_loc(...)
# TODO: select rows 10-14 and the student_id/project_score columns using .iloc...
4. Boolean Filtering
This is the same trick from the NumPy notebook’s boolean masking section, applied to a DataFrame instead of an array: a comparison produces a Series of True/False, and indexing with that Series keeps only the True rows.
high_scorers = df["midterm_score"] >90print(f"high scorers: {high_scorers.sum()} of {len(df)} students")df[high_scorers].head()
high scorers: 31 of 2400 students
student_id
cohort
program
gender
region
guardian
has_internet
course_id
course
semester
enrollment_date
study_hours
attendance_pct
midterm_score
final_score
project_score
final_grade
passed
70
S0012
2024
Engineering
M
North
Mother
False
C05
Machine Learning
Fall 2024
2024-09-02
23.2
87.3
90.3
55.3
56.4
C
True
130
S0022
2022
Data Science
F
North
Mother
True
C05
Machine Learning
Fall 2022
2022-09-05
22.4
63.2
98.7
48.1
56.2
C
True
150
S0026
2024
Computer Science
M
Central
Father
True
C01
Python Programming
Fall 2024
2024-09-02
26.7
85.3
90.3
47.6
65.9
C
True
159
S0027
2024
Data Science
F
East
Other
True
C04
Linear Algebra
Spring 2025
2025-01-13
26.2
78.0
95.7
70.9
66.6
B
True
255
S0043
2024
Information Technology
M
East
Mother
True
C04
Linear Algebra
Spring 2025
2025-01-13
12.8
85.5
94.7
53.8
64.8
C
True
Combine conditions with & and |, the same operators and the same parentheses requirement as NumPy, not Python’s and/or:
struggling_without_internet = (df["final_grade"] =="F") & (~df["has_internet"])print(f"failed students without internet: {struggling_without_internet.sum()}")df.loc[struggling_without_internet, ["student_id", "final_grade", "has_internet"]].head()
failed students without internet: 60
student_id
final_grade
has_internet
38
S0007
F
False
49
S0009
F
False
66
S0012
F
False
72
S0013
F
False
94
S0016
F
False
Activity 3 - Female High Scorers
Goal: Count how many female students (gender == “F”) scored above 80 in final_score.
# TODO: count female students with final_score > 80...
5. Adding, Modifying, and Dropping Columns
Assigning to a new column name creates it; assigning to an existing one overwrites it. Both work the same way and both operate on the whole column at once, no loop required:
.drop() removes rows or columns. axis=1 means columns, axis=0 (the default) means rows, the same convention as NumPy’s axis parameter:
# axis=1: drop a column. Returns a NEW DataFrame; df itself is unchanged# unless you reassign it.without_internet_col = df.drop("has_internet", axis=1)print(f"original columns : {len(df.columns)}")print(f"after drop : {len(without_internet_col.columns)}")
original columns : 19
after drop : 18
Common Mistake: Reaching for inplace=True by default
df.drop(“has_internet”, axis=1, inplace=True) looks like it saves memory by not creating a copy. It does not, in most pandas versions it builds the new data and discards the old either way, and it makes code harder to debug: a function that mutates its input instead of returning a new one is a common source of “why did my DataFrame change” bugs. Prefer df = df.drop(…), explicit and easy to trace.
Key Concept: Copy-on-Write is the default in pandas 3
Pandas 3 enables Copy-on-Write (CoW) by default. Every operation on a subset of a DataFrame returns a new independent object instead of a view into the original. Two rules follow directly:
Direct column assignment works fine:df[“average_marks”] = … modifies df through setitem, not through a copy, so it still does what you expect.
Chained assignment is silently dropped:df[df[“gender”] == “F”][“passed”] = True looks like it updates df, but the inner df[mask] now returns a copy. The assignment goes to that temporary copy and is immediately lost. The fix is one step: df.loc[df[“gender”] == “F”, “passed”] = True.
CoW is why inplace=True (above) is being phased out — explicit reassignment (df = df.drop(…)) is both clearer and CoW-compatible. Part 9 shows the same rule applied to groupby and transform.
Activity 4 - Add a Pass/Fail Column
Goal: Add a new column passed that is True when average_marks is at least 0.5, using np.where the same way it was used in the NumPy notebook.
.sort_values() sorts by one or more columns; .sort_index() sorts by the index instead. Neither changes df unless you reassign it, same rule as every method so far:
# Sort by multiple columns: gender first, then average_marks descending within eachby_gender_then_marks = df.sort_values(["gender", "average_marks"], ascending=[True, False])by_gender_then_marks[["student_id", "gender", "average_marks"]].head(6)
student_id
gender
average_marks
1703
S0284
F
83.800000
2193
S0366
F
81.966667
2247
S0375
F
81.866667
1950
S0326
F
81.766667
491
S0082
F
80.900000
638
S0107
F
80.866667
7. Handling Missing Data
total_toilets and establishment_year are missing for some schools in this dataset, encoded as NaN. .isna() finds them, .dropna() removes rows that have them, and .fillna() replaces them, three different decisions with three different consequences for an analysis.
Dropping every row with any missing value is the simplest option and often the wrong one: it can throw away far more data than the missing values themselves justify:
print(f"rows before dropna : {len(df)}")print(f"rows after dropna : {len(df.dropna())}")print(f"rows lost : {len(df) -len(df.dropna())}")
rows before dropna : 2400
rows after dropna : 2258
rows lost : 142
Pro Tip: Decide what missing means before you decide how to handle it
A missing total_toilets value might mean the school never reported it, which is different from meaning zero. Filling with the column mean (df[“total_toilets”].fillna(df[“total_toilets”].mean())) assumes the missing schools look like the average school. Dropping those rows assumes they are not worth analysing at all. Both are real decisions with real consequences, not a technicality to get past quickly.
Activity 5 - Fill With the Median
Goal: Fill the missing total_toilets values with the column’s median instead of dropping those rows, and confirm there are no missing values left in that column afterward.
# TODO: fill missing total_toilets with the column median...
8. Memory Usage and the Categorical Dtype
A 2,400-row dataset fits in memory with ease, but the patterns here scale to millions of rows where they matter. Understanding what pandas keeps in memory and how to reduce it is a practical skill, not an optimization curiosity.
df.memory_usage(deep=True) shows how many bytes each column occupies. deep=True is required for string columns: without it, pandas reports only the object pointer size, not the actual string bytes.
mem = df.memory_usage(deep=True)mem_mb = (mem /1024**2).round(3)print(mem_mb.sort_values(ascending=False))print(f"\nTotal: {mem_mb.sum():.3f} MB")
Key Concept: String columns dominate memory in most DataFrames
An object or str column stores a Python string object per row: the string plus Python object overhead. A numeric column stores a flat NumPy array: 8 bytes per row for float64, 4 for float32. A column like program with only 4 distinct values across 2,400 rows wastes most of that memory repeating the same strings. The categorical dtype solves this by storing each unique string once and keeping only a compact integer code per row.
# Any low-cardinality string column is a candidate for category dtypelow_cardinality = [c for c in df.select_dtypes("object").columns if df[c].nunique() <50]print("Low-cardinality columns:", low_cardinality)for col in low_cardinality: df[col] = df[col].astype("category")print(f"Memory after converting all: {df.memory_usage(deep=True).sum() /1024**2:.3f} MB")
Pro Tip: Use category dtype early; undo it before one-hot encoding
Convert low-cardinality string columns to category dtype right after loading, before any analysis. It speeds up groupby and value_counts and reduces memory. The one time to undo it: pd.get_dummies() (Part 9) works on category columns but produces the same output, so it is not strictly required, just worth knowing the pattern.
Activity 6 - Measure and Reduce Memory
Goal: Load the dataset fresh, print its total memory usage in MB with memory_usage(deep=True).sum(), convert the gender and guardian columns to category dtype, and print the new total. What percentage of memory did you save?
# TODO: measure and reduce memory as described above...
Capstone: Top Schools Report
Combine everything from this notebook: select, filter, add a column, sort, and handle missing data, to answer one question. Which schools have the highest average student performance, among schools with reasonable internet access?
Capstone Exercise - Top Schools by Performance
Goal:
Fill missing total_toilets values with the column median (Sec. 7)
Filter to students with internet == 1
Group by school_id and compute the mean average_marks per school (a preview of Part 3’s groupby, used here in its simplest one-line form)
Sort the result descending and show the top 5 school IDs