Chapter 8: Pandas core

You have a CSV. It has 2,400 rows, 10 columns, and three questions from your manager: which schools have the highest pass rate? Is there a relationship between teacher count and average score? How did results change between semesters?

NumPy can store the numbers. A plain Python dict can hold the column names. But neither of them knows what a labelled row is. pandas gives every row an index, every column a name, and your 2,400-row question becomes three lines of code.

pandas builds directly on the arrays from Chapter 7: it wraps that same numerical core with a label for every row and every column. This notebook covers the shaping work: loading a file, inspecting its structure, selecting rows and columns, filtering, and fixing missing values. Chapter 9 (09-pandas-operations.ipynb) builds on this foundation with groupby, merge, and apply: the operations you reach for in every real analysis.

Callout markers: book cover page.

The tool that thinks in tables

pandas (pandas.pydata.org) was created at AQR Capital in 2008 by Wes McKinney to bring the R data frame into Python. The idea was to give Python scientists a two-dimensional, labelled data structure with built-in operations for filtering, grouping, joining, and reshaping: the things you actually do with tabular data, every day. It became the de-facto standard for data analysis in Python and the input format expected by every ML library in the ecosystem.

How it compares to alternatives

Library	Strengths	When to reach for it instead
pandas (pandas.pydata.org)	Mature, enormous ecosystem, readable API	Your default for anything up to a few million rows
Polars (pola.rs)	Rust core, multi-threaded, lazy evaluation	Datasets too large for pandas to fit comfortably in RAM; see Chapter 12
Dask (dask.org)	Distributed pandas on a cluster	Multi-machine computation, out-of-core data
Modin (modin.readthedocs.io)	pandas API, parallel execution	Parallelism without changing your pandas code
cuDF (docs.rapids.ai)	GPU-accelerated pandas	Large datasets on NVIDIA GPUs

For most data science and ML work pandas is the right starting point. You will meet Polars in Chapter 12, when the scenario specifically calls for speed at scale.

Already in your environment

pandas is in pyproject.toml. For a standalone project:

uv add pandas

Official docs: pandas.pydata.org/docs

Learning objectives

By the end of Chapter 8 you will be able to:

#	Skill	Covered in
1	Explain the relationship between a Series, a DataFrame, and a NumPy array	Sec. 1
2	Load a CSV and run the first checks any new dataset deserves	Sec. 2
3	Select columns and rows correctly with `loc` and `iloc`	Sec. 3
4	Filter rows with boolean conditions	Sec. 4
5	Add, modify, and drop columns without the inplace trap	Sec. 5
6	Sort a DataFrame by one or more columns	Sec. 6
7	Find, count, and handle missing values	Sec. 7
8	Profile memory usage and reduce it with the categorical dtype	Sec. 8

1. Series and DataFrame

pandas gives you two data structures, and almost everything else in the library is a method on one of them. A Series is a one-dimensional, labelled array: a NumPy array with an index attached. A DataFrame is a two-dimensional table: a collection of Series that all share the same index, one Series per column.

import numpy as np
import pandas as pd

# A Series is a NumPy array plus a label for every position.
marks = pd.Series([62, 78, 91], index=["s001", "s002", "s003"])
print(marks)
print(f"underlying NumPy array: {marks.to_numpy()}")

s001    62
s002    78
s003    91
dtype: int64
underlying NumPy array: [62 78 91]

Two panels. Left: Series panel with index (0,1,2) and values (85,72,91). Right: DataFrame panel with shared index and four named columns. Arrow between them labeled many Series.

A Series is a single labelled column with an index. A DataFrame is multiple Series sharing the same index, forming a table.

A DataFrame is what you get when several Series that share an index sit next to each other as columns. Building one from a dict of equal-length arrays, the same pattern used to build the feature matrices in the NumPy notebook, makes this explicit:

data = pd.DataFrame(
    {
        "student_id": ["s001", "s002", "s003"],
        "midterm_score": [62, 78, 91],
        "gender": ["F", "M", "F"],
    }
)
data

	student_id	midterm_score	gender
0	s001	62	F
1	s002	78	M
2	s003	91	F

Key Concept: A DataFrame is a dict of same-length Series

Every column of a DataFrame is a Series sharing the DataFrame’s index. data[“midterm_score”] returns that column as a standalone Series, with the same index, the same values, and every NumPy method still available on it through .to_numpy(). This is why the array skills from the NumPy notebook keep paying off here: a pandas column is a NumPy array with row labels attached, not a different kind of thing.

Before pandas, you’d write statistics.mean(scores) to get the mean of a list. With pandas you write df['score'].mean(): the same idea, but it works across any column in a table, handles missing values automatically, and stays readable when you chain it with other operations.

Activity 1 - Build a DataFrame from Two Series

Go from individual Series to a joined DataFrame.

1. Create a Series for three students’ midterm scores with student ID labels:
mid = pd.Series([85, 72, 91], index=[‘s001’,‘s002’,‘s003’], name=‘midterm’)
2. Create a matching Series for final scores:
fin = pd.Series([78, 88, 65], index=[‘s001’,‘s002’,‘s003’], name=‘final’)
3. Combine into a DataFrame: df_two = pd.DataFrame({‘midterm’: mid, ‘final’: fin})
4. Check that both series share the same index: df_two.index
5. Add a third column inline: df_two[‘average’] = (df_two[‘midterm’] + df_two[‘final’]) / 2

Expected: a 3x3 DataFrame with matching index labels.

2. Reading and inspecting data

pd.read_csv is almost always the first line of a real analysis. The first thing to do with whatever it returns is never to start analysing it: it’s to check what you actually got.

df = pd.read_csv("data/university_analytics.csv")
print(f"shape   : {df.shape}")
df.head()

shape   : (2400, 18)

	student_id	cohort	program	gender	region	guardian	has_internet	course_id	course	semester	enrollment_date	study_hours	attendance_pct	midterm_score	final_score	project_score	final_grade	passed
0	S0001	2023	Information Technology	F	South	Father	True	C01	Python Programming	Fall 2023	2023-09-04	10.8	88.6	49.9	58.2	61.6	C	True
1	S0001	2023	Information Technology	F	South	Father	True	C02	Statistics	Spring 2024	2024-01-15	16.1	71.5	48.7	63.7	65.8	C	True
2	S0001	2023	Information Technology	F	South	Father	True	C03	Data Structures	Fall 2023	2023-09-04	23.6	71.5	54.1	60.9	79.8	C	True
3	S0001	2023	Information Technology	F	South	Father	True	C04	Linear Algebra	Spring 2024	2024-01-15	4.7	78.3	71.4	55.3	60.1	C	True
4	S0001	2023	Information Technology	F	South	Father	True	C05	Machine Learning	Fall 2023	2023-09-04	16.2	63.2	48.4	54.2	51.8	D	True

By default pd.read_csv() infers types using NumPy dtypes. Passing dtype_backend='pyarrow' uses Apache Arrow memory instead: string columns stay as ArrowDtype(pa.large_string()) rather than object, nullable integers avoid the forced float conversion, and memory use drops for large CSVs.

df_arrow = pd.read_csv("data/university_analytics.csv", dtype_backend="pyarrow")
print(df_arrow.dtypes)
arrow_kb = df_arrow.memory_usage(deep=True).sum() / 1024
default_kb = df.memory_usage(deep=True).sum() / 1024
print(f"Memory: {arrow_kb:.1f} KB (arrow)  vs  {default_kb:.1f} KB (default)")

student_id         string[pyarrow]
cohort              int64[pyarrow]
program            string[pyarrow]
gender             string[pyarrow]
region             string[pyarrow]
guardian           string[pyarrow]
has_internet         bool[pyarrow]
course_id          string[pyarrow]
course             string[pyarrow]
semester           string[pyarrow]
enrollment_date    string[pyarrow]
study_hours        double[pyarrow]
attendance_pct     double[pyarrow]
midterm_score      double[pyarrow]
final_score        double[pyarrow]
project_score      double[pyarrow]
final_grade        string[pyarrow]
passed               bool[pyarrow]
dtype: object
Memory: 372.0 KB (arrow)  vs  469.2 KB (default)

Key Concept: dtype_backend=‘pyarrow’ changes representation, not the API

dtype_backend=‘pyarrow’ changes the in-memory representation but not the API: every df method still works. The benefit shows at scale: string columns move from Python-object overhead to Arrow’s zero-copy buffers. Use it for any CSV with many text columns or when you will later pass the DataFrame to Polars.

.dtypes shows what type pandas inferred for every column, .info() adds memory usage and non-null counts in one call, and .describe() summarises every numeric column at once:

df.dtypes

student_id             str
cohort               int64
program                str
gender                 str
region                 str
guardian               str
has_internet          bool
course_id              str
course                 str
semester               str
enrollment_date        str
study_hours        float64
attendance_pct     float64
midterm_score      float64
final_score        float64
project_score      float64
final_grade            str
passed                bool
dtype: object

df.info()

<class 'pandas.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   student_id       2400 non-null   str    
 1   cohort           2400 non-null   int64  
 2   program          2400 non-null   str    
 3   gender           2400 non-null   str    
 4   region           2400 non-null   str    
 5   guardian         2400 non-null   str    
 6   has_internet     2400 non-null   bool   
 7   course_id        2400 non-null   str    
 8   course           2400 non-null   str    
 9   semester         2400 non-null   str    
 10  enrollment_date  2400 non-null   str    
 11  study_hours      2400 non-null   float64
 12  attendance_pct   2327 non-null   float64
 13  midterm_score    2330 non-null   float64
 14  final_score      2400 non-null   float64
 15  project_score    2400 non-null   float64
 16  final_grade      2400 non-null   str    
 17  passed           2400 non-null   bool   
dtypes: bool(2), float64(5), int64(1), str(10)
memory usage: 469.2 KB

df.describe().T

	count	mean	std	min	25%	50%	75%	max
cohort	2400.0	2023.007500	0.798567	2022.0	2022.00	2023.00	2024.0	2024.0
study_hours	2400.0	18.029333	6.881890	2.0	13.30	17.80	22.6	45.0
attendance_pct	2327.0	75.057542	14.115225	28.2	64.80	75.00	85.2	100.0
midterm_score	2330.0	60.518541	13.455787	15.9	51.20	60.50	70.0	100.0
final_score	2400.0	59.496958	15.348690	10.0	48.90	60.00	70.0	100.0
project_score	2400.0	64.963833	11.096192	18.6	57.65	65.05	72.6	100.0

Pro Tip: Run these three checks on every new dataset, before anything else

.shape, .dtypes, and .describe() take seconds and catch most of the embarrassing mistakes early: a column pandas read as text when it should be numeric, a date stored as a string, a column that is almost entirely one value. Cheaper to catch here than three analysis steps later.

Activity 2 - First Look

Print the column names with df.columns, then print how many unique values the caste column has using .nunique().

df.columns
df["program"].nunique()
# -> 4

# TODO: print df.columns, then the number of unique values in "program"
...

3. Selecting rows and columns

df["col"] selects one column as a Series; df[["col1", "col2"]] selects several columns as a DataFrame, note the double brackets. Selecting rows by label or by position needs .loc or .iloc, and the two are not interchangeable.

Key Concept: loc selects by label; iloc selects by integer position

df.loc[0, ‘midterm_score’] uses the index label 0 and the column name. df.iloc[0, 2] uses integer positions regardless of labels. Never chain them: df[‘col’][row] may silently create a copy and your assignment won’t modify the original DataFrame. Use df.loc[row, ‘col’] = value instead.

one_column = df["midterm_score"]
several_columns = df[["student_id", "midterm_score", "final_score"]]
print(type(one_column), type(several_columns))

<class 'pandas.Series'> <class 'pandas.DataFrame'>

.loc selects by label: the value in the index, or a column name. .iloc selects by integer position, exactly like a NumPy array, regardless of what the labels actually are:

# .loc: by label. Row label 0, column label "midterm_score".
print(df.loc[0, "midterm_score"])

# .iloc: by position. First row, second column, whatever they are labelled.
print(df.iloc[0, 1])

# A label-based row slice with .loc is inclusive of both ends.
df.loc[0:2, ["student_id", "midterm_score"]]

49.9
2023

	student_id	midterm_score
0	S0001	49.9
1	S0001	48.7
2	S0001	54.1

Common Mistake: Treating .loc and .iloc as interchangeable

df.loc[0:2] and df.iloc[0:2] can return a different number of rows. .iloc slicing behaves exactly like a Python list: the end is excluded. .loc slicing is inclusive of the end label, because that label might not even be an integer; it could be a date or a name. The two only happen to look similar when the index is the default 0, 1, 2, … range, which is exactly when this mistake hides best.

Activity 3 - Select a Slice

Using .iloc, select rows 10 through 14 (5 rows) and only the student_id and project_score columns.

df.iloc[10:15, [?, ?]]
# Hint: find the integer position of each column with df.columns.get_loc(...)

# TODO: select rows 10-14 and the student_id/project_score columns using .iloc
...

4. Boolean filtering

This is the same trick from the NumPy notebook’s boolean masking section, applied to a DataFrame instead of an array: a comparison produces a Series of True/False, and indexing with that Series keeps only the True rows.

Key Concept: A boolean mask is just a Series of True/False: combine with & and |

df[‘score’] > 80 returns a Series of True/False with the same index as the DataFrame. df[mask] keeps only the rows where the mask is True. Combine two masks with & (and) or | (or), always wrapped in parentheses: (df[‘score’] > 80) & (df[‘program’] == ‘CS’). Never use Python’s and/or on a Series: they raise a ValueError.

high_scorers = df["midterm_score"] > 90
print(f"high scorers: {high_scorers.sum()} of {len(df)} students")

df[high_scorers].head()

high scorers: 36 of 2400 students

	student_id	cohort	program	gender	region	guardian	has_internet	course_id	course	semester	enrollment_date	study_hours	attendance_pct	midterm_score	final_score	project_score	final_grade	passed
6	S0002	2023	Data Science	M	West	Mother	True	C01	Python Programming	Fall 2023	2023-09-04	17.4	67.0	93.6	60.5	67.0	B	True
130	S0022	2022	Data Science	F	North	Mother	True	C05	Machine Learning	Fall 2022	2022-09-05	22.4	63.2	95.7	45.1	53.2	C	True
150	S0026	2024	Computer Science	M	Central	Father	True	C01	Python Programming	Fall 2024	2024-09-02	26.7	85.3	93.9	51.2	69.5	C	True
159	S0027	2024	Data Science	F	East	Other	True	C04	Linear Algebra	Spring 2025	2025-01-13	26.2	78.0	93.3	68.5	64.2	B	True
255	S0043	2024	Information Technology	M	East	Mother	True	C04	Linear Algebra	Spring 2025	2025-01-13	12.8	85.5	92.3	51.4	62.4	C	True

Combine conditions with & and |, the same operators and the same parentheses requirement as NumPy, not Python’s and/or:

struggling_without_internet = (df["final_grade"] == "F") & (~df["has_internet"])
print(f"failed students without internet: {struggling_without_internet.sum()}")

df.loc[struggling_without_internet, ["student_id", "final_grade", "has_internet"]].head()

failed students without internet: 76

	student_id	final_grade	has_internet
16	S0003	F	False
38	S0007	F	False
40	S0007	F	False
49	S0009	F	False
66	S0012	F	False

Activity 4 - Female High Scorers

Count how many female students (gender == “F”) scored above 80 in final_score.

condition = (df["gender"] == "F") & (df["final_score"] > 80)
condition.sum()

# TODO: count female students with final_score > 80
...

5. Adding, Modifying, and Dropping Columns

Assigning to a new column name creates it; assigning to an existing one overwrites it. Both work the same way and both operate on the whole column at once, no loop required:

df["average_marks"] = (df["midterm_score"] + df["final_score"] + df["project_score"]) / 3
df[["student_id", "average_marks"]].head()

	student_id	average_marks
0	S0001	56.566667
1	S0001	59.400000
2	S0001	64.933333
3	S0001	62.266667
4	S0001	51.466667

.drop() removes rows or columns. axis=1 means columns, axis=0 (the default) means rows, the same convention as NumPy’s axis parameter:

# axis=1: drop a column. Returns a NEW DataFrame; df itself is unchanged
# unless you reassign it.
without_internet_col = df.drop("has_internet", axis=1)
print(f"original columns : {len(df.columns)}")
print(f"after drop        : {len(without_internet_col.columns)}")

original columns : 19
after drop        : 18

Common Mistake: Reaching for inplace=True by default

df.drop(“has_internet”, axis=1, inplace=True) looks like it saves memory by not creating a copy. It does not, in most pandas versions it builds the new data and discards the old either way, and it makes code harder to debug: a function that mutates its input instead of returning a new one is a common source of “why did my DataFrame change” bugs. Prefer df = df.drop(…), explicit and easy to trace.

Key Concept: Copy-on-Write is the default in pandas 3

Pandas 3 enables Copy-on-Write (CoW) by default. Every operation on a subset of a DataFrame returns a new independent object instead of a view into the original. Two rules follow directly:

Direct column assignment works fine: df[“average_marks”] = … modifies df through setitem, not through a copy, so it still does what you expect.
Chained assignment is silently dropped: df[df[“gender”] == “F”][“passed”] = True looks like it updates df, but the inner df[mask] now returns a copy. The assignment goes to that temporary copy and is immediately lost. The fix is one step: df.loc[df[“gender”] == “F”, “passed”] = True.

CoW is why inplace=True (above) is being phased out: explicit reassignment (df = df.drop(…)) is both clearer and CoW-compatible. Chapter 9 shows the same rule applied to groupby and transform.

Activity 5 - Add a Pass/Fail Column

Add a new column passed that is True when average_marks is at least 0.5, using np.where the same way it was used in the NumPy notebook.

df["passed"] = np.where(df["average_marks"] >= 0.5, True, False)
df["passed"].value_counts()

# TODO: add the "passed" column
...

6. Sorting

.sort_values() sorts by one or more columns; .sort_index() sorts by the index instead. Neither changes df unless you reassign it, same rule as every method so far:

Key Concept: sort_values returns a new DataFrame; display order means nothing about rank

.sort_values() returns a new DataFrame by default. The order rows appear in a DataFrame has no meaning until you sort: the top row isn’t the highest value unless you asked for it. Pass ascending=False for descending order; pass a list of column names to sort by multiple keys.

top_5 = df.sort_values("average_marks", ascending=False).head(5)
top_5[["student_id", "average_marks"]]

	student_id	average_marks
2250	S0376	89.666667
918	S0154	88.866667
1950	S0326	88.566667
2090	S0349	88.033333
719	S0120	87.933333

# Sort by multiple columns: gender first, then average_marks descending within each
by_gender_then_marks = df.sort_values(["gender", "average_marks"], ascending=[True, False])
by_gender_then_marks[["student_id", "gender", "average_marks"]].head(6)

	student_id	gender	average_marks
1950	S0326	F	88.566667
2058	S0344	F	87.000000
491	S0082	F	84.900000
1703	S0284	F	84.066667
474	S0080	F	83.500000
2220	S0371	F	83.400000

Activity 6 - Multi-Column Sort

Sort by two columns and explain what changes.

1. Sort the DataFrame by program ascending, then average_marks descending within each program:
df.sort_values([‘program’, ‘average_marks’], ascending=[True, False]).head(10)
2. Print the original df.head(10): compare the index values.
3. In a markdown cell, answer: did the index values change, or just the row order?
4. Now use .reset_index(drop=True) after sorting and explain what it does.

Expected: rows reordered; original index labels stay attached to their rows until reset.

7. Handling Missing Data

total_toilets and establishment_year are missing for some schools in this dataset, encoded as NaN. .isna() finds them, .dropna() removes rows that have them, and .fillna() replaces them, three different decisions with three different consequences for an analysis.

Key Concept: Missing values are np.nan: detect with isna(), not == None

df.isna() returns a boolean mask of missing positions. df[‘col’].isna().sum() counts how many are missing. fillna(value) replaces them; dropna() removes the rows. Never test x == np.nan: it always returns False because NaN != NaN by definition.

missing_per_column = df.isna().sum()
print(missing_per_column[missing_per_column > 0])

attendance_pct    73
midterm_score     70
average_marks     70
dtype: int64

Dropping every row with any missing value is the simplest option and often the wrong one: it can throw away far more data than the missing values themselves justify:

print(f"rows before dropna : {len(df)}")
print(f"rows after dropna  : {len(df.dropna())}")
print(f"rows lost          : {len(df) - len(df.dropna())}")

rows before dropna : 2400
rows after dropna  : 2258
rows lost          : 142

Pro Tip: Decide what missing means before you decide how to handle it

A missing total_toilets value might mean the school never reported it, which is different from meaning zero. Filling with the column mean (df[“total_toilets”].fillna(df[“total_toilets”].mean())) assumes the missing schools look like the average school. Dropping those rows assumes they are not worth analysing at all. Both are real decisions with real consequences, not a technicality to get past quickly.

Activity 7 - Fill With the Median

Fill the missing total_toilets values with the column’s median instead of dropping those rows, and confirm there are no missing values left in that column afterward.

median_toilets = df["total_toilets"].median()
df["total_toilets"] = df["total_toilets"].fillna(median_toilets)
df["total_toilets"].isna().sum()  # -> 0

# TODO: fill missing total_toilets with the column median
...

8. Memory usage and the categorical dtype

A 2,400-row dataset fits in memory with ease, but the patterns here scale to millions of rows where they matter. Understanding what pandas keeps in memory and how to reduce it’s a practical skill, not an optimization curiosity.

df.memory_usage(deep=True) shows how many bytes each column occupies. deep=True is required for string columns: without it, pandas reports only the object pointer size, not the actual string bytes.

mem = df.memory_usage(deep=True)
mem_mb = (mem / 1024**2).round(3)
print(mem_mb.sort_values(ascending=False))
print(f"\nTotal: {mem_mb.sum():.3f} MB")

program            0.053
course             0.050
semester           0.041
enrollment_date    0.041
guardian           0.032
region             0.030
student_id         0.030
course_id          0.025
gender             0.021
final_grade        0.021
final_score        0.018
cohort             0.018
average_marks      0.018
study_hours        0.018
midterm_score      0.018
attendance_pct     0.018
project_score      0.018
has_internet       0.002
passed             0.002
Index              0.000
dtype: float64

Total: 0.474 MB

Key Concept: String columns dominate memory in most DataFrames

An object or str column stores a Python string object per row: the string plus Python object overhead. A numeric column stores a flat NumPy array: 8 bytes per row for float64, 4 for float32. A column like program with only 4 distinct values across 2,400 rows wastes most of that memory repeating the same strings. The categorical dtype solves this by storing each unique string once and keeping only a compact integer code per row.

# Compare memory: str dtype vs category dtype
before = df["program"].memory_usage(deep=True)
df["program"] = df["program"].astype("category")
after = df["program"].memory_usage(deep=True)

print(f"str dtype    : {before:,} bytes")
print(f"category     : {after:,} bytes")
print(f"reduction    : {(1 - after / before) * 100:.1f}%")

str dtype    : 55,998 bytes
category     : 2,626 bytes
reduction    : 95.3%

# Any low-cardinality string column is a candidate for category dtype
low_cardinality = [c for c in df.select_dtypes("object").columns if df[c].nunique() < 50]
print("Low-cardinality columns:", low_cardinality)

for col in low_cardinality:
    df[col] = df[col].astype("category")

print(f"Memory after converting all: {df.memory_usage(deep=True).sum() / 1024**2:.3f} MB")

Low-cardinality columns: ['gender', 'region', 'guardian', 'course_id', 'course', 'semester', 'enrollment_date', 'final_grade']
Memory after converting all: 0.184 MB

Pro Tip: Use category dtype early; undo it before one-hot encoding

Convert low-cardinality string columns to category dtype right after loading, before any analysis. It speeds up groupby and value_counts and reduces memory. The one time to undo it: pd.get_dummies() (Chapter 9) works on category columns but produces the same output, so it’s not strictly required, just worth knowing the pattern.

Activity 8 - Measure and Reduce Memory

Load the dataset fresh, print its total memory usage in MB with memory_usage(deep=True).sum(), convert the gender and guardian columns to category dtype, and print the new total. What percentage of memory did you save?

df_fresh = pd.read_csv("data/university_analytics.csv")
before_mb = df_fresh.memory_usage(deep=True).sum() / 1024**2
df_fresh["gender"] = df_fresh["gender"].astype("category")
df_fresh["guardian"] = df_fresh["guardian"].astype("category")
after_mb = df_fresh.memory_usage(deep=True).sum() / 1024**2
print(f"Before: {before_mb:.3f} MB, After: {after_mb:.3f} MB")
print(f"Saving: {(1 - after_mb/before_mb)*100:.1f}%")

# TODO: measure and reduce memory as described above
...

Capstone: top schools report

Combine everything from this notebook: select, filter, add a column, sort, and handle missing data, to answer one question. Which schools have the highest average student performance, among schools with reasonable internet access?

Capstone Exercise - Top Schools by Performance

Fill missing total_toilets values with the column median (Sec. 7)
Filter to students with internet == 1
Group by school_id and compute the mean average_marks per school (a preview of Chapter 3’s groupby, used here in its simplest one-line form)
Sort the result descending and show the top 5 school IDs

with_internet = df[df["has_internet"] == 1]
school_means = with_internet.groupby("school_id")["average_marks"].mean()
school_means.sort_values(ascending=False).head(5)

# TODO: build the top-5-schools report described above
...

Resource	Why it matters
McKinney, W. (2022). Python for Data Analysis, 3rd ed. O’Reilly.	The book by pandas’ creator; free online at wesmckinney.com/book: Chapters 5 and 6 map directly to this notebook
10 minutes to pandas	The official fast-start guide; read it after this notebook to see everything in one place
pandas 3.0 migration guide	Covers Copy-on-Write and the new `str` dtype: the two biggest breaking changes from pandas 2
Harris, C.R. et al. (2020). Array programming with NumPy. Nature 585, 357–362.	pandas DataFrames are built on NumPy arrays; understanding arrays makes every pandas index operation predictable

Summary

Concept	Key rule
Series	A NumPy array with a label for every position
DataFrame	A collection of same-length Series sharing one index, one Series per column
`pd.read_csv`	Always follow with `.shape`, `.dtypes`, `.describe()` before analysing anything
`df["col"]` vs `df[["col"]]`	Single brackets return a Series, double brackets return a DataFrame
`.loc`	Selects by label; slices are inclusive of the end label
`.iloc`	Selects by integer position; slices exclude the end, like a Python list
Boolean filtering	`(cond1) & (cond2)`, parentheses required, same as NumPy
Adding columns	`df["new"] = ...` operates on the whole column, no loop needed
`inplace=True`	Avoid it; reassign with `df = df.method(...)` instead
`.sort_values()`	Pass a list of columns to sort by more than one, with a matching list for `ascending`
Missing data	`.isna()` to find it, `.dropna()`/`.fillna()` to handle it, decide what missing means first
`memory_usage(deep=True)`	Shows true memory cost per column, including string bytes; use before optimising
`.astype("category")`	Stores each unique value once + an integer code per row; cuts memory for low-cardinality columns
Low-cardinality threshold	A rule of thumb: columns with fewer than 50 distinct values are worth converting

Next: 09-pandas-operations.ipynb, covering row-wise transformations with apply, string methods, and descriptive statistics on this same dataset.