Part 10: Combining, Reshaping & Time Series

Open In Colab Download Notebook

DS-MLOps Data Analysis

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook continues from Part 9 (09-pandas-operations.ipynb). Real analyses rarely live in one table: a second file with reference data shows up, a result needs splitting by category, and a wide summary table is easier to read than a long one. This part covers four multi-table operations and adds time-indexed data as a fifth theme: concatenation, merging, groupby, pivoting, and working with dates and resample.

Alongside the student exam results, we use a small synthetic table mapping each school_id to a region, and a second dataset of daily attendance records covering five schools over a full school term.

Part 11 (11-polars.ipynb) continues with Polars and the expression API.

Topic Why it matters
pd.concat Stack DataFrames on top of each other, or side by side
pd.merge Combine two tables on a shared key, the same idea as a SQL join
groupby Split a dataset into groups, apply a function to each, combine the results
pivot_table Reshape a long result into a wide, readable summary table
Timestamp and to_datetime Parse text into dates; the building block for all time-indexed work
DatetimeIndex An index made of dates unlocks date-based slicing instead of position-only
resample Change the time granularity of a series: daily to weekly, weekly to monthly

Callout markers used throughout this notebook are explained on the book cover page.

By the end of Part 10 you will be able to:

# Skill Covered in
1 Stack DataFrames with pd.concat, by row and by column Sec. 1
2 Combine two tables on a shared key with pd.merge, and choose the right how Sec. 2
2b Predict row counts after a merge and diagnose unexpected fan-out Sec. 2
3 Split a dataset into groups and aggregate each one with groupby Sec. 3
4 Reshape a grouped result into a wide summary with pivot_table Sec. 4
5 Parse text into datetime values with to_datetime and extract date components Sec. 5
6 Build a DatetimeIndex and use it to slice a time series by date Sec. 6
7 Select rows with a partial date string or a date range Sec. 7
8 Change the time granularity of a series with resample Sec. 8
import numpy as np
import pandas as pd

df = pd.read_csv("data/university_analytics.csv")
df["average_marks"] = (df["midterm_score"] + df["final_score"] + df["project_score"]) / 3

# courses.csv is a thin reference table — one row per course — useful for joins
courses = pd.read_csv("data/courses.csv")
df.head()
student_id cohort program gender region guardian has_internet course_id course semester enrollment_date study_hours attendance_pct midterm_score final_score project_score final_grade passed average_marks
0 S0001 2023 Information Technology F South Father True C01 Python Programming Fall 2023 2023-09-04 10.8 88.6 46.1 54.4 57.8 D True 52.766667
1 S0001 2023 Information Technology F South Father True C02 Statistics Spring 2024 2024-01-15 16.1 71.5 49.9 64.9 67.0 C True 60.600000
2 S0001 2023 Information Technology F South Father True C03 Data Structures Fall 2023 2023-09-04 23.6 71.5 57.3 64.1 83.0 C True 68.133333
3 S0001 2023 Information Technology F South Father True C04 Linear Algebra Spring 2024 2024-01-15 4.7 78.3 75.6 59.5 64.3 C True 66.466667
4 S0001 2023 Information Technology F South Father True C05 Machine Learning Fall 2023 2023-09-04 16.2 63.2 53.6 59.4 57.0 C True 56.666667

1. Concatenating DataFrames with pd.concat

pd.concat stacks DataFrames together. By default it stacks rows on top of each other (axis=0), the operation you need when the same columns show up in two separate files or two separate batches:

male_students = df[df["gender"] == "M"]
female_students = df[df["gender"] == "F"]

recombined = pd.concat([male_students, female_students], axis=0)
print(f"male: {len(male_students)}, female: {len(female_students)}, combined: {len(recombined)}")
male: 1128, female: 1170, combined: 2298

Passing axis=1 stacks columns side by side instead, matching rows up by index. This is the shape you get when a calculation is done separately and needs joining back onto the original table:

pass_fail = (df["average_marks"] >= 0.6).rename("passed")
with_pass_fail = pd.concat([df[["student_id", "average_marks"]], pass_fail], axis=1)
with_pass_fail.head()
student_id average_marks passed
0 S0001 52.766667 True
1 S0001 60.600000 True
2 S0001 68.133333 True
3 S0001 66.466667 True
4 S0001 56.666667 True

Common Mistake: Concatenating by column without matching indexes first

pd.concat([df_a, df_b], axis=1) lines rows up by index position, not by any shared key. If df_a and df_b were filtered, sorted, or reset differently beforehand, row 0 in one might not be row 0 in the other, and the result silently combines the wrong rows. pd.merge (Sec. 2) is the safer choice whenever there is an actual key column to join on.

Activity 1 - Split and Recombine

Goal: Split df into two DataFrames by caste: one for “BC” and one for everything else, using .isin() or a boolean mask. Concatenate them back with pd.concat and confirm the row count matches the original.
bc_only = df[df["program"] == "BC"]
not_bc = df[df["program"] != "BC"]
recombined = pd.concat([bc_only, not_bc], axis=0)
len(recombined) == len(df)
# TODO: split df by program == "Data Science", concat back, and confirm the row count matches
...

2. Joining Data with pd.merge

pd.merge combines two tables on a shared key, the same idea as a SQL join. The courses table loaded above has one row per course_id; merging it onto df attaches metadata (credits, department, instructor) to every enrollment row:

df_with_meta = pd.merge(df, courses, on="course_id", how="left")
df_with_meta[["student_id", "course_id", "course_x", "department", "credits"]].head()
student_id course_id course_x department credits
0 S0001 C01 Python Programming Computing 3
1 S0001 C02 Statistics Mathematics 3
2 S0001 C03 Data Structures Computing 4
3 S0001 C04 Linear Algebra Mathematics 3
4 S0001 C05 Machine Learning Computing 4

Join types: which rows survive each merge strategy

flowchart LR
    subgraph LEFT ["Left: students"]
        LR1["S001 | Alice"]
        LR2["S002 | Bob"]
        LR3["S003 | Carol"]
    end
    subgraph RIGHT ["Right: grades"]
        RR1["S001 | 85"]
        RR2["S002 | 72"]
        RR4["S004 | 90"]
    end
    LR1 & RR1 --> INN["inner: S001 (match only)"]
    LR2 & RR2 --> INN2["inner: S002 (match only)"]
    LR1 & LR2 & LR3 --> LEFT2["left: S001, S002, S003
(Carol grade = NaN)"]
    RR1 & RR2 & RR4 --> RIGHT2["right: S001, S002, S004
(Alice/Bob keep grade, S004 name = NaN)"]
    LR1 & LR2 & LR3 & RR4 --> OUTER["outer: all 4 students, NaN where no match"]

    style INN fill:#EBF5F0,stroke:#059669,color:#065F46
    style INN2 fill:#EBF5F0,stroke:#059669,color:#065F46
    style LEFT2 fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E
    style RIGHT2 fill:#F5F3FF,stroke:#7C3AED,color:#3B0764
    style OUTER fill:#FEF2F2,stroke:#DC2626,color:#991B1B

The how argument decides which rows survive when a key on one side has no match on the other. courses here has a row for every course_id in df, so every how gives the same result; the difference only shows up once the two tables disagree:

Example: When how actually changes the result

Merging df against only half of courses with how=“inner” keeps only enrollments whose course is in that half. The same merge with how=“left” keeps every student, with region set to NaN wherever the school was missing from the smaller table.

half_courses = courses[courses["department"] == "Computing"]

inner_result = pd.merge(df, half_courses, on="course_id", how="inner")
left_result = pd.merge(df, half_courses, on="course_id", how="left")
missing = left_result["department"].isna().sum()
print(f"inner: {len(inner_result)} rows, left: {len(left_result)} rows, unmatched dept: {missing}")
inner: 1600 rows, left: 2400 rows, unmatched dept: 800

The four join types differ only in which rows they keep, never in how rows are matched:

from ark.plot.diagrams import merge_join_types_diagram

merge_join_types_diagram();

Common Mistake: Forgetting that an unmatched key produces NaN, not a dropped row

With how=“left”, an enrollment whose course_id has no match in courses still gets a row, just with NaN in every column that came from courses. Code that assumes every row has a real region after a left merge will silently miscount or mis-group those rows later, usually in a groupby a few cells down.

Duplicate keys: when merge produces more rows than you expect

Every example so far used a key where the right table had at most one match per left row (many-to-one). When the right table has more than one match, the result has more rows than either input, and this is correct but often surprising.

Concrete example: if a student is enrolled in three courses, a left merge of the enrollment table against a course details table produces three rows for that student, one per course. That is the right answer, but code that assumed row counts stay the same will silently operate on a bigger table.

# Create a small example: two students, one appearing in two courses
students_small = pd.DataFrame(
    {
        "student_id": ["S0001", "S0001", "S0002"],
        "course_id": ["C01", "C02", "C01"],
        "score": [85, 72, 91],
    }
)

course_info = pd.DataFrame(
    {
        "course_id": ["C01", "C02"],
        "course_name": ["Statistics", "Algorithms"],
        "credits": [3, 4],
    }
)

merged_dup = pd.merge(students_small, course_info, on="course_id", how="left")
print(f"students_small rows: {len(students_small)}")
print(f"after merge rows   : {len(merged_dup)}")
merged_dup
students_small rows: 3
after merge rows   : 3
student_id course_id score course_name credits
0 S0001 C01 85 Statistics 3
1 S0001 C02 72 Algorithms 4
2 S0002 C01 91 Statistics 3

Key Concept: A many-to-one merge preserves row count; a one-to-many merge multiplies it

Many-to-one (the common case): each row on the left matches at most one row on the right. Row count stays the same.

One-to-many: one row on the left matches several rows on the right. Each match becomes its own output row. A student enrolled in 3 courses becomes 3 rows after merging on a course-detail table.

Many-to-many: both sides have duplicates on the key. Every left row pairs with every matching right row: 3 left rows matching 4 right rows produces 12 output rows. This is almost never what you want and is usually a sign of a design error in the data or the query.

# Verify: check how many rows the real df gains after merging on course_id
before = len(df)
after = len(pd.merge(df, courses, on="course_id", how="left"))
print(f"Before merge: {before} rows")
print(f"After  merge: {after} rows")
print(f"Difference  : {after - before} extra rows (students with multiple course matches)")

# Diagnose: how many course_ids appear more than once in courses?
dupe_courses = courses[courses.duplicated("course_id", keep=False)]
print(f"\nDuplicate course_ids in courses.csv: {dupe_courses['course_id'].nunique()}")
Before merge: 2400 rows
After  merge: 2400 rows
Difference  : 0 extra rows (students with multiple course matches)

Duplicate course_ids in courses.csv: 0
Common Mistake: Summing a column after an unexpected fan-out

If df has one row per student and groupby(“student_id”)[“score”].sum() gives the right answer before a merge, but the merge doubles some rows, the same sum() after the merge silently counts those scores twice. The most reliable diagnostic is to compare len(df) before and after any merge. A row count increase without a planned one-to-many join is a bug, not a feature.
# Always verify after merging
before = len(df)
merged = pd.merge(df, other_table, on="key", how="left")
assert len(merged) == before, f"Unexpected fan-out: {before} -> {len(merged)} rows"

Activity 2 - Count Enrollments Per Department

Goal: Merge df with courses using how=“left”, then use .value_counts() on the resulting department column.egion column to see how many student rows fall in each region.
merged = pd.merge(df, courses, on="course_id", how="left")
merged["department"].value_counts()
# TODO: merge df with courses (how="left"), then value_counts on department
...

3. Group By Operations

groupby splits a dataset into groups by a key, applies a function to each group independently, and combines the results back into one table, one row per group:

from ark.plot.diagrams import groupby_split_apply_combine_diagram

groupby_split_apply_combine_diagram();

df.groupby("program")["average_marks"].mean()
program
Computer Science          62.001468
Data Science              63.685264
Engineering               60.603548
Information Technology    59.920338
Name: average_marks, dtype: float64

Key Concept: groupby computes nothing until you aggregate

df.groupby(“program”) on its own returns a DataFrameGroupBy object, a plan for splitting the data, not a result. Nothing is actually computed until a method like .mean(), .sum(), or .agg() is called on it, the same lazy-then-compute pattern you will see again in Part 11’s Polars notebook.

.agg() works after groupby exactly as it did in Part 2, computing several statistics per group in one call:

df.groupby("program")["average_marks"].agg(["mean", "std", "count"])
mean std count
program
Computer Science 62.001468 8.084360 704
Data Science 63.685264 8.457177 803
Engineering 60.603548 7.907489 310
Information Technology 59.920338 8.394286 513

Grouping by more than one column splits into one group per combination, here gender within caste:

df.groupby(["program", "gender"])["average_marks"].mean()
program                 gender
Computer Science        F         61.961067
                        M         61.872101
                        Other     63.088652
Data Science            F         63.518499
                        M         63.971014
                        Other     62.415385
Engineering             F         61.998272
                        M         59.527619
Information Technology  F         59.831481
                        M         60.243373
                        Other     55.083333
Name: average_marks, dtype: float64

Pro Tip: observed=True when grouping a category column

Grouping a column with category dtype (Part 2, Sec. 2) by default includes every category the dtype knows about, even ones with zero rows in the current data, padding the result with empty groups. Passing observed=True to groupby keeps only the categories that actually appear.

Activity 3 - Dropout Rate by Caste

Goal: Group df by caste and compute, for each group, what fraction of students have continue_drop == “drop”. (series == “drop”).mean() gives a fraction directly, no separate count and divide needed.
df.groupby("program")["continue_drop"].apply(lambda s: (s == "drop").mean())
# TODO: group by program, compute the pass fraction per program
...

4. Pivoting Data

pivot_table is groupby plus a reshape, in one call: it groups by one column, splits further by another, and lays the second grouping out as columns instead of stacking it into more rows, much easier to scan as a summary:

pd.pivot_table(df, index="program", columns="gender", values="average_marks", aggfunc="mean")
gender F M Other
program
Computer Science 61.961067 61.872101 63.088652
Data Science 63.518499 63.971014 62.415385
Engineering 61.998272 59.527619 NaN
Information Technology 59.831481 60.243373 55.083333

Key Concept: pivot_table is groupby with a different output shape

df.groupby([“program”, “gender”])[“average_marks”].mean() from Sec. 3 and the pivot_table call above compute the exact same numbers. groupby returns them stacked into a long Series with a two-level index; pivot_table lays the second key out across columns instead. Reach for pivot_table specifically when the result is meant to be read by a person, not processed further by code.

aggfunc accepts a list too, giving more than one statistic per cell:

pd.pivot_table(df, index="program", columns="gender", values="average_marks", aggfunc=["mean", "count"])
mean count
gender F M Other F M Other
program
Computer Science 61.961067 61.872101 63.088652 381.0 276.0 47.0
Data Science 63.518499 63.971014 62.415385 373.0 391.0 39.0
Engineering 61.998272 59.527619 NaN 135.0 175.0 NaN
Information Technology 59.831481 60.243373 55.083333 252.0 249.0 12.0

Activity 4 - Region by Program Pivot

Goal: Merge df with courses (Sec. 2), then build a pivot table with region as the index, program as the columns, and the mean average_marks in each cell.
merged = pd.merge(df, courses, on="course_id", how="left")
pd.pivot_table(merged, index="region", columns="program", values="average_marks", aggfunc="mean")
# TODO: merge with courses, then pivot region x program on mean final_score
...

Capstone: Regional Performance Report

Combine every operation from this notebook into one report: a merge to bring in region, a groupby to summarize, and a pivot to lay the summary out for reading.

Capstone Exercise - Regional Performance Report

Goal:
  1. Merge df with courses on course_id, keeping every student (Sec. 2)
  2. Group the merged table by region and compute the mean average_marks and the drop fraction, using the function from Activity 3 (Sec. 3)
  3. Build a pivot table with region as rows and caste as columns, mean average_marks in each cell (Sec. 4)
merged = pd.merge(df, courses, on="course_id", how="left")

regional_summary = merged.groupby("region").agg(
    mean_marks=("average_marks", "mean"),
    drop_fraction=("continue_drop", lambda s: (s == "drop").mean()),
)

region_caste_pivot = pd.pivot_table(merged, index="region", columns="program", values="average_marks", aggfunc="mean")
# TODO: build the regional performance report described above
...

5. Time Series Data

The student exam results dataset has no date column. The attendance dataset does: daily records for five schools across a full school term. Loading it alongside the exam data gives Part 10 two tables to work with and shows how the same pandas idioms scale to a new problem shape.

attendance = pd.read_csv("data/daily_attendance.csv")
attendance.dtypes
date                   str
school_id            int64
attendance_rate    float64
dtype: object

5. Date and Time Data Types

The date column above read in as plain text, str dtype, not a date pandas can do arithmetic on. pd.to_datetime converts it to pandas’ dedicated datetime dtype, datetime64:

attendance["date"] = pd.to_datetime(attendance["date"])
attendance.dtypes
date               datetime64[us]
school_id                   int64
attendance_rate           float64
dtype: object

Key Concept: Pandas 3 infers the resolution it needs, not always nanoseconds

Earlier pandas versions always stored datetimes as datetime64[ns], nanosecond precision, whether the data needed it or not. Pandas 3’s to_datetime infers a resolution from what is actually in the data: day-level strings like the ones here become datetime64[s] or coarser, not nanoseconds. attendance[“date”].dtype shows whichever resolution was inferred for this column.

A single value out of a datetime column is a Timestamp, pandas’ equivalent of Python’s datetime.datetime, with the same year, month, day, and weekday attributes:

first_day = attendance["date"].iloc[0]
print(type(first_day))
print(f"year={first_day.year}, month={first_day.month}, day_name={first_day.day_name()}")
<class 'pandas.Timestamp'>
year=2025, month=1, day_name=Monday

Activity 5 - Parse and Inspect

Goal: Convert a list of three date strings, [“2025-01-06”, “2025-02-14”, “2025-03-28”], to Timestamp values with pd.to_datetime, then print the day name of each one.
dates = pd.to_datetime(["2025-01-06", "2025-02-14", "2025-03-28"])
for d in dates:
    print(d.day_name())
# TODO: convert the three date strings and print each day name
...

6. The DatetimeIndex

Setting a datetime column as the index turns it into a DatetimeIndex, which unlocks date-based slicing instead of only position-based or exact-label lookups. Each school’s rows are pulled out first, since a DatetimeIndex only makes sense for one time series at a time:

school_300 = attendance[attendance["school_id"] == 300].set_index("date")
school_300.index
DatetimeIndex(['2025-01-06', '2025-01-07', '2025-01-08', '2025-01-09',
               '2025-01-10', '2025-01-13', '2025-01-14', '2025-01-15',
               '2025-01-16', '2025-01-17', '2025-01-20', '2025-01-21',
               '2025-01-22', '2025-01-23', '2025-01-24', '2025-01-27',
               '2025-01-28', '2025-01-29', '2025-01-30', '2025-01-31',
               '2025-02-03', '2025-02-04', '2025-02-05', '2025-02-06',
               '2025-02-07', '2025-02-10', '2025-02-11', '2025-02-12',
               '2025-02-13', '2025-02-14', '2025-02-17', '2025-02-18',
               '2025-02-19', '2025-02-20', '2025-02-21', '2025-02-24',
               '2025-02-25', '2025-02-26', '2025-02-27', '2025-02-28',
               '2025-03-03', '2025-03-04', '2025-03-05', '2025-03-06',
               '2025-03-07', '2025-03-10', '2025-03-11', '2025-03-12',
               '2025-03-13', '2025-03-14', '2025-03-17', '2025-03-18',
               '2025-03-19', '2025-03-20', '2025-03-21', '2025-03-24',
               '2025-03-25', '2025-03-26', '2025-03-27', '2025-03-28'],
              dtype='datetime64[us]', name='date', freq=None)

Common Mistake: Setting the index before filtering to one entity

attendance.set_index(“date”) on the full table produces a DatetimeIndex with the same date repeated once per school, since every school has a row for every day. Slicing that index by date then returns rows from every school mixed together for that date, not a clean single time series. Filter to one entity first, exactly as school_300 does above, then set the index.

Activity 6 - Build Another School’s Series

Goal: Filter attendance to school_id == 302, set date as the index, and confirm the result’s index is a DatetimeIndex with isinstance(result.index, pd.DatetimeIndex).
school_302 = attendance[attendance["school_id"] == 302].set_index("date")
isinstance(school_302.index, pd.DatetimeIndex)
# TODO: filter to school_id 302, set date as index, confirm DatetimeIndex
...

7. Selecting Data from a Time Series

A DatetimeIndex accepts a partial date string in .loc, matching every row that falls inside it. "2025-02" selects the whole month without spelling out the first and last day:

school_300.loc["2025-02"].head()
school_id attendance_rate
date
2025-02-03 300 0.885
2025-02-04 300 0.942
2025-02-05 300 0.900
2025-02-06 300 0.892
2025-02-07 300 0.895

A slice with two partial dates selects everything between them, inclusive of both ends:

school_300.loc["2025-02-01":"2025-02-07"]
school_id attendance_rate
date
2025-02-03 300 0.885
2025-02-04 300 0.942
2025-02-05 300 0.900
2025-02-06 300 0.892
2025-02-07 300 0.895

Example: Comparing the size of two date ranges

len(school_300.loc[“2025-01”]) against len(school_300.loc[“2025-02”]) confirms the row count for each month matches its number of business days, the same bdate_range weekday-only pattern used to build this dataset in the first place.

print(f"January business days  : {len(school_300.loc['2025-01'])}")
print(f"February business days : {len(school_300.loc['2025-02'])}")
January business days  : 20
February business days : 20

Activity 7 - Filter the Last Two Weeks of Term

Goal: Select every row in school_300 from “2025-03-15” to “2025-03-28” inclusive, and print the mean attendance_rate over that range.
last_two_weeks = school_300.loc["2025-03-15":"2025-03-28"]
last_two_weeks["attendance_rate"].mean()
# TODO: select 2025-03-15 through 2025-03-28 and print the mean attendance_rate
...

8. The Power of Pandas: resample

resample changes the time granularity of a series: daily data summarized into weekly or monthly figures, the same split-apply-combine idea from Part 3’s groupby, except the groups are time intervals instead of category values:

weekly_attendance = school_300["attendance_rate"].resample("W").mean()
weekly_attendance.head()
date
2025-01-12    0.8896
2025-01-19    0.9044
2025-01-26    0.9144
2025-02-02    0.9054
2025-02-09    0.9028
Freq: W-SUN, Name: attendance_rate, dtype: float64

Key Concept: resample groups by time interval, groupby groups by value

df.groupby(“program”) (Part 3) splits rows by whatever value is already in the caste column. series.resample(“W”) splits rows by which week their DatetimeIndex label falls into, intervals that did not exist as a column at all until resample created them. Both still end with an aggregation like .mean() to combine each group into one number.

Monthly resampling on the same series needs only a different frequency string. In pandas 3 the old "M" alias was removed; the replacement is "ME" (month-end), which anchors each bucket to the last calendar day of the month:

monthly_attendance = school_300["attendance_rate"].resample("ME").mean()
monthly_attendance
date
2025-01-31    0.90345
2025-02-28    0.90985
2025-03-31    0.86510
Freq: ME, Name: attendance_rate, dtype: float64

Pro Tip: Resampling the whole DataFrame keeps every entity separate, with care

attendance.set_index(“date”).groupby(“school_id”)[“attendance_rate”].resample(“W”).mean() resamples each school’s series independently in one call, instead of looping over schools and resampling each one by hand. groupby before resample is what keeps the schools from being averaged together.

Activity 8 - Monthly Comparison Across Schools

Goal: Set date as the index on the full attendance table, group by school_id, and resample to monthly means in one chained call. Use “ME” (month-end) — the pandas 3 replacement for the old “M” alias.
attendance.set_index("date").groupby("school_id")["attendance_rate"].resample("ME").mean()
# TODO: set date as index, group by school_id, resample monthly, take the mean
...

Capstone: Term-End Attendance Report

Combine every operation from this notebook: parsing dates, building a per-school time series, slicing by date, and resampling, into one short report comparing the start and end of term.

Capstone Exercise - Term-End Attendance Report

Goal:
  1. Set date as the index on the full attendance table (Sec. 2)
  2. Group by school_id and resample to weekly means (Sec. 4)
  3. From the result, select the first week of January and the last week of March for every school, using a partial date string (Sec. 3)
  4. Report which school had the largest drop in attendance between those two weeks
weekly = attendance.set_index("date").groupby("school_id")["attendance_rate"].resample("W").mean()

first_week = weekly.loc[:, "2025-01-06":"2025-01-12"]
last_week = weekly.loc[:, "2025-03-24":"2025-03-28"]
drop_per_school = first_week.groupby("school_id").mean() - last_week.groupby("school_id").mean()
drop_per_school.sort_values(ascending=False)
# TODO: build the term-end attendance report described above
...

Further Reading

Resource Why it matters
McKinney, W. (2022). Python for Data Analysis, 3rd ed. O’Reilly. Chapter 8 (data wrangling) and Chapter 10 (aggregation) are the canonical references for concat, merge, and groupby
pandas documentation — Merge, join, concatenate and compare Full API reference with worked examples for every how= variant and join edge-case
Wickham, H. (2014). Tidy data. Journal of Statistical Software 59(10). Free PDF — defines when data is “tidy” and explains .melt() / .pivot() as transformations toward tidy form
pandas documentation — Reshaping and pivot tables .pivot_table(), .stack(), .unstack(), and .crosstab() in one place

Summary

Concept Key rule
pd.concat(axis=0) Stack rows; use when the same columns appear in separate batches
pd.concat(axis=1) Stack columns by index position, not by key; prefer merge when there is a real key
pd.merge(..., on=key, how=...) Combine two tables on a shared key, the same idea as a SQL join
how="inner" Keep only rows with a match on both sides
how="left" Keep every row from the left table; unmatched rows get NaN
Many-to-one merge Row count stays the same; each left row matches at most one right row
One-to-many merge Row count grows; each left row can match multiple right rows
Fan-out check Always compare len(before) vs len(after) when a row count increase is not expected
groupby(...) Returns a plan, not a result; nothing computes until you aggregate
.agg([...]) after groupby Several statistics per group, in one call
pivot_table groupby with the second key laid out as columns instead of stacked as rows
pd.to_datetime Parses text into pandas’ datetime64 dtype; pandas 3 infers the resolution
Timestamp A single datetime value, with .year, .month, .day_name(), and similar attributes
DatetimeIndex Set a datetime column as the index to unlock date-based slicing
Filter before indexing Set a DatetimeIndex on one entity’s rows, not a table mixing several entities
.loc["2025-02"] A partial date string selects every row inside that period
.loc[start:end] A date range slice is inclusive of both ends
.resample(freq) Groups rows by time interval instead of by value, then needs an aggregation like .mean()
groupby(...).resample(...) Resample each entity’s series independently in one chained call

Next: 11-polars.ipynb, covering Polars’ DataFrame, expression API, and lazy evaluation.