This notebook assumes you have completed Parts 8-11 (the Data Analysis section). Every example builds on the university_analytics.csv dataset and the ark.plot.gt_style module introduced alongside the plotting chapters.
A plain df.head() output serves you in a notebook. It does not serve a stakeholder, a report, or a slide deck. Great Tables (great_tables) is the Python library that bridges that gap: it wraps a pandas DataFrame in a fluent API and produces publication-ready HTML tables with precise column formatting, readable labels, conditional highlighting, and summary rows, all without any CSS knowledge.
Callout markers used throughout this notebook are explained on the book cover page.
Note: Learning Objectives
By the end of Part 12 you will be able to:

| # | Skill | Covered in |
| --- | --- | --- |
| 1 | Explain when a table is the right choice over a chart | Sec. 1 |
| 2 | Wrap a DataFrame with GT() and apply the project brand with themed_gt() | Sec. 2 |
| 3 | Format numbers, percentages, and missing values with fmt_* methods | Sec. 3 |
| 4 | Add readable column labels with cols_label and group related columns with tab_spanner | Sec. 4 |
| 5 | Target cells with loc and apply styling with tab_style | Sec. 5 |
| 6 | Add grand summary rows with grand_summary_rows | Sec. 6 |
| 7 | Build a model comparison table with metrics_report() | Sec. 7 |
    from great_tables import GT, loc, md as gt_md, style
    import numpy as np
    import pandas as pd

    from ark.plot.gt_style import metrics_report, themed_gt
    from ark.plot.tokens import PRIMARY, SUCCESS, SURFACE_MUTED

    df = pd.read_csv("data/university_analytics.csv")
    df["average_marks"] = (df["midterm_score"] + df["final_score"] + df["project_score"]) / 3
    df.head(3)
| | student_id | cohort | program | gender | region | guardian | has_internet | course_id | course | semester | enrollment_date | study_hours | attendance_pct | midterm_score | final_score | project_score | final_grade | passed | average_marks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | S0001 | 2023 | Information Technology | F | South | Father | True | C01 | Python Programming | Fall 2023 | 2023-09-04 | 10.8 | 88.6 | 46.1 | 54.4 | 57.8 | D | True | 52.766667 |
| 1 | S0001 | 2023 | Information Technology | F | South | Father | True | C02 | Statistics | Spring 2024 | 2024-01-15 | 16.1 | 71.5 | 49.9 | 64.9 | 67.0 | C | True | 60.600000 |
| 2 | S0001 | 2023 | Information Technology | F | South | Father | True | C03 | Data Structures | Fall 2023 | 2023-09-04 | 23.6 | 71.5 | 57.3 | 64.1 | 83.0 | C | True | 68.133333 |
0. The Last Mile of a Data Story
You have run the analysis. You have the numbers: pass rates by school, score distributions by program, trend lines across semesters. Now your manager asks for a report — something to put in front of a stakeholder, not a developer. You open a notebook, call df.head(), and stare at a grey monospace grid with no hierarchy, no colour, no units, and no sense of which numbers matter.
df.head() serves you in a notebook. It does not serve anyone else.
The gap between “I have the result” and “I can communicate the result” is called the last mile of a data story. It is where a lot of analysis work quietly disappears: correct findings, buried in formatting nobody wanted to read. Great Tables (posit-dev.github.io/great-tables) is the Python library that closes that gap. It wraps a pandas DataFrame in a fluent API — one that mirrors R’s {gt} package — and produces publication-ready HTML tables with column spanners, colour scales, embedded sparklines, and controlled footnotes.
1. When to Use a Table Instead of a Chart
A chart compresses a distribution into shape: it shows a trend, a cluster, or an outlier at a glance. A table preserves exact values so a reader can answer a precise question: which course has the highest midterm average? or by how many points does one program outperform another?
Use a table when:
- Readers will look up a specific row or compare two exact values
- The differences between groups are small and a chart would compress them into noise
- A report or stakeholder document needs a citable number, not an impression

Use a chart when:
- You want to show a trend, a distribution, or a relationship across many data points
- The pattern matters more than the individual values
Neither replaces the other. A data storytelling section (Part 7) shows a trend with a chart. A summary report shows the underlying numbers in a table. The combination answers both the what happened and the by exactly how much.
2. Your First Styled Table
Every Great Tables workflow starts with GT(df), which wraps a pandas DataFrame in the Great Tables object. From there you chain methods to add structure and styling. On its own, GT(df) renders a minimal unstyled table. themed_gt() applies the project’s brand (column header background, font, border colours, and alternating row stripes) in one call at the end of the chain.
The first example is a summary of mean scores by gender:
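The cell that builds summary is not reproduced in this text. As a minimal sketch, one way to produce a frame with the same columns is a named-aggregation groupby. The toy rows are hypothetical stand-ins for university_analytics.csv, and defining fail_rate as one minus the mean of passed is an assumption, not something the notebook states.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook reads university_analytics.csv.
df = pd.DataFrame(
    {
        "student_id": ["S1", "S2", "S3", "S4"],
        "gender": ["F", "F", "M", "M"],
        "midterm_score": [60.0, 70.0, 50.0, 80.0],
        "final_score": [55.0, 65.0, 45.0, 75.0],
        "project_score": [70.0, 80.0, 60.0, 90.0],
        "passed": [True, True, False, True],
    }
)

summary = (
    df.groupby("gender", as_index=False)
    .agg(
        n_students=("student_id", "count"),  # counts rows, not unique students
        midterm=("midterm_score", "mean"),
        final=("final_score", "mean"),
        project=("project_score", "mean"),
        pass_rate=("passed", "mean"),        # mean of booleans = share passed
    )
    # Assumed definition: fail_rate is the complement of the pass rate.
    .assign(fail_rate=lambda d: 1 - d["pass_rate"])
    .drop(columns="pass_rate")
    .round(2)
)
print(summary)
```

Named aggregation keeps the output column names short and reader-friendly from the start, which leaves less renaming to do once the frame is handed to GT().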
GT(df) alone already renders a table, but column names are raw and values have no formatting. Wrapping it in themed_gt() applies the brand while .tab_header() adds a title and subtitle:
    table = (
        GT(summary)
        .tab_header(
            title=gt_md("**Mean Exam Scores by Gender**"),
            subtitle="Students with complete score records across all three components",
        )
        .tab_source_note("Source: DS-MLOps university analytics dataset · 2,400 rows")
    )
    themed_gt(table, n_rows=len(summary))
Mean Exam Scores by Gender
Students with complete score records across all three components

| gender | n_students | midterm | final | project | fail_rate |
| --- | --- | --- | --- | --- | --- |
| F | 1170 | 60.85 | 59.72 | 65.63 | 0.04 |
| M | 1128 | 60.93 | 59.71 | 65.07 | 0.04 |
| Other | 102 | 59.86 | 62.15 | 63.57 | 0.03 |

Source: DS-MLOps university analytics dataset · 2,400 rows
Key Concept: The chain always ends with themed_gt()
themed_gt() applies brand-wide options (tab_options) and text styling. Call it last, after all structural methods (tab_header, cols_label, tab_spanner, etc.) so it can apply consistently across everything you have added.
Activity 1 - First Styled Table
Goal: Group df by program instead of gender, compute the same five aggregates, then wrap with GT and themed_gt. Add a title that identifies the program.
# TODO: build program_summary and display with GT + themed_gt...
3. Formatting Values and Labelling Columns
Raw floats in a table communicate false precision: a pass rate of 0.87654 signals noise, not information. Great Tables fmt_* methods format each column’s values to the right precision for its type, and cols_label replaces machine-readable column names with reader-facing ones.
The four formatting methods used most in DS tables:
- fmt_number(columns, decimals): round to decimals places
- fmt_integer(columns): round to whole numbers and add a thousands separator
- fmt_percent(columns, decimals): multiply by 100 and append %
- sub_missing(columns, missing_text): replace NaN with a readable label
Example: fmt_percent turns 0.913 into 91.3%
Without formatting, fail_rate=0.04 reads as a raw proportion. With fmt_percent(columns="fail_rate", decimals=1), the same cell displays as 4.0%, so the reader does not need to mentally multiply by 100.
Pro Tip: sub_missing catches the NaN before the reader sees it
Any column that can contain NaN (a score column with ~3% missing, an optional field) should have sub_missing(columns=…, missing_text=":") added to the chain. A blank cell in a published table is ambiguous: did the student not sit the exam, or did the pipeline drop the value?
Activity 2 - Format the Program Table
Goal: Take the program_summary from Activity 1 and add cols_label, fmt_integer, fmt_number, and fmt_percent to match the formatted table above.
# TODO: add cols_label and fmt_* to your program_summary table...
4. Column Spanners
When a table has several columns that belong to a natural group, for example three score columns or multiple model metrics, a column spanner adds a shared header label above the group. This reduces cognitive load: the reader understands the table structure before reading the individual values.
tab_spanner(label, columns) draws the label above the specified columns. It does not move or reorder columns; it only adds a visual grouping above them.
5. Targeting Cells with loc and tab_style
Conditional styling directs the reader’s eye to the cells that matter: the highest pass rate, the lowest score, an outlier. tab_style applies a visual property and loc specifies exactly where it applies. style is the what, loc is the where.
The most common locations:
- loc.body(columns, rows): specific cells in the data area
- loc.column_labels(): the column header row
- loc.title() / loc.subtitle(): the table header text
rows inside loc.body() accepts an integer index, a list of indices, or a lambda that receives the DataFrame and returns a boolean Series.
Key Concept: loc is a targeting system, not a filter
loc.body(rows=lambda df: df["pass_rate"] == df["pass_rate"].max()) does not subset the table; it identifies which rows receive the styling. The underlying data is unchanged. You can chain multiple tab_style calls; later ones add to earlier ones without overwriting.
Common Mistake: Passing a boolean mask directly to rows
loc.body(rows=course_detail["pass_rate"] == course_detail["pass_rate"].max()) fails because rows in loc.body() does not accept a precomputed boolean Series; it needs a callable that Great Tables evaluates against the table's own data. Always use a lambda: rows=lambda df: df["pass_rate"] == df["pass_rate"].max().
Activity 4 - Highlight the Best Midterm Score
Goal: Take the highlighted table and add a third tab_style call that highlights the midterm cell with the highest value in a light blue (#EAF3FA). Use a lambda for the row selection.
# TODO: add a third tab_style call for the highest midterm value...
6. Summary Rows
A summary row aggregates the entire table into one footer row: a grand mean, a column total, or a count. The reader no longer needs to mentally compute the aggregate, and the table and its summary stay in the same visual unit.
grand_summary_rows(fns) adds these rows. fns is a dict mapping a display label to an aggregation function. In version 0.20, it aggregates all numeric columns in the table, so the DataFrame passed to GT should contain only the columns you want summarised:
    from great_tables import vals as gt_vals  # noqa: F401

    # Use only the score + pass_rate columns so the summary row is meaningful
    course_scores = course_detail.drop(columns=["students"])

    with_summary = (
        GT(course_scores)
        .tab_header(title=gt_md("**Course Summary with Grand Mean**"))
        .cols_label(
            course="Course",
            midterm="Midterm",
            final="Final",
            project="Project",
            pass_rate="Pass Rate",  # noqa: S106
        )
        .tab_spanner(label="Score (0-100)", columns=["midterm", "final", "project"])
        .fmt_number(columns=["midterm", "final", "project"], decimals=1)
        .fmt_percent(columns="pass_rate", decimals=1)  # noqa: S106
        .grand_summary_rows(
            fns={"Mean": lambda x: x.mean(numeric_only=True)},
        )
    )
    themed_gt(with_summary, n_rows=len(course_scores))
Course Summary with Grand Mean

| | Course | Score (0-100): Midterm | Score (0-100): Final | Score (0-100): Project | Pass Rate |
| --- | --- | --- | --- | --- | --- |
| | Data Structures | 61.7 | 59.7 | 66.2 | 96.0% |
| | Databases | 60.9 | 59.6 | 65.0 | 96.0% |
| | Linear Algebra | 60.7 | 59.7 | 64.8 | 96.0% |
| | Machine Learning | 61.0 | 60.8 | 64.8 | 96.0% |
| | Python Programming | 60.5 | 59.4 | 65.3 | 96.0% |
| | Statistics | 60.2 | 59.8 | 65.6 | 97.0% |
| Mean | --- | 60.84666666666667 | 59.82 | 65.27999999999999 | 0.9616666666666666 |
Pro Tip: Shape the DataFrame before passing it to GT
grand_summary_rows aggregates every numeric column in the table. If a count column like students would produce a meaningless mean, drop it before calling GT(): df.drop(columns=["students"]). If the table still includes a string column like the row label, pass numeric_only=True to the aggregation: lambda x: x.mean(numeric_only=True).
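To check what a summary function will produce before wiring it into the table, you can apply it to the DataFrame directly. This is a pandas-only sketch that assumes each callable in fns receives the table's DataFrame, as the lambda in with_summary suggests. The values are the rounded figures from the rendered course table, so the results differ slightly from the unrounded summary row.

```python
import pandas as pd

# Rounded values copied from the rendered course table in this section.
course_scores = pd.DataFrame(
    {
        "course": [
            "Data Structures", "Databases", "Linear Algebra",
            "Machine Learning", "Python Programming", "Statistics",
        ],
        "midterm": [61.7, 60.9, 60.7, 61.0, 60.5, 60.2],
        "final": [59.7, 59.6, 59.7, 60.8, 59.4, 59.8],
        "project": [66.2, 65.0, 64.8, 64.8, 65.3, 65.6],
        "pass_rate": [0.96, 0.96, 0.96, 0.96, 0.96, 0.97],
    }
)

# Same dict shape as the fns argument: display label -> aggregation function.
fns = {"Mean": lambda x: x.mean(numeric_only=True)}

# One row per label; numeric_only=True silently skips the string column.
preview = pd.DataFrame({label: fn(course_scores) for label, fn in fns.items()}).T
print(preview)
```

The course column is absent from preview, which matches the placeholder shown in that cell of the rendered summary row.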
Activity 5 - Add a Min and Max Row
Goal: Extend with_summary to show three summary rows: Min, Max, and Mean across all score columns. Pass a dict with three keys to fns.
7. Model Comparison with metrics_report()
The ark.plot.gt_style module ships metrics_report(), a one-call wrapper that produces a publication-ready model comparison table. It handles formatting, brand styling, and conditional highlighting in a single call.
metrics_report(df, metrics, minimize_cols, maximize_cols) highlights the best value in each metric column in green, whether lower is better (minimize metrics such as MAE and RMSE) or higher is better (maximize metrics such as R² and accuracy). The caller decides which direction is better for each metric; the function does not guess.
Predicting average_marks from study_hours, attendance_pct, and program

| Model | MAE | RMSE | R2 |
| --- | --- | --- | --- |
| Linear Regression | 8.210 | 10.420 | 0.781 |
| Ridge (α=0.1) | 8.090 | 10.310 | 0.784 |
| Ridge (α=1.0) | 7.980 | 10.190 | 0.788 |
| Random Forest | 7.430 | 9.610 | 0.810 |

university_analytics.csv · 5-fold CV · held-out 20% test set
Key Concept: metrics_report highlights by direction, not by rank
minimize_cols highlights the row with the lowest value, which suits error metrics. maximize_cols highlights the row with the highest value, which suits performance metrics. A column can appear in at most one list. If a column appears in neither, it is formatted but not highlighted.
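The direction rule can be sketched in plain pandas. best_rows is a hypothetical helper written for illustration and is not the implementation of metrics_report(), whose source is not shown in this notebook. It picks the lowest value for each minimize column and the highest for each maximize column, using the four models from the comparison table.

```python
import pandas as pd

# Metrics copied from the model comparison table in this section.
comparison = pd.DataFrame(
    {
        "Model": ["Linear Regression", "Ridge (α=0.1)", "Ridge (α=1.0)", "Random Forest"],
        "MAE": [8.21, 8.09, 7.98, 7.43],
        "RMSE": [10.42, 10.31, 10.19, 9.61],
        "R2": [0.781, 0.784, 0.788, 0.810],
    }
)


def best_rows(df, minimize_cols, maximize_cols):
    """Hypothetical helper: map each metric column to the index of its best row."""
    overlap = set(minimize_cols) & set(maximize_cols)
    if overlap:
        # A column can appear in at most one list.
        raise ValueError(f"Columns listed in both directions: {sorted(overlap)}")
    best = {col: int(df[col].idxmin()) for col in minimize_cols}        # lower is better
    best.update({col: int(df[col].idxmax()) for col in maximize_cols})  # higher is better
    return best


best = best_rows(comparison, minimize_cols=["MAE", "RMSE"], maximize_cols=["R2"])
winners = {col: comparison.loc[idx, "Model"] for col, idx in best.items()}
print(winners)
```

Because the caller supplies the direction, an error metric is never mistaken for a score: swapping the two lists here would mark Linear Regression as best on every column.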
Activity 6 - Add a Gradient Boosting Row
Goal: Add a fifth row to comparison ("Gradient Boosting" with MAE=6.91, RMSE=8.84, R2=0.843) and re-run metrics_report(). Confirm the highlighted row updates automatically.
# TODO: add Gradient Boosting row and re-run metrics_report...
Capstone: Course Performance Report
Combine every technique from this notebook into one complete report table. The report should give a department head a single table they can paste into a slide deck.
Capstone Exercise - Course Performance Report
Goal:
Build a report DataFrame grouped by course with columns: students, midterm mean, final mean, project mean, pass rate, and average_marks mean
Wrap with GT. Add a descriptive title and source note
Apply cols_label and the appropriate fmt_* for each column
Add a tab_spanner over the three score columns
Highlight the course with the highest pass rate (green) and lowest pass rate (light red)
Add a grand Mean summary row across all numeric columns
Call themed_gt() last
# Build report DataFrame first, then chain all GT methods in one expression
# TODO: build the complete course performance report...