Part 12: Presenting Data with Great Tables

DS-MLOps Data Analysis

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook assumes you have completed Parts 8-11 (the Data Analysis section). Every example builds on the university_analytics.csv dataset and the ark.plot.gt_style module introduced alongside the plotting chapters.

A plain df.head() output serves you in a notebook. It does not serve a stakeholder, a report, or a slide deck. Great Tables (great_tables) is the Python library that bridges that gap: it wraps a pandas DataFrame in a fluent API and produces publication-ready HTML tables with precise column formatting, readable labels, conditional highlighting, and summary rows, all with no CSS knowledge required.

Callout markers used throughout this notebook are explained on the book cover page.

Learning Objectives

By the end of Part 12 you will be able to:

#	Skill	Covered in
1	Explain when a table is the right choice over a chart	Sec. 1
2	Wrap a DataFrame with `GT()` and apply the project brand with `themed_gt()`	Sec. 2
3	Format numbers and percentages with `fmt_*` methods and replace missing values with `sub_missing`	Sec. 3
4	Add readable column labels with `cols_label` and group related columns with `tab_spanner`	Sec. 4
5	Target cells with `loc` and apply styling with `tab_style`	Sec. 5
6	Add grand summary rows with `grand_summary_rows`	Sec. 6
7	Build a model comparison table with `metrics_report()`	Sec. 7

from great_tables import GT, loc, md as gt_md, style
import numpy as np
import pandas as pd

from ark.plot.gt_style import metrics_report, themed_gt
from ark.plot.tokens import PRIMARY, SUCCESS, SURFACE_MUTED

df = pd.read_csv("data/university_analytics.csv")
df["average_marks"] = (df["midterm_score"] + df["final_score"] + df["project_score"]) / 3
df.head(3)

	student_id	cohort	program	gender	region	guardian	has_internet	course_id	course	semester	enrollment_date	study_hours	attendance_pct	midterm_score	final_score	project_score	final_grade	passed	average_marks
0	S0001	2023	Information Technology	F	South	Father	True	C01	Python Programming	Fall 2023	2023-09-04	10.8	88.6	46.1	54.4	57.8	D	True	52.766667
1	S0001	2023	Information Technology	F	South	Father	True	C02	Statistics	Spring 2024	2024-01-15	16.1	71.5	49.9	64.9	67.0	C	True	60.600000
2	S0001	2023	Information Technology	F	South	Father	True	C03	Data Structures	Fall 2023	2023-09-04	23.6	71.5	57.3	64.1	83.0	C	True	68.133333

0. The Last Mile of a Data Story

You have run the analysis. You have the numbers: pass rates by school, score distributions by program, trend lines across semesters. Now your manager asks for a report — something to put in front of a stakeholder, not a developer. You open a notebook, call df.head(), and stare at a grey monospace grid with no hierarchy, no colour, no units, and no sense of which numbers matter.

df.head() serves you in a notebook. It does not serve anyone else.

The gap between “I have the result” and “I can communicate the result” is called the last mile of a data story. It is where a lot of analysis work quietly disappears: correct findings, buried in formatting nobody wanted to read. Great Tables (posit-dev.github.io/great-tables) is the Python library that closes that gap. It wraps a pandas DataFrame in a fluent API — one that mirrors R’s {gt} package — and produces publication-ready HTML tables with column spanners, colour scales, embedded sparklines, and controlled footnotes.

How it compares to other table tools

Tool	Output	Strengths	When to use
Great Tables (posit-dev.github.io/great-tables)	HTML	Full layout control, colour scales, sparklines, `{gt}`-compatible API	Reports, notebooks, any HTML output
pandas Styler (pandas docs)	HTML	Built-in, no extra install, fast for simple highlighting	Quick colouring when GT is overkill
tabulate (tabulate on PyPI)	Text, Markdown, HTML, LaTeX	Lightweight, great for terminal or Markdown output	CLI output, `.md` files
rich (rich.readthedocs.io)	Terminal	Beautiful terminal tables, progress bars	Terminal-only display
itables (mwouts.github.io/itables)	Interactive HTML	Sort, filter, search in notebook	Exploratory analysis, large tables

Already in your environment

uv add great-tables          # for a standalone project

Official docs: posit-dev.github.io/great-tables/articles/intro

1. When Tables Beat Charts

A chart compresses a distribution into shape: it shows a trend, a cluster, or an outlier at a glance. A table preserves exact values so a reader can answer a precise question: which course has the highest midterm average? or by how many points does one program outperform another?

Use a table when:
- Readers will look up a specific row or compare two exact values
- The differences between groups are small and a chart would compress them into noise
- A report or stakeholder document needs a citable number, not an impression

Use a chart when:
- You want to show a trend, a distribution, or a relationship across many data points
- The pattern matters more than the individual values

Key Concept: Tables serve lookup; charts serve pattern recognition

Neither replaces the other. A data storytelling section (Part 7) shows a trend with a chart. A summary report shows the underlying numbers in a table. The combination answers both the what happened and the by exactly how much.

2. Your First Styled Table

Every Great Tables workflow starts with GT(df), which wraps a pandas DataFrame in the Great Tables object. From there you chain methods to add structure and styling. On its own, GT(df) renders a minimal unstyled table. themed_gt() applies the project’s brand (column header background, font, border colours, and alternating row stripes) in one call at the end of the chain.

The first example is a summary of mean scores by gender:

summary = (
    df.groupby("gender")
    .agg(
        n_students=("student_id", "count"),
        midterm=("midterm_score", "mean"),
        final=("final_score", "mean"),
        project=("project_score", "mean"),
        fail_rate=("passed", lambda x: (~x).mean()),
    )
    .reset_index()
    .round(2)
)
summary

	gender	n_students	midterm	final	project	fail_rate
0	F	1170	60.85	59.72	65.63	0.04
1	M	1128	60.93	59.71	65.07	0.04
2	Other	102	59.86	62.15	63.57	0.03
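The preview of df shows S0001 on three rows, one per course, so "count" on student_id counts enrollment rows rather than distinct students. The sketch below uses a small made-up frame (not the real dataset) to show how "count" and "nunique" differ inside a named aggregation, and what the fail_rate lambda computes. The column name n_rows is an illustrative choice, not something the notebook defines.

```python
import pandas as pd

# Toy enrollment data: one row per student-course pair, like the dataset preview.
enrollments = pd.DataFrame(
    {
        "student_id": ["S0001", "S0001", "S0002", "S0003", "S0003"],
        "gender": ["F", "F", "M", "F", "F"],
        "passed": [True, True, False, True, False],
    }
)

by_gender = (
    enrollments.groupby("gender")
    .agg(
        n_rows=("student_id", "count"),        # enrollment rows in the group
        n_students=("student_id", "nunique"),  # distinct students in the group
        fail_rate=("passed", lambda x: (~x).mean()),  # share of False values
    )
    .reset_index()
)
print(by_gender)
```

If a report needs a head count, "nunique" is the safer aggregate; "count" is the right one when the unit of analysis is the enrollment.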

GT(df) alone already renders a table, but column names are raw and values have no formatting. Wrapping it in themed_gt() applies the brand while .tab_header() adds a title and subtitle:

table = (
    GT(summary)
    .tab_header(
        title=gt_md("**Mean Exam Scores by Gender**"),
        subtitle="Students with complete score records across all three components",
    )
    .tab_source_note("Source: DS-MLOps university analytics dataset · 2,400 rows")
)
themed_gt(table, n_rows=len(summary))

Mean Exam Scores by Gender
Students with complete score records across all three components
gender	n_students	midterm	final	project	fail_rate
F	1170	60.85	59.72	65.63	0.04
M	1128	60.93	59.71	65.07	0.04
Other	102	59.86	62.15	63.57	0.03
Source: DS-MLOps university analytics dataset · 2,400 rows

Key Concept: The chain always ends with themed_gt()

themed_gt() applies brand-wide options (tab_options) and text styling. Call it last, after all structural methods (tab_header, cols_label, tab_spanner, etc.) so it can apply consistently across everything you have added.

Activity 1 - First Styled Table

Goal: Group df by program instead of gender, compute the same five aggregates, then wrap with GT and themed_gt. Add a title that identifies the program.

program_summary = df.groupby("program").agg(...).reset_index().round(2)
GT(program_summary).tab_header(title=gt_md("**...**"), subtitle="...").pipe(themed_gt, n_rows=len(program_summary))

# TODO: build program_summary and display with GT + themed_gt
...

3. Formatting Values and Labelling Columns

Raw floats in a table communicate false precision: a pass rate of 0.87654 signals noise, not information. Great Tables fmt_* methods format each column’s values to the right precision for its type, and cols_label replaces machine-readable column names with reader-facing ones.

The four methods used most in DS tables:
- fmt_number(columns, decimals): round to decimals places
- fmt_integer(columns): round to a whole number and add a thousands separator
- fmt_percent(columns, decimals): multiply by 100 and append %
- sub_missing(columns, missing_text): replace NaN with a readable label (the Python library exposes this as sub_missing, not fmt_missing)

Example: fmt_percent turns 0.913 into 91.3%

Without formatting, fail_rate=0.04 reads as a raw proportion. With fmt_percent(columns="fail_rate", decimals=1), the same cell displays as 4.0%, so the reader does not need to mentally multiply by 100.

formatted = (
    GT(summary)
    .tab_header(
        title=gt_md("**Mean Exam Scores by Gender**"),
        subtitle="All figures rounded to one decimal place",
    )
    .cols_label(
        gender="Gender",
        n_students="Students",
        midterm="Midterm",
        final="Final",
        project="Project",
        fail_rate="Fail Rate",  # noqa: S106
    )
    .fmt_integer(columns="n_students")
    .fmt_number(columns=["midterm", "final", "project"], decimals=1)
    .fmt_percent(columns="fail_rate", decimals=1)  # noqa: S106
    .tab_source_note("Source: DS-MLOps university analytics dataset · 2,400 rows")
)
themed_gt(formatted, n_rows=len(summary))

Mean Exam Scores by Gender
All figures rounded to one decimal place
Gender	Students	Midterm	Final	Project	Fail Rate
F	1,170	60.9	59.7	65.6	4.0%
M	1,128	60.9	59.7	65.1	4.0%
Other	102	59.9	62.1	63.6	3.0%
Source: DS-MLOps university analytics dataset · 2,400 rows
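As a rough mental model, the three fmt_* calls above behave like Python format specifications applied cell by cell. The helper names below are hypothetical and are not part of Great Tables; the sketch only imitates the displayed text so you can sanity-check what a formatted cell should look like.

```python
import math


def as_number(value: float, decimals: int = 1) -> str:
    """Imitate fmt_number: fixed number of decimal places."""
    return f"{value:.{decimals}f}"


def as_integer(value: float) -> str:
    """Imitate fmt_integer: whole number with a thousands separator."""
    return f"{round(value):,}"


def as_percent(value: float, decimals: int = 1) -> str:
    """Imitate fmt_percent: multiply by 100 and append a percent sign."""
    return f"{value:.{decimals}%}"


def or_missing(value: float, missing_text: str = "n/a") -> object:
    """Imitate a missing-value substitution: swap NaN for a label."""
    is_nan = isinstance(value, float) and math.isnan(value)
    return missing_text if is_nan else value


print(as_integer(1170))           # 1,170
print(as_number(65.63))           # 65.6
print(as_percent(0.04))           # 4.0%
print(or_missing(float("nan")))   # n/a
```

The underlying DataFrame keeps its full-precision values in both cases; only the displayed text changes.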

Pro Tip: sub_missing catches the NaN before the reader sees it

Any column that can contain NaN (a score column with ~3% missing, an optional field) should have sub_missing(columns=…, missing_text="n/a") added to the chain. A blank cell in a published table is ambiguous: did the student not sit the exam, or did the pipeline drop the value?

Activity 2 - Format the Program Table

Goal: Take the program_summary from Activity 1 and add cols_label, fmt_integer, fmt_number, and fmt_percent to match the formatted table above.

formatted_programs = (
    GT(program_summary)
    .cols_label(program="Program", n_students="Students", ...)
    .fmt_integer(columns="n_students")
    .fmt_number(columns=[...], decimals=1)
    .fmt_percent(columns="fail_rate", decimals=1)
)
themed_gt(formatted_programs, n_rows=len(program_summary))

# TODO: add cols_label and fmt_* to your program_summary table
...

4. Column Spanners

When a table has several columns that belong to a natural group, for example three score columns or multiple model metrics, a column spanner adds a shared header label above the group. This reduces cognitive load: the reader understands the table structure before reading the individual values.

tab_spanner(label, columns) draws the label above the specified columns. It does not move or reorder columns; it only adds a visual grouping above them.

# Course performance table: one row per course
course_detail = (
    df.groupby("course")
    .agg(
        students=("student_id", "count"),
        midterm=("midterm_score", "mean"),
        final=("final_score", "mean"),
        project=("project_score", "mean"),
        pass_rate=("passed", "mean"),
    )
    .reset_index()
    .round(2)
)
course_detail

	course	students	midterm	final	project	pass_rate
0	Data Structures	400	61.74	59.67	66.24	0.96
1	Databases	400	60.90	59.65	64.97	0.96
2	Linear Algebra	400	60.74	59.67	64.76	0.96
3	Machine Learning	400	61.00	60.78	64.83	0.96
4	Python Programming	400	60.54	59.35	65.29	0.96
5	Statistics	400	60.16	59.80	65.59	0.97

with_spanner = (
    GT(course_detail)
    .tab_header(title=gt_md("**Performance by Course**"))
    .cols_label(
        course="Course",
        students="Students",
        midterm="Midterm",
        final="Final",
        project="Project",
        pass_rate="Pass Rate",  # noqa: S106
    )
    .tab_spanner(label="Score (0-100)", columns=["midterm", "final", "project"])
    .fmt_integer(columns="students")
    .fmt_number(columns=["midterm", "final", "project"], decimals=1)
    .fmt_percent(columns="pass_rate", decimals=1)  # noqa: S106
    .tab_source_note("Source: DS-MLOps university analytics dataset · 2,400 rows")
)
themed_gt(with_spanner, n_rows=len(course_detail))

Performance by Course
		Score (0-100)
Course	Students	Midterm	Final	Project	Pass Rate
Data Structures	400	61.7	59.7	66.2	96.0%
Databases	400	60.9	59.6	65.0	96.0%
Linear Algebra	400	60.7	59.7	64.8	96.0%
Machine Learning	400	61.0	60.8	64.8	96.0%
Python Programming	400	60.5	59.4	65.3	96.0%
Statistics	400	60.2	59.8	65.6	97.0%
Source: DS-MLOps university analytics dataset · 2,400 rows

Activity 3 - Add a Spanner

Goal: Take the formatted_programs table from Activity 2 and add a tab_spanner labelled “Scores (0-100)” over the three score columns.

GT(program_summary)
    ...
    .tab_spanner(label="Scores (0-100)", columns=["midterm", "final", "project"])
    ...

# TODO: add a tab_spanner to the program table
...

5. Conditional Styling

Conditional styling directs the reader’s eye to the cells that matter: the highest pass rate, the lowest score, an outlier. tab_style applies a visual property and loc specifies exactly where it applies. style is the what, loc is the where.

The most common locations:
- loc.body(columns, rows): specific cells in the data area
- loc.column_labels(): the column header row
- loc.title() / loc.subtitle(): the table header text

rows inside loc.body() accepts an integer index, a list of indices, or a lambda that receives the DataFrame and returns a boolean Series.

highlighted = (
    GT(course_detail)
    .tab_header(title=gt_md("**Course Performance: Best and Worst Pass Rate**"))
    .cols_label(
        course="Course",
        students="Students",
        midterm="Midterm",
        final="Final",
        project="Project",
        pass_rate="Pass Rate",  # noqa: S106
    )
    .tab_spanner(label="Score (0-100)", columns=["midterm", "final", "project"])
    .fmt_integer(columns="students")
    .fmt_number(columns=["midterm", "final", "project"], decimals=1)
    .fmt_percent(columns="pass_rate", decimals=1)  # noqa: S106
    .tab_style(
        style=style.fill(color=SUCCESS),
        locations=loc.body(
            columns="pass_rate",
            rows=lambda df_gt: df_gt["pass_rate"] == df_gt["pass_rate"].max(),
        ),
    )
    .tab_style(
        style=style.fill(color="#FEF2F2"),
        locations=loc.body(
            columns="pass_rate",
            rows=lambda df_gt: df_gt["pass_rate"] == df_gt["pass_rate"].min(),
        ),
    )
)
themed_gt(highlighted, n_rows=len(course_detail))

Course Performance: Best and Worst Pass Rate
		Score (0-100)
Course	Students	Midterm	Final	Project	Pass Rate
Data Structures	400	61.7	59.7	66.2	96.0%
Databases	400	60.9	59.6	65.0	96.0%
Linear Algebra	400	60.7	59.7	64.8	96.0%
Machine Learning	400	61.0	60.8	64.8	96.0%
Python Programming	400	60.5	59.4	65.3	96.0%
Statistics	400	60.2	59.8	65.6	97.0%

Key Concept: loc is a targeting system, not a filter

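loc.body(rows=lambda df: df["pass_rate"] == df["pass_rate"].max()) does not subset the table: it identifies which rows receive the styling. The underlying data is unchanged. Every row that satisfies the condition is styled, so ties are all highlighted: in the table above, the five courses that share the rounded minimum of 96.0% all match the second tab_style call. You can chain multiple tab_style calls; later ones add to earlier ones without overwriting.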

Common Mistake: Passing a boolean mask directly to rows

loc.body(rows=course_detail["pass_rate"] == course_detail["pass_rate"].max()) fails because rows inside loc.body() does not accept a precomputed boolean Series. It needs row indices or a callable that receives the table's DataFrame and returns the mask. Always use a lambda: rows=lambda df: df["pass_rate"] == df["pass_rate"].max().
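To see what the row selector hands back, you can call the same function on the DataFrame yourself. The sketch below retypes the pass_rate column from the course_detail output shown earlier; the function names is_best and is_worst are illustrative and do not come from the notebook.

```python
import pandas as pd

# pass_rate values copied from the rounded course_detail output.
course_detail = pd.DataFrame(
    {
        "course": [
            "Data Structures",
            "Databases",
            "Linear Algebra",
            "Machine Learning",
            "Python Programming",
            "Statistics",
        ],
        "pass_rate": [0.96, 0.96, 0.96, 0.96, 0.96, 0.97],
    }
)


def is_best(frame: pd.DataFrame) -> pd.Series:
    """Boolean mask: True where pass_rate equals the column maximum."""
    return frame["pass_rate"] == frame["pass_rate"].max()


def is_worst(frame: pd.DataFrame) -> pd.Series:
    """Boolean mask: True where pass_rate equals the column minimum."""
    return frame["pass_rate"] == frame["pass_rate"].min()


best_mask = is_best(course_detail)
worst_mask = is_worst(course_detail)

print(course_detail.loc[best_mask, "course"].tolist())  # ['Statistics']
print(int(worst_mask.sum()))                            # 5
```

Five rows tie for the minimum because the values were rounded to two decimals before the comparison. Comparing on the unrounded means, and rounding only for display, is one way to avoid highlighting most of the column.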

Activity 4 - Highlight the Best Midterm Score

Goal: Take the highlighted table and add a third tab_style call that highlights the midterm cell with the highest value in a light blue (#EAF3FA). Use a lambda for the row selection.

.tab_style(
    style=style.fill(color="#EAF3FA"),
    locations=loc.body(
        columns="midterm",
        rows=lambda df: df["midterm"] == df["midterm"].max(),
    ),
)

# TODO: add a third tab_style call for the highest midterm value
...

6. Summary Rows

A summary row aggregates the entire table into one footer row: a grand mean, a column total, or a count. The reader no longer needs to mentally compute the aggregate, and the table and its summary stay in the same visual unit.

grand_summary_rows(fns) adds these rows. fns is a dict mapping a display label to an aggregation function. As of Great Tables 0.20, it aggregates all numeric columns in the table, so the DataFrame passed to GT should contain only the columns you want summarised:

from great_tables import vals as gt_vals  # noqa: F401

# Use only the score + pass_rate columns so the summary row is meaningful
course_scores = course_detail.drop(columns=["students"])

with_summary = (
    GT(course_scores)
    .tab_header(title=gt_md("**Course Summary with Grand Mean**"))
    .cols_label(
        course="Course",
        midterm="Midterm",
        final="Final",
        project="Project",
        pass_rate="Pass Rate",  # noqa: S106
    )
    .tab_spanner(label="Score (0-100)", columns=["midterm", "final", "project"])
    .fmt_number(columns=["midterm", "final", "project"], decimals=1)
    .fmt_percent(columns="pass_rate", decimals=1)  # noqa: S106
    .grand_summary_rows(
        fns={"Mean": lambda x: x.mean(numeric_only=True)},
    )
)
themed_gt(with_summary, n_rows=len(course_scores))

Course Summary with Grand Mean
		Score (0-100)
	Course	Midterm	Final	Project	Pass Rate
	Data Structures	61.7	59.7	66.2	96.0%
	Databases	60.9	59.6	65.0	96.0%
	Linear Algebra	60.7	59.7	64.8	96.0%
	Machine Learning	61.0	60.8	64.8	96.0%
	Python Programming	60.5	59.4	65.3	96.0%
	Statistics	60.2	59.8	65.6	97.0%
Mean	---	60.84666666666667	59.82	65.27999999999999	0.9616666666666666

Pro Tip: Shape the DataFrame before passing it to GT

grand_summary_rows aggregates every numeric column in the table. If a count column like students would produce a meaningless mean, drop it before calling GT(): df.drop(columns=["students"]). If the table still includes a string column like the row label, pass numeric_only=True to the aggregation: lambda x: x.mean(numeric_only=True). Note that the Mean row above prints at full float precision: the fmt_number and fmt_percent calls in the chain did not reach the summary cells.
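The aggregation function in fns is an ordinary callable on a DataFrame, so you can run it outside Great Tables to preview the summary values. The sketch below retypes the rounded course_detail values shown earlier (minus the students column) and assumes nothing about how grand_summary_rows renders the result.

```python
import pandas as pd

# Values copied from the rounded course_detail output, without "students".
course_scores = pd.DataFrame(
    {
        "course": [
            "Data Structures",
            "Databases",
            "Linear Algebra",
            "Machine Learning",
            "Python Programming",
            "Statistics",
        ],
        "midterm": [61.74, 60.90, 60.74, 61.00, 60.54, 60.16],
        "final": [59.67, 59.65, 59.67, 60.78, 59.35, 59.80],
        "project": [66.24, 64.97, 64.76, 64.83, 65.29, 65.59],
        "pass_rate": [0.96, 0.96, 0.96, 0.96, 0.96, 0.97],
    }
)

# Same dict shape as the fns argument: display label -> aggregation callable.
fns = {"Mean": lambda x: x.mean(numeric_only=True)}

# Apply each callable to the whole frame; numeric_only skips the "course" strings.
preview = {label: fn(course_scores) for label, fn in fns.items()}
print(preview["Mean"].round(2))
```

Previewing the values this way makes it easy to spot a column that should not be averaged before it reaches the published table.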

Activity 5 - Add a Min and Max Row

Goal: Extend with_summary to show three summary rows: Min, Max, and Mean across all score columns. Pass a dict with three keys to fns.

fns={"Min": lambda x: x.min(numeric_only=True), "Max": lambda x: x.max(numeric_only=True), "Mean": lambda x: x.mean(numeric_only=True)}

# TODO: add Min/Max/Mean grand summary rows
...

7. Model Comparison with `metrics_report()`

The ark.plot.gt_style module ships metrics_report(), a one-call wrapper that produces a publication-ready model comparison table. It handles formatting, brand styling, and conditional highlighting in a single call.

metrics_report(df, metrics, minimize_cols, maximize_cols) highlights the best value in each metric column in green: the lowest value for columns in minimize_cols (lower is better: MAE, RMSE) and the highest value for columns in maximize_cols (higher is better: R², accuracy). The caller decides which direction is better for each metric; the function does not guess.

comparison = pd.DataFrame(
    {
        "Model": ["Linear Regression", "Ridge (α=0.1)", "Ridge (α=1.0)", "Random Forest"],
        "MAE": [8.21, 8.09, 7.98, 7.43],
        "RMSE": [10.42, 10.31, 10.19, 9.61],
        "R2": [0.781, 0.784, 0.788, 0.810],
    }
)

metrics_report(
    comparison,
    metrics=["MAE", "RMSE", "R2"],
    minimize_cols=["MAE", "RMSE"],
    maximize_cols=["R2"],
    title="Grade Prediction: Model Comparison",
    subtitle="Predicting average_marks from study_hours, attendance_pct, and program",
    source_note="university_analytics.csv · 5-fold CV · held-out 20% test set",
)

Grade Prediction: Model Comparison
Predicting average_marks from study_hours, attendance_pct, and program
Model	MAE	RMSE	R2
Linear Regression	8.210	10.420	0.781
Ridge (α=0.1)	8.090	10.310	0.784
Ridge (α=1.0)	7.980	10.190	0.788
Random Forest	7.430	9.610	0.810
university_analytics.csv · 5-fold CV · held-out 20% test set

Key Concept: metrics_report highlights by direction, not by rank

minimize_cols highlights the row with the lowest value, which is the best one for error metrics. maximize_cols highlights the row with the highest value, which is the best one for performance metrics. A column can appear in at most one list. If a column appears in neither, it is formatted but not highlighted.
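One way to reason about directional highlighting is with idxmin and idxmax. The helper below, best_index, is a hypothetical sketch and not the actual implementation of metrics_report(), which this notebook does not show. It reuses the comparison values from the example above.

```python
import pandas as pd

comparison = pd.DataFrame(
    {
        "Model": ["Linear Regression", "Ridge (α=0.1)", "Ridge (α=1.0)", "Random Forest"],
        "MAE": [8.21, 8.09, 7.98, 7.43],
        "RMSE": [10.42, 10.31, 10.19, 9.61],
        "R2": [0.781, 0.784, 0.788, 0.810],
    }
)


def best_index(
    frame: pd.DataFrame,
    minimize_cols: list[str],
    maximize_cols: list[str],
) -> dict[str, int]:
    """Hypothetical helper: map each metric to the row index of its best value."""
    overlap = set(minimize_cols) & set(maximize_cols)
    if overlap:
        raise ValueError(f"Columns listed in both directions: {sorted(overlap)}")
    best = {col: frame[col].idxmin() for col in minimize_cols}
    best.update({col: frame[col].idxmax() for col in maximize_cols})
    return best


best = best_index(comparison, minimize_cols=["MAE", "RMSE"], maximize_cols=["R2"])
winners = {col: comparison.loc[idx, "Model"] for col, idx in best.items()}
print(winners)
```

Keeping the direction explicit in the call means a new metric cannot be highlighted the wrong way round by accident; a column left out of both lists is simply not considered.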

Activity 6 - Add a Gradient Boosting Row

Goal: Add a fifth row to comparison (“Gradient Boosting” with MAE=6.91, RMSE=8.84, R2=0.843) and re-run metrics_report(). Confirm the highlighted row updates automatically.

comparison = pd.concat([comparison, pd.DataFrame([{"Model": "Gradient Boosting", "MAE": 6.91, "RMSE": 8.84, "R2": 0.843}])], ignore_index=True)
metrics_report(comparison, metrics=[...], minimize_cols=[...], maximize_cols=[...], ...)

# TODO: add Gradient Boosting row and re-run metrics_report
...

Capstone: Course Performance Report

Combine every technique from this notebook into one complete report table. The report should give a department head a single table they can paste into a slide deck.

Capstone Exercise - Course Performance Report

Goal:

Build a report DataFrame grouped by course with columns: students, midterm mean, final mean, project mean, pass rate, and average_marks mean
Wrap with GT. Add a descriptive title and source note
Apply cols_label and the appropriate fmt_* for each column
Add a tab_spanner over the three score columns
Highlight the course with the highest pass rate (green) and lowest pass rate (light red)
Add a grand Mean summary row across all numeric columns
Call themed_gt() last

# Build report DataFrame first, then chain all GT methods in one expression

# TODO: build the complete course performance report
...

Resource	Why it matters
Great Tables documentation	Complete API reference with rendered examples for every method
Great Tables: `loc` reference	Full list of location helpers: `loc.body`, `loc.column_labels`, `loc.spanner_labels`, `loc.grand_summary`
Great Tables blog: Python tables	Worked examples including financial reports and ML comparison tables
Knaflic, C.N. (2015). Storytelling with Data. Wiley.	Chapter 2 covers when tables serve communication better than charts
pandas `GroupBy.agg` reference	Named aggregations (`col=(src, func)`) used throughout this notebook

Summary

GT method	What it does
`GT(df)`	Wrap a DataFrame and begin the method chain
`themed_gt(table, n_rows=n)`	Apply project brand: header colours, font, striped rows. Call last.
`tab_header(title, subtitle)`	Add a title row above the column headers
`tab_source_note(text)`	Add an attribution line below the table
`cols_label(**kwargs)`	Replace column names with reader-facing labels
`fmt_number(columns, decimals)`	Round floats to `decimals` places
`fmt_integer(columns)`	Remove decimal, add thousands separator
`fmt_percent(columns, decimals)`	Multiply by 100 and append `%`
`sub_missing(columns, missing_text)`	Replace `NaN` with a readable placeholder
`tab_spanner(label, columns)`	Group related columns under a shared header label
`tab_style(style, locations)`	Apply a visual property (`fill`, `text`) to a location (`loc.body`, `loc.column_labels`)
`loc.body(columns, rows)`	Target specific cells; `rows` takes an index or a lambda
`grand_summary_rows(fns)`	Add one summary row per key in `fns`; aggregates all numeric columns
`metrics_report(df, metrics, ...)`	One-call ML comparison table with directional highlighting

Next: Part 3: Dev Tools covers the professional toolchain: uv, ruff, type annotations, git, pytest, and pre-commit.