Part 12: Presenting Data with Great Tables

DS-MLOps Data Analysis

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook assumes you have completed Parts 8-11 (the Data Analysis section). Every example builds on the university_analytics.csv dataset and the ark.plot.gt_style module introduced alongside the plotting chapters.

A plain df.head() output serves you in a notebook. It does not serve a stakeholder, a report, or a slide deck. Great Tables (great_tables) is the Python library that bridges that gap: it wraps a pandas DataFrame in a fluent API and produces publication-ready HTML tables with precise column formatting, readable labels, conditional highlighting, and summary rows, all without writing any CSS.

Callout markers used throughout this notebook are explained on the book cover page.

By the end of Part 12 you will be able to:

| # | Skill | Covered in |
|---|-------|------------|
| 1 | Explain when a table is the right choice over a chart | Sec. 1 |
| 2 | Wrap a DataFrame with GT() and apply the project brand with themed_gt() | Sec. 2 |
| 3 | Format numbers and percentages with fmt_* methods and replace missing values with sub_missing | Sec. 3 |
| 4 | Add readable column labels with cols_label and group related columns with tab_spanner | Sec. 4 |
| 5 | Target cells with loc and apply styling with tab_style | Sec. 5 |
| 6 | Add grand summary rows with grand_summary_rows | Sec. 6 |
| 7 | Build a model comparison table with metrics_report() | Sec. 7 |

from great_tables import GT, loc, md as gt_md, style
import numpy as np
import pandas as pd

from ark.plot.gt_style import metrics_report, themed_gt
from ark.plot.tokens import PRIMARY, SUCCESS, SURFACE_MUTED

df = pd.read_csv("data/university_analytics.csv")
df["average_marks"] = (df["midterm_score"] + df["final_score"] + df["project_score"]) / 3
df.head(3)
student_id cohort program gender region guardian has_internet course_id course semester enrollment_date study_hours attendance_pct midterm_score final_score project_score final_grade passed average_marks
0 S0001 2023 Information Technology F South Father True C01 Python Programming Fall 2023 2023-09-04 10.8 88.6 46.1 54.4 57.8 D True 52.766667
1 S0001 2023 Information Technology F South Father True C02 Statistics Spring 2024 2024-01-15 16.1 71.5 49.9 64.9 67.0 C True 60.600000
2 S0001 2023 Information Technology F South Father True C03 Data Structures Fall 2023 2023-09-04 23.6 71.5 57.3 64.1 83.0 C True 68.133333

0. The Last Mile of a Data Story

You have run the analysis. You have the numbers: pass rates by school, score distributions by program, trend lines across semesters. Now your manager asks for a report — something to put in front of a stakeholder, not a developer. You open a notebook, call df.head(), and stare at a grey monospace grid with no hierarchy, no colour, no units, and no sense of which numbers matter.

df.head() serves you in a notebook. It does not serve anyone else.

The gap between “I have the result” and “I can communicate the result” is called the last mile of a data story. It is where a lot of analysis work quietly disappears: correct findings, buried in formatting nobody wanted to read. Great Tables (posit-dev.github.io/great-tables) is the Python library that closes that gap. It wraps a pandas DataFrame in a fluent API — one that mirrors R’s {gt} package — and produces publication-ready HTML tables with column spanners, colour scales, embedded sparklines, and controlled footnotes.

How it compares to other table tools

| Tool | Output | Strengths | When to use |
|------|--------|-----------|-------------|
| Great Tables (posit-dev.github.io/great-tables) | HTML | Full layout control, colour scales, sparklines, {gt}-compatible API | Reports, notebooks, any HTML output |
| pandas Styler (pandas docs) | HTML | Built-in, no extra install, fast for simple highlighting | Quick colouring when GT is overkill |
| tabulate (tabulate on PyPI) | Text, Markdown, HTML, LaTeX | Lightweight, great for terminal or Markdown output | CLI output, .md files |
| rich (rich.readthedocs.io) | Terminal | Beautiful terminal tables, progress bars | Terminal-only display |
| itables (mwouts.github.io/itables) | Interactive HTML | Sort, filter, search in notebook | Exploratory analysis, large tables |

Already in your environment

uv add great-tables          # for a standalone project

Official docs: posit-dev.github.io/great-tables/articles/intro

1. When Tables Beat Charts

A chart compresses a distribution into shape: it shows a trend, a cluster, or an outlier at a glance. A table preserves exact values so a reader can answer a precise question: which course has the highest midterm average? or by how many points does one program outperform another?

Use a table when:
- Readers will look up a specific row or compare two exact values
- The differences between groups are small and a chart would compress them into noise
- A report or stakeholder document needs a citable number, not an impression

Use a chart when:
- You want to show a trend, a distribution, or a relationship across many data points
- The pattern matters more than the individual values

Key Concept: Tables serve lookup; charts serve pattern recognition

Neither replaces the other. A data storytelling section (Part 7) shows a trend with a chart. A summary report shows the underlying numbers in a table. The combination answers both the what happened and the by exactly how much.

2. Your First Styled Table

Every Great Tables workflow starts with GT(df), which wraps a pandas DataFrame in the Great Tables object. From there you chain methods to add structure and styling. On its own, GT(df) renders a minimal unstyled table. themed_gt() applies the project’s brand (column header background, font, border colours, and alternating row stripes) in one call at the end of the chain.

The first example is a summary of mean scores by gender:

summary = (
    df.groupby("gender")
    .agg(
        n_students=("student_id", "count"),
        midterm=("midterm_score", "mean"),
        final=("final_score", "mean"),
        project=("project_score", "mean"),
        fail_rate=("passed", lambda x: (~x).mean()),
    )
    .reset_index()
    .round(2)
)
summary
gender n_students midterm final project fail_rate
0 F 1170 60.85 59.72 65.63 0.04
1 M 1128 60.93 59.71 65.07 0.04
2 Other 102 59.86 62.15 63.57 0.03
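The fail_rate aggregate above relies on passed being a boolean column. The sketch below uses a made-up four-value Series (not the real dataset) to show what (~x).mean() computes, and why 1 - x.mean() is a safer equivalent if the column were ever stored as 0/1 integers.

```python
import pandas as pd

# Toy boolean column; the real `passed` column is assumed to load as bool.
passed = pd.Series([True, True, False, True])

# On a bool Series, ~ flips True/False, so the mean is the share of failures.
fail_rate = (~passed).mean()
print(fail_rate)  # 0.25

# On an integer Series, ~ is bitwise NOT: ~1 == -2 and ~0 == -1.
as_int = passed.astype(int)
print((~as_int).tolist())  # [-2, -2, -1, -2]

# Subtracting the pass rate from 1 gives the same answer for either dtype.
safe_fail_rate = 1 - as_int.mean()
print(safe_fail_rate)  # 0.25
```

If the CSV ever changes so that passed arrives as integers or strings, the second form keeps the summary correct.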

GT(df) alone already renders a table, but column names are raw and values have no formatting. Wrapping it in themed_gt() applies the brand while .tab_header() adds a title and subtitle:

table = (
    GT(summary)
    .tab_header(
        title=gt_md("**Mean Exam Scores by Gender**"),
        subtitle="Students with complete score records across all three components",
    )
    .tab_source_note("Source: DS-MLOps university analytics dataset · 2,400 rows")
)
themed_gt(table, n_rows=len(summary))
Mean Exam Scores by Gender
Students with complete score records across all three components
gender n_students midterm final project fail_rate
F 1170 60.85 59.72 65.63 0.04
M 1128 60.93 59.71 65.07 0.04
Other 102 59.86 62.15 63.57 0.03
Source: DS-MLOps university analytics dataset · 2,400 rows

Key Concept: The chain always ends with themed_gt()

themed_gt() applies brand-wide options (tab_options) and text styling. Call it last, after all structural methods (tab_header, cols_label, tab_spanner, etc.) so it can apply consistently across everything you have added.

Activity 1 - First Styled Table

Goal: Group df by program instead of gender, compute the same five aggregates, then wrap with GT and themed_gt. Add a title that identifies the program.
program_summary = df.groupby("program").agg(...).reset_index().round(2)
GT(program_summary).tab_header(title=gt_md("**...**"), subtitle="...").pipe(themed_gt, n_rows=len(program_summary))
# TODO: build program_summary and display with GT + themed_gt
...

3. Formatting Values and Labelling Columns

Raw floats in a table communicate false precision: a pass rate of 0.87654 signals noise, not information. Great Tables fmt_* methods format each column’s values to the right precision for its type, and cols_label replaces machine-readable column names with reader-facing ones.

The four value-formatting methods used most in DS tables:
- fmt_number(columns, decimals): round to decimals places
- fmt_integer(columns): round to a whole number and add a thousands separator
- fmt_percent(columns, decimals): multiply by 100 and append %
- sub_missing(columns, missing_text): replace NaN with a readable label

Example: fmt_percent turns 0.913 into 91.3%

Without formatting, fail_rate=0.04 reads as a raw proportion. With fmt_percent(columns="fail_rate", decimals=1), the same cell displays as 4.0%, so the reader does not need to mentally multiply by 100.
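The display rules described above can be approximated with Python's own format specifiers. This is only a sketch of the text each formatter ends up showing, not how Great Tables implements it, and the helper names are hypothetical.

```python
# Plain-Python approximations of the displayed text.
# The helper names are illustrative and are not part of Great Tables.


def show_number(value: float, decimals: int = 1) -> str:
    """Fixed number of decimals with a thousands separator."""
    return f"{value:,.{decimals}f}"


def show_integer(value: float) -> str:
    """Whole number with a thousands separator."""
    return f"{round(value):,}"


def show_percent(value: float, decimals: int = 1) -> str:
    """Proportion multiplied by 100, followed by a percent sign."""
    return f"{value * 100:.{decimals}f}%"


print(show_integer(1170))   # 1,170
print(show_number(60.93))   # 60.9
print(show_percent(0.913))  # 91.3%
print(show_percent(0.04))   # 4.0%
```

The point of the sketch is that formatting changes only the displayed string; the underlying values in the DataFrame keep their full precision.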

formatted = (
    GT(summary)
    .tab_header(
        title=gt_md("**Mean Exam Scores by Gender**"),
        subtitle="All figures rounded to one decimal place",
    )
    .cols_label(
        gender="Gender",
        n_students="Students",
        midterm="Midterm",
        final="Final",
        project="Project",
        fail_rate="Fail Rate",  # noqa: S106
    )
    .fmt_integer(columns="n_students")
    .fmt_number(columns=["midterm", "final", "project"], decimals=1)
    .fmt_percent(columns="fail_rate", decimals=1)  # noqa: S106
    .tab_source_note("Source: DS-MLOps university analytics dataset · 2,400 rows")
)
themed_gt(formatted, n_rows=len(summary))
Mean Exam Scores by Gender
All figures rounded to one decimal place
Gender Students Midterm Final Project Fail Rate
F 1,170 60.9 59.7 65.6 4.0%
M 1,128 60.9 59.7 65.1 4.0%
Other 102 59.9 62.1 63.6 3.0%
Source: DS-MLOps university analytics dataset · 2,400 rows

Pro Tip: sub_missing catches the NaN before the reader sees it

Any column that can contain NaN (a score column with ~3% missing, an optional field) should have sub_missing(columns=…, missing_text="n/a") added to the chain. A blank cell in a published table is ambiguous: did the student not sit the exam, or did the pipeline drop the value?
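Before adding a missing-value placeholder, it helps to know which columns actually contain NaN. A minimal pandas sketch, using a made-up three-row frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a summary table; the values are made up.
scores = pd.DataFrame(
    {
        "course": ["Statistics", "Databases", "Linear Algebra"],
        "midterm": [60.2, np.nan, 60.7],
        "final": [59.8, 59.6, 59.7],
    }
)

# Share of missing values per column.
missing_share = scores.isna().mean()

# Columns with at least one NaN are the ones that need a placeholder.
needs_placeholder = missing_share[missing_share > 0].index.tolist()
print(needs_placeholder)  # ['midterm']
```

Passing that list as the columns argument of the missing-value method keeps the placeholder limited to columns where a gap is actually possible.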

Activity 2 - Format the Program Table

Goal: Take the program_summary from Activity 1 and add cols_label, fmt_integer, fmt_number, and fmt_percent to match the formatted table above.
formatted_programs = (
    GT(program_summary)
    .cols_label(program="Program", n_students="Students", ...)
    .fmt_integer(columns="n_students")
    .fmt_number(columns=[...], decimals=1)
    .fmt_percent(columns="fail_rate", decimals=1)
)
themed_gt(formatted_programs, n_rows=len(program_summary))
# TODO: add cols_label and fmt_* to your program_summary table
...

4. Column Spanners

When a table has several columns that belong to a natural group, for example three score columns or multiple model metrics, a column spanner adds a shared header label above the group. This reduces cognitive load: the reader understands the table structure before reading the individual values.

tab_spanner(label, columns) draws the label above the specified columns. When those columns are already adjacent, as in the example below, their order is unchanged; the spanner only adds a visual grouping above them.

# Course performance table: one row per course
course_detail = (
    df.groupby("course")
    .agg(
        students=("student_id", "count"),
        midterm=("midterm_score", "mean"),
        final=("final_score", "mean"),
        project=("project_score", "mean"),
        pass_rate=("passed", "mean"),
    )
    .reset_index()
    .round(2)
)
course_detail
course students midterm final project pass_rate
0 Data Structures 400 61.74 59.67 66.24 0.96
1 Databases 400 60.90 59.65 64.97 0.96
2 Linear Algebra 400 60.74 59.67 64.76 0.96
3 Machine Learning 400 61.00 60.78 64.83 0.96
4 Python Programming 400 60.54 59.35 65.29 0.96
5 Statistics 400 60.16 59.80 65.59 0.97
with_spanner = (
    GT(course_detail)
    .tab_header(title=gt_md("**Performance by Course**"))
    .cols_label(
        course="Course",
        students="Students",
        midterm="Midterm",
        final="Final",
        project="Project",
        pass_rate="Pass Rate",  # noqa: S106
    )
    .tab_spanner(label="Score (0-100)", columns=["midterm", "final", "project"])
    .fmt_integer(columns="students")
    .fmt_number(columns=["midterm", "final", "project"], decimals=1)
    .fmt_percent(columns="pass_rate", decimals=1)  # noqa: S106
    .tab_source_note("Source: DS-MLOps university analytics dataset · 2,400 rows")
)
themed_gt(with_spanner, n_rows=len(course_detail))
Performance by Course
Course Students Score (0-100) Pass Rate
Midterm Final Project
Data Structures 400 61.7 59.7 66.2 96.0%
Databases 400 60.9 59.6 65.0 96.0%
Linear Algebra 400 60.7 59.7 64.8 96.0%
Machine Learning 400 61.0 60.8 64.8 96.0%
Python Programming 400 60.5 59.4 65.3 96.0%
Statistics 400 60.2 59.8 65.6 97.0%
Source: DS-MLOps university analytics dataset · 2,400 rows
Activity 3 - Add a Spanner

Goal: Take the formatted_programs table from Activity 2 and add a tab_spanner labelled “Scores (0-100)” over the three score columns.
GT(program_summary)
    ...
    .tab_spanner(label="Scores (0-100)", columns=["midterm", "final", "project"])
    ...
# TODO: add a tab_spanner to the program table
...

5. Conditional Styling

Conditional styling directs the reader’s eye to the cells that matter: the highest pass rate, the lowest score, an outlier. tab_style applies a visual property and loc specifies exactly where it applies. style is the what, loc is the where.

The most common locations:
- loc.body(columns, rows): specific cells in the data area
- loc.column_labels(): the column header row
- loc.title() / loc.subtitle(): the table header text

rows inside loc.body() accepts an integer index, a list of indices, or a lambda that receives the DataFrame and returns a boolean Series.

highlighted = (
    GT(course_detail)
    .tab_header(title=gt_md("**Course Performance: Best and Worst Pass Rate**"))
    .cols_label(
        course="Course",
        students="Students",
        midterm="Midterm",
        final="Final",
        project="Project",
        pass_rate="Pass Rate",  # noqa: S106
    )
    .tab_spanner(label="Score (0-100)", columns=["midterm", "final", "project"])
    .fmt_integer(columns="students")
    .fmt_number(columns=["midterm", "final", "project"], decimals=1)
    .fmt_percent(columns="pass_rate", decimals=1)  # noqa: S106
    .tab_style(
        style=style.fill(color=SUCCESS),
        locations=loc.body(
            columns="pass_rate",
            rows=lambda df_gt: df_gt["pass_rate"] == df_gt["pass_rate"].max(),
        ),
    )
    .tab_style(
        style=style.fill(color="#FEF2F2"),
        locations=loc.body(
            columns="pass_rate",
            rows=lambda df_gt: df_gt["pass_rate"] == df_gt["pass_rate"].min(),
        ),
    )
)
themed_gt(highlighted, n_rows=len(course_detail))
Course Performance: Best and Worst Pass Rate
Course Students Score (0-100) Pass Rate
Midterm Final Project
Data Structures 400 61.7 59.7 66.2 96.0%
Databases 400 60.9 59.6 65.0 96.0%
Linear Algebra 400 60.7 59.7 64.8 96.0%
Machine Learning 400 61.0 60.8 64.8 96.0%
Python Programming 400 60.5 59.4 65.3 96.0%
Statistics 400 60.2 59.8 65.6 97.0%

Key Concept: loc is a targeting system, not a filter

loc.body(rows=lambda df: df["pass_rate"] == df["pass_rate"].max()) does not subset the table; it identifies which rows receive the styling. The underlying data is unchanged. You can chain multiple tab_style calls; later ones add to earlier ones without overwriting.

Common Mistake: Passing a boolean mask directly to rows

loc.body(rows=course_detail["pass_rate"] == course_detail["pass_rate"].max()) fails because rows does not accept a precomputed boolean Series. It needs a callable that Great Tables evaluates against the table's own data. Always use a lambda: rows=lambda df: df["pass_rate"] == df["pass_rate"].max().
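To see which rows a row predicate would target, you can call it by hand on the DataFrame. The sketch below uses plain pandas with the pass rates copied from the course table above. The function names are hypothetical, and nothing here calls Great Tables itself.

```python
import pandas as pd

# Pass rates copied from the course table above (already rounded to 2 decimals).
course_detail = pd.DataFrame(
    {
        "course": [
            "Data Structures", "Databases", "Linear Algebra",
            "Machine Learning", "Python Programming", "Statistics",
        ],
        "pass_rate": [0.96, 0.96, 0.96, 0.96, 0.96, 0.97],
    }
)


def best_rows(d: pd.DataFrame) -> pd.Series:
    """Boolean Series: True where pass_rate equals the column maximum."""
    return d["pass_rate"] == d["pass_rate"].max()


def worst_rows(d: pd.DataFrame) -> pd.Series:
    """Boolean Series: True where pass_rate equals the column minimum."""
    return d["pass_rate"] == d["pass_rate"].min()


best_courses = course_detail.loc[best_rows(course_detail), "course"].tolist()
n_worst = int(worst_rows(course_detail).sum())

print(best_courses)  # ['Statistics']
print(n_worst)       # 5

# The predicate only selects rows; the frame itself still has all six.
print(len(course_detail))  # 6
```

With pass rates rounded to two decimals, five courses tie for the minimum, so an equality test against .min() matches all of them. Evaluating the predicate on unrounded values is one way to narrow the match, depending on what the table should say.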

Activity 4 - Highlight the Best Midterm Score

Goal: Take the highlighted table and add a third tab_style call that highlights the midterm cell with the highest value in a light blue (#EAF3FA). Use a lambda for the row selection.
.tab_style(
    style=style.fill(color="#EAF3FA"),
    locations=loc.body(
        columns="midterm",
        rows=lambda df: df["midterm"] == df["midterm"].max(),
    ),
)
# TODO: add a third tab_style call for the highest midterm value
...

6. Summary Rows

A summary row aggregates the entire table into one footer row: a grand mean, a column total, or a count. The reader no longer needs to mentally compute the aggregate, and the table and its summary stay in the same visual unit.

grand_summary_rows(fns) adds these rows. fns is a dict mapping a display label to an aggregation function. In version 0.20, it aggregates all numeric columns in the table, so the DataFrame passed to GT should contain only the columns you want summarised:

from great_tables import vals as gt_vals  # noqa: F401

# Use only the score + pass_rate columns so the summary row is meaningful
course_scores = course_detail.drop(columns=["students"])

with_summary = (
    GT(course_scores)
    .tab_header(title=gt_md("**Course Summary with Grand Mean**"))
    .cols_label(
        course="Course",
        midterm="Midterm",
        final="Final",
        project="Project",
        pass_rate="Pass Rate",  # noqa: S106
    )
    .tab_spanner(label="Score (0-100)", columns=["midterm", "final", "project"])
    .fmt_number(columns=["midterm", "final", "project"], decimals=1)
    .fmt_percent(columns="pass_rate", decimals=1)  # noqa: S106
    .grand_summary_rows(
        fns={"Mean": lambda x: x.mean(numeric_only=True)},
    )
)
themed_gt(with_summary, n_rows=len(course_scores))
Course Summary with Grand Mean
Course Score (0-100) Pass Rate
Midterm Final Project
  Data Structures 61.7 59.7 66.2 96.0%
  Databases 60.9 59.6 65.0 96.0%
  Linear Algebra 60.7 59.7 64.8 96.0%
  Machine Learning 61.0 60.8 64.8 96.0%
  Python Programming 60.5 59.4 65.3 96.0%
  Statistics 60.2 59.8 65.6 97.0%
Mean --- 60.84666666666667 59.82 65.27999999999999 0.9616666666666666

Pro Tip: Shape the DataFrame before passing it to GT

grand_summary_rows aggregates every numeric column in the table. If a count column like students would produce a meaningless mean, drop it before calling GT(): df.drop(columns=["students"]). If the table still includes a string column like the row label, pass numeric_only=True to the aggregation: lambda x: x.mean(numeric_only=True).
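The aggregation function itself can be checked in plain pandas before it goes into a table. This sketch applies the same mean with numeric_only=True to a frame rebuilt from the course values shown above; mean_fn is a hypothetical name for the lambda.

```python
import pandas as pd

# Values copied from the course table above.
course_scores = pd.DataFrame(
    {
        "course": [
            "Data Structures", "Databases", "Linear Algebra",
            "Machine Learning", "Python Programming", "Statistics",
        ],
        "midterm": [61.74, 60.90, 60.74, 61.00, 60.54, 60.16],
        "final": [59.67, 59.65, 59.67, 60.78, 59.35, 59.80],
        "project": [66.24, 64.97, 64.76, 64.83, 65.29, 65.59],
        "pass_rate": [0.96, 0.96, 0.96, 0.96, 0.96, 0.97],
    }
)


def mean_fn(x: pd.DataFrame) -> pd.Series:
    """Column means, skipping non-numeric columns such as the course label."""
    return x.mean(numeric_only=True)


grand_mean = mean_fn(course_scores)

# The string column is skipped; only the four numeric columns remain.
print(list(grand_mean.index))  # ['midterm', 'final', 'project', 'pass_rate']

# Rounded for reading: midterm 60.85, final 59.82, project 65.28, pass_rate 0.96
print(grand_mean.round(2))
```

Note that this is an unweighted mean of course-level means. It is a fair stand-in for the overall mean here because every course has the same number of rows; with unequal group sizes the two would differ.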

Activity 5 - Add a Min and Max Row

Goal: Extend with_summary to show three summary rows: Min, Max, and Mean across all score columns. Pass a dict with three keys to fns.
fns={"Min": lambda x: x.min(numeric_only=True), "Max": lambda x: x.max(numeric_only=True), "Mean": lambda x: x.mean(numeric_only=True)}
# TODO: add Min/Max/Mean grand summary rows
...

7. Model Comparison with metrics_report()

The ark.plot.gt_style module ships metrics_report(), a one-call wrapper that produces a publication-ready model comparison table. It handles formatting, brand styling, and conditional highlighting in a single call.

metrics_report(df, metrics, minimize_cols, maximize_cols) highlights the best value in each metric column in green: the lowest value for columns in minimize_cols (lower is better: MAE, RMSE) and the highest value for columns in maximize_cols (higher is better: R², accuracy). The caller decides which direction is better for each metric; the function does not guess.

comparison = pd.DataFrame(
    {
        "Model": ["Linear Regression", "Ridge (α=0.1)", "Ridge (α=1.0)", "Random Forest"],
        "MAE": [8.21, 8.09, 7.98, 7.43],
        "RMSE": [10.42, 10.31, 10.19, 9.61],
        "R2": [0.781, 0.784, 0.788, 0.810],
    }
)

metrics_report(
    comparison,
    metrics=["MAE", "RMSE", "R2"],
    minimize_cols=["MAE", "RMSE"],
    maximize_cols=["R2"],
    title="Grade Prediction: Model Comparison",
    subtitle="Predicting average_marks from study_hours, attendance_pct, and program",
    source_note="university_analytics.csv · 5-fold CV · held-out 20% test set",
)
Grade Prediction: Model Comparison
Predicting average_marks from study_hours, attendance_pct, and program
Model MAE RMSE R2
Linear Regression 8.210 10.420 0.781
Ridge (α=0.1) 8.090 10.310 0.784
Ridge (α=1.0) 7.980 10.190 0.788
Random Forest 7.430 9.610 0.810
university_analytics.csv · 5-fold CV · held-out 20% test set

Key Concept: metrics_report highlights by direction, not by rank

minimize_cols highlights the row with the lowest value: better for error metrics. maximize_cols highlights the row with the highest value: better for performance metrics. A column can appear in at most one list. If a column appears in neither, it is formatted but not highlighted.
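The selection rule described above can be sketched in plain pandas with idxmin and idxmax. This is an illustration of the rule only, not the implementation of metrics_report, which lives in the project module and is not shown here.

```python
import pandas as pd

# Same comparison data as in the example above.
comparison = pd.DataFrame(
    {
        "Model": ["Linear Regression", "Ridge (α=0.1)", "Ridge (α=1.0)", "Random Forest"],
        "MAE": [8.21, 8.09, 7.98, 7.43],
        "RMSE": [10.42, 10.31, 10.19, 9.61],
        "R2": [0.781, 0.784, 0.788, 0.810],
    }
)

minimize_cols = ["MAE", "RMSE"]  # lower is better
maximize_cols = ["R2"]           # higher is better

# For each metric, find the row label of the best value, then look up the model.
best = {}
for col in minimize_cols:
    best[col] = comparison.loc[comparison[col].idxmin(), "Model"]
for col in maximize_cols:
    best[col] = comparison.loc[comparison[col].idxmax(), "Model"]

print(best)
# {'MAE': 'Random Forest', 'RMSE': 'Random Forest', 'R2': 'Random Forest'}
```

Because the direction is supplied per column, a metric placed in the wrong list would silently highlight the worst model, which is why the caller has to state it explicitly.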

Activity 6 - Add a Gradient Boosting Row

Goal: Add a fifth row to comparison (“Gradient Boosting” with MAE=6.91, RMSE=8.84, R2=0.843) and re-run metrics_report(). Confirm the highlighted row updates automatically.
comparison = pd.concat([comparison, pd.DataFrame([{"Model": "Gradient Boosting", "MAE": 6.91, "RMSE": 8.84, "R2": 0.843}])], ignore_index=True)
metrics_report(comparison, metrics=[...], minimize_cols=[...], maximize_cols=[...], ...)
# TODO: add Gradient Boosting row and re-run metrics_report
...

Capstone: Course Performance Report

Combine every technique from this notebook into one complete report table. The report should give a department head a single table they can paste into a slide deck.

Capstone Exercise - Course Performance Report

Goal:
  1. Build a report DataFrame grouped by course with columns: students, midterm mean, final mean, project mean, pass rate, and average_marks mean
  2. Wrap with GT. Add a descriptive title and source note
  3. Apply cols_label and the appropriate fmt_* for each column
  4. Add a tab_spanner over the three score columns
  5. Highlight the course with the highest pass rate (green) and lowest pass rate (light red)
  6. Add a grand Mean summary row across all numeric columns
  7. Call themed_gt() last
# Build report DataFrame first, then chain all GT methods in one expression
# TODO: build the complete course performance report
...

Further Reading

| Resource | Why it matters |
|----------|----------------|
| Great Tables documentation | Complete API reference with rendered examples for every method |
| Great Tables: loc reference | Full list of location helpers: loc.body, loc.column_labels, loc.spanner_labels, loc.grand_summary |
| Great Tables blog: Python tables | Worked examples including financial reports and ML comparison tables |
| Knaflic, C.N. (2015). Storytelling with Data. Wiley. | Chapter 2 covers when tables serve communication better than charts |
| pandas GroupBy.agg reference | Named aggregations (col=(src, func)) used throughout this notebook |

Summary

| GT method | What it does |
|-----------|--------------|
| GT(df) | Wrap a DataFrame and begin the method chain |
| themed_gt(table, n_rows=n) | Apply project brand: header colors, font, striped rows. Call last. |
| tab_header(title, subtitle) | Add a title row above the column headers |
| tab_source_note(text) | Add an attribution line below the table |
| cols_label(**kwargs) | Replace column names with reader-facing labels |
| fmt_number(columns, decimals) | Round floats to decimals places |
| fmt_integer(columns) | Round to a whole number, add thousands separator |
| fmt_percent(columns, decimals) | Multiply by 100 and append % |
| sub_missing(columns, missing_text) | Replace NaN with a readable placeholder |
| tab_spanner(label, columns) | Group related columns under a shared header label |
| tab_style(style, locations) | Apply a visual property (fill, text) to a location (loc.body, loc.column_labels) |
| loc.body(columns, rows) | Target specific cells; rows takes an index or a lambda |
| grand_summary_rows(fns) | Add one summary row per key in fns; aggregates all numeric columns |
| metrics_report(df, metrics, ...) | One-call ML comparison table with directional highlighting |

Next: Part 3: Dev Tools covers the professional toolchain: uv, ruff, type annotations, git, pytest, and pre-commit.