Part 7: Data Storytelling and House Style

DS-MLOps Python Foundations

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook assumes you have completed Part 5 (05-matplotlib.ipynb) and Part 6 (06-lets-plot.ipynb). Parts 5 and 6 taught you how to make a chart at all. This part is about taste: which chart to make, what to leave out of it, and how to brand it once it is ready to show someone other than yourself, using this project’s own ark.plot module.

Topics covered

Topic	Why it matters
Chart selection	The question decides the chart, not the data type
Data-ink ratio	Every pixel that is not data is a tax on the reader’s attention
Common chart lies	The same data, drawn to mislead instead of inform
Tables vs. charts	Exact lookup wants a table; pattern recognition wants a chart
House style	Replacing per-chart styling with one reusable theme

Callout markers used throughout this notebook are explained on the book cover page.

Learning Objectives

By the end of Part 7 you will be able to:

#	Skill	Covered in
1	Choose a chart type based on the question, not habit	Sec. 1
2	Recognise and avoid the most common ways charts mislead	Sec. 2
3	Decide when a table communicates better than a chart	Sec. 3
4	Apply `ark.plot`’s house theme to matplotlib and lets-plot charts	Sec. 4
5	Identify preattentive attributes and use them deliberately	Sec. 5
6	Add annotations that carry the chart’s narrative	Sec. 6
7	Explain why summary statistics alone are never sufficient (Datasaurus)	Sec. 7
8	Produce one polished, captioned chart from a messy dataset	Capstone

1. Picking the Right Chart, and Cutting What Does Not Help

The most common visualisation mistake is not a styling problem but a selection problem: building a chart type out of habit instead of asking what question it needs to answer. A pie chart cannot answer “which course improved the most,” because people are measurably worse at comparing angles than at comparing positions along a shared axis. That is not opinion; it is the finding of a well-known study by Cleveland and McGill (1984), which ranked how accurately people read different visual encodings: position along a common scale first, then length, then angle, then area, with colour saturation last.

import numpy as np
import pandas as pd

# Load the shared university dataset used across all Part 1 and Part 2 notebooks
df7 = pd.read_csv("data/university_analytics.csv")

# `results` keeps the course/semester/exam_score structure that cells below depend on
results = df7[["course", "semester", "final_score"]].copy()
results = results.rename(columns={"final_score": "exam_score"})
results.head()

	course	semester	exam_score
0	Python Programming	Fall 2023	54.4
1	Statistics	Spring 2024	64.9
2	Data Structures	Fall 2023	64.1
3	Linear Algebra	Spring 2024	59.5
4	Machine Learning	Fall 2023	59.4

Data-ink ratio, a term from Edward Tufte’s The Visual Display of Quantitative Information, is the proportion of a chart’s ink that represents actual data, as opposed to gridlines, borders, redundant legends, and other decoration. Maximising it does not mean a bare chart; it means every mark earns its place. Compare matplotlib’s defaults to a version with the same data and nothing extra:

import matplotlib.pyplot as plt

course_means = results.groupby("course")["exam_score"].mean()

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))

# Left: matplotlib defaults. Box on all four sides, both gridlines, and a
# different colour per bar that encodes nothing, since each bar already
# has its own label directly underneath it.
axes[0].bar(course_means.index, course_means.values, color=["#4477AA", "#EE6677", "#228833"])
axes[0].set_title("Before: matplotlib defaults")
axes[0].grid(True)

# Right: same data, decluttered. No top/right spines, no gridlines, the
# value printed directly on each bar instead of relying on the y-axis.
axes[1].bar(course_means.index, course_means.values, color="#4477AA")
axes[1].spines["top"].set_visible(False)
axes[1].spines["right"].set_visible(False)
axes[1].set_yticks([])
for i, value in enumerate(course_means.values):
    axes[1].text(i, value + 1, f"{value:.0f}", ha="center")
axes[1].set_title("After: higher data-ink ratio")

fig.tight_layout()

Key Concept: Data-ink ratio

Every gridline, border, and redundant legend entry competes with your data for the reader’s attention. The right-hand chart above removes the y-axis ticks (the labelled bars make them redundant) and the top/right spines, and prints each value once, directly on its bar, instead of forcing the reader to trace a line back to an axis. Decluttering is not about making a chart sparse for its own sake; it is about removing anything that does not carry information.
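The decluttering steps described above can be wrapped in a small helper so they are applied the same way every time. The sketch below is one possible version, not part of ark.plot: the `declutter` function and the three course means are hypothetical, chosen only for illustration, and it relies on `Axes.bar_label` (matplotlib 3.4 or later).

```python
import matplotlib.pyplot as plt


def declutter(ax, fmt="%.0f"):
    """Remove non-data ink from a bar chart and label each bar directly.

    Hypothetical helper for illustration; assumes `ax` already holds
    exactly one bar container (a single ax.bar() call).
    """
    ax.spines[["top", "right"]].set_visible(False)
    ax.grid(False)
    ax.set_yticks([])  # the direct labels below replace the y-axis ticks
    return ax.bar_label(ax.containers[0], fmt=fmt, padding=2)


# Hypothetical course means, not taken from university_analytics.csv
courses = ["Course A", "Course B", "Course C"]
means = [72.0, 68.0, 75.0]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(courses, means, color="#4477AA")
labels_drawn = declutter(ax)
ax.set_title("Decluttered with one helper call")
fig.tight_layout()
```

Keeping the steps in one function means a later decision, such as restoring light gridlines, is made once rather than chart by chart.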

One message per chart is the last principle worth internalising before the chart-lie gallery in Section 2: a chart that tries to answer three questions at once usually answers none of them clearly. If you find yourself adding a third colour dimension, a secondary y-axis, and a trend line to the same chart, that is a sign you need two charts, not one crowded one.

2. Common Chart Lies

None of the charts below contain incorrect numbers. Each one is drawn in a way that makes the reader’s brain compute the wrong conclusion from correct data. Recognising these on sight, in your own draft charts, is the actual skill.

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))

axes[0].bar(course_means.index, course_means.values, color="#4477AA")
axes[0].set_ylim(0, 100)
axes[0].set_title("Honest: y-axis starts at 0")

axes[1].bar(course_means.index, course_means.values, color="#EE6677")
# Start the axis just below the smallest bar, the way a truncated chart would
axes[1].set_ylim(course_means.min() - 0.5, course_means.max() + 0.5)
axes[1].set_title("Misleading: y-axis starts just below the data")

fig.tight_layout()

Common Mistake: Truncating the y-axis

The two charts above show exactly the same course means. The right-hand one starts its y-axis just below the smallest bar instead of at 0, which makes a difference of a point or two between courses look as if one course scores several times higher than another. Bar charts in particular must start at zero, because the reader judges bar height (and therefore a ratio) automatically. Line charts can sometimes justify a non-zero baseline if it is labelled clearly; bars almost never can.
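The size of the distortion can be worked out directly: a reader compares drawn bar heights, and a drawn height is the value minus the axis baseline. The sketch below uses two illustrative means (59.4 and 60.8) and a hypothetical baseline of 59; `apparent_ratio` is a name introduced here for illustration only.

```python
def apparent_ratio(low: float, high: float, baseline: float) -> float:
    """Ratio of the two drawn bar heights when the y-axis starts at `baseline`."""
    if baseline >= low:
        raise ValueError("baseline must sit below the smaller value")
    return (high - baseline) / (low - baseline)


low, high = 59.4, 60.8  # illustrative course means

honest = apparent_ratio(low, high, baseline=0)      # bars drawn from zero
truncated = apparent_ratio(low, high, baseline=59)  # axis cut just under the data

print(f"Ratio of the values themselves: {high / low:.2f}x")
print(f"Drawn ratio, axis from 0:       {honest:.2f}x")
print(f"Drawn ratio, axis from 59:      {truncated:.2f}x")
```

With the axis starting at zero the drawn ratio equals the true ratio; once the baseline moves up toward the data, the same 1.4-point gap is drawn as a several-fold difference in bar height.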

fig, ax = plt.subplots(figsize=(6, 3.5))

cohort_size = np.array([320, 340, 410, 580])
avg_score = np.array([71, 72, 70, 74])
years = ["2022", "2023", "2024", "2025"]

ax.plot(years, cohort_size, color="#4477AA", marker="o", label="Cohort size")
ax.set_ylabel("Cohort size", color="#4477AA")

ax2 = ax.twinx()
ax2.plot(years, avg_score, color="#EE6677", marker="o", label="Average score")
ax2.set_ylabel("Average score", color="#EE6677")

ax.set_title("Two independently scaled axes invite a false correlation");

Common Mistake: Dual axes with independent scales

Cohort size and average score both happen to rise together in this chart, but each axis was scaled independently to make the two lines fit nicely, which is exactly what makes dual axes dangerous: you can almost always pick scales that make two unrelated series appear to move together. If two series genuinely need comparing, plot their percent change from a common baseline on one shared axis instead, or use two separate charts stacked vertically with the same x-axis.
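The percent-change alternative mentioned above might look like the following sketch. It reuses the cohort and score arrays from the dual-axis cell; the helper name `pct_change_from_first` is an assumption made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

years = ["2022", "2023", "2024", "2025"]
cohort_size = np.array([320, 340, 410, 580])
avg_score = np.array([71, 72, 70, 74])


def pct_change_from_first(series):
    """Percent change of each value relative to the first value."""
    series = np.asarray(series, dtype=float)
    return (series / series[0] - 1.0) * 100.0


cohort_pct = pct_change_from_first(cohort_size)
score_pct = pct_change_from_first(avg_score)

# Both series now share one unit (% change since 2022), so one axis is enough
fig, ax = plt.subplots(figsize=(6, 3.5))
ax.plot(years, cohort_pct, color="#4477AA", marker="o", label="Cohort size")
ax.plot(years, score_pct, color="#EE6677", marker="o", label="Average score")
ax.axhline(0, color="#ADB5BD", lw=1)
ax.set_ylabel("Change since 2022 (%)")
ax.set_title("On a shared axis, cohort size grew and scores barely moved")
ax.legend(frameon=False)
fig.tight_layout()
```

On a shared percent scale the apparent co-movement disappears: cohort size rises by roughly 80% while the average score changes by only a few percent.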

enrollment = pd.Series(
    {"ML": 145, "Data Structures": 132, "Statistics": 98, "Databases": 41, "Networks": 28, "Ethics": 15}
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].pie(enrollment.values, labels=enrollment.index, autopct="%1.0f%%")
axes[0].set_title("Pie: hard to rank six similar slices")

axes[1].barh(enrollment.index[::-1], enrollment.values[::-1], color="#4477AA")
axes[1].set_title("Bar: ranking is immediate")

fig.tight_layout()

Common Mistake: Too many pie slices

Six categories is already past where a pie chart works: Ethics and Networks are visibly different bars, but as pie slices they are nearly impossible to rank by eye, exactly the angle-judgment weakness Cleveland and McGill measured. A horizontal bar chart, sorted by value, answers “which course has the most enrollment” instantly. Reserve pie charts for two or three categories where one slice is clearly dominant, and prefer a bar chart almost everywhere else.

3. When a Table Beats a Chart

A chart is for spotting a pattern at a glance. A table is for looking up an exact number, or comparing many numbers a reader needs to cite precisely, such as a results appendix someone will quote in a report. This project’s ark.plot.gt_style module wraps great_tables with the same brand colours used everywhere else, so a results table looks like it belongs in the same document as the chart next to it:

from great_tables import GT, md as gt_md

from ark.plot.gt_style import themed_gt

summary = results.groupby("course")["exam_score"].agg(mean="mean", std="std", n="count").round(1).reset_index()

table = themed_gt(
    GT(summary)
    .tab_header(title=gt_md("**Exam Score Summary**"), subtitle="Mean, spread, and sample size per course")
    .cols_label(course="Course", mean="Mean", std="Std. Dev.", n="N"),
    n_rows=len(summary),
)
table

Exam Score Summary
Mean, spread, and sample size per course
Course	Mean	Std. Dev.	N
Data Structures	59.7	14.6	400
Databases	59.6	15.6	400
Linear Algebra	59.7	15.3	400
Machine Learning	60.8	15.2	400
Python Programming	59.4	15.4	400
Statistics	59.8	14.4	400

Pro Tip: Pair a chart with a table, do not choose between them

A results section in a report often wants both: the histogram from Part 5 to show the shape of the distribution, and a table like the one above to give the exact numbers someone will want to quote. Neither replaces the other.

4. From Default to Branded

Parts 5 and 6 used each library’s own defaults on purpose, so the charts in those notebooks teach the mechanics without a styling layer in the way. This project’s ark.plot module exists precisely so you do not have to restate the same colours, fonts, and spacing on every single chart by hand. Compare a lets-plot chart with and without it:

from lets_plot import LetsPlot, aes, geom_histogram, gggrid, ggplot, ggsize

LetsPlot.setup_html()

default_chart = ggplot(results, aes(x="exam_score", fill="course")) + geom_histogram(bins=20, alpha=0.6)
default_chart + ggsize(450, 300)

from lets_plot import labs, scale_fill_manual

from ark.plot.theme import modern_theme, pro_colors

branded_chart = (
    ggplot(results, aes(x="exam_score", fill="course"))
    + geom_histogram(bins=20, alpha=0.85)
    + scale_fill_manual(values=pro_colors)
    + labs(
        title="Exam score distribution by course",
        x="Exam score (0-100)",
        y="Number of students",
        fill="Course",
    )
    + modern_theme(grid=True)
)
branded_chart + ggsize(450, 300)

modern_theme() is one more + layer, exactly like every geom_*() in Part 6. scale_fill_manual(values=pro_colors) pins each course to the brand palette, and labs() replaces the auto-generated column names with reader-facing text. gggrid() with ncol=1 stacks both charts in one output, which is the cleanest way to show a before/after comparison in a single cell:

gggrid([default_chart + ggsize(580, 380), branded_chart + ggsize(580, 380)], ncol=1)

The matplotlib equivalent, ark.plot.matplot_theme.configure_matplotlib_style(), works differently: it updates matplotlib’s global rcParams, so it applies to every chart drawn after you call it, not just one. That is the right trade-off for a notebook or script that draws many charts you want to look consistent, at the cost of not being able to mix styled and unstyled charts in the same figure:

from ark.plot.matplot_theme import configure_matplotlib_style

configure_matplotlib_style(font_size=10, fig_width=5)

fig, ax = plt.subplots()
ax.hist(results["exam_score"], bins=20)
ax.set_xlabel("Exam score")
ax.set_ylabel("Number of students")
ax.set_title("Same chart as Part 5, after configure_matplotlib_style()");
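If a single chart does need different settings from the global style, matplotlib's own `rc_context()` can limit rcParams changes to a `with` block. The sketch below uses only standard matplotlib and makes no assumption about what configure_matplotlib_style() sets; the specific overrides are illustrative.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

baseline_size = mpl.rcParams["font.size"]  # whatever the active style set

# Overrides apply only to figures created inside the with block
overrides = {"font.size": 14, "axes.spines.top": False, "axes.spines.right": False}
with mpl.rc_context(overrides):
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot([0, 1, 2], [1, 3, 2])
    ax.set_title("Styled only inside the context")
    size_inside = mpl.rcParams["font.size"]

size_after = mpl.rcParams["font.size"]  # restored on exit
print(baseline_size, size_inside, size_after)
```

The global approach stays the default for consistency; the context manager is an escape hatch for the occasional exception.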

Activity 1 - Brand a Boxplot

Goal: Rebuild Part 5’s seaborn boxplot of exam_score by course, but after calling configure_matplotlib_style() from this section. No other code changes: the same sns.boxplot() call should now pick up the brand colour cycle automatically.

import seaborn as sns

fig, ax = plt.subplots()
sns.boxplot(data=results, x="course", y="exam_score", ax=ax)
ax.set_title(...)

import seaborn as sns

# TODO: boxplot of exam_score by course, styled by the already-active house theme
...

Pro Tip: A house theme is a contract, not a one-off setting

The value of ark.plot is not any single colour choice, it is that every chart in this project, in any notebook, in any report, reaches for the same module instead of redefining its own palette. When the brand colours change, they change in one file.

5. Preattentive Attributes: What the Brain Sees First

Before you consciously read a chart, your visual system has already processed certain features. These are called preattentive attributes: colour hue, luminance, shape, orientation, size, position, enclosure, and motion. They register in under 250 milliseconds, before attention is directed.

The design implication is simple but easy to violate: use at most one preattentive attribute for emphasis per chart. Two or more compete for attention and cancel each other out.

Key Concept: Preattentive Attributes

Visual properties processed before conscious attention: colour hue, luminance, shape, orientation, size, position, enclosure, and motion. The practical rule: encode at most one categorical dimension preattentively, and make sure it carries information, not decoration.

Reference: Ware, C. (2012). Information Visualization: Perception for Design, 3rd ed.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df7 = pd.read_csv("data/university_analytics.csv")
course_means = df7.groupby("course")["final_score"].mean().sort_values()

# Left: every bar the same colour — no preattentive signal
# Right: one bar highlighted with colour — attention goes there instantly
TARGET = "Machine Learning"

fig, axes = plt.subplots(1, 2, figsize=(11, 3.8))

for ax, use_highlight in zip(axes, [False, True], strict=False):
    colors = (
        ["#009E73" if c == TARGET else "#ADB5BD" for c in course_means.index]
        if use_highlight
        else ["#ADB5BD"] * len(course_means)
    )
    ax.barh(course_means.index, course_means.values, color=colors)
    ax.set_xlabel("Mean final score")
    ax.spines[["top", "right"]].set_visible(False)
    ax.set_xlim(0, 100)
    title = "No preattentive signal" if not use_highlight else f"Colour highlights {TARGET!r}"
    ax.set_title(title, fontsize=11, fontweight="bold", color="#1E293B")

fig.suptitle("Same data — preattentive colour changes where the eye goes first", fontsize=10, color="#6B7280")
fig.tight_layout()
plt.show()

Activity 5 - Highlight by Region

Goal: Build a horizontal bar chart of mean midterm_score by region. Highlight the region with the highest mean in #0369A1 and colour every other bar #ADB5BD. Add a concise title stating which region leads.

region_means = df7.groupby("region")["midterm_score"].mean().sort_values()
# ... build the chart with one highlighted bar ...

# TODO: highlight the top-scoring region in the midterm score chart
...

6. Annotation as Narrative

A chart without annotation forces viewers to form their own interpretation. Annotation is where the “so what?” lives — the one sentence that turns a picture into an argument.

Effective annotation does three things:

1. Points at the specific data feature that supports the message (arrow or callout).
2. States that feature in plain language, not chart jargon.
3. Uses restraint: one annotation per message, not one per data point.

Key Concept: Annotation as the “So What?”

Knaflic’s rule (2015): “the most important single thing you can do to improve your graph is to add a descriptive title.” Add arrows and labels that direct attention to the exact point you want the viewer to remember. A good annotation answers: what should I notice here, and why does it matter?

fig, axes = plt.subplots(1, 2, figsize=(12, 4.2))

# ── Shared setup: draw the same grey bars on both panels ──────────────────────
program_means = df7.groupby("program")["final_score"].mean().sort_values()

for ax in axes:
    ax.bar(program_means.index, program_means.values, color=["#ADB5BD"] * len(program_means))
    ax.set_ylabel("Mean final score")
    ax.spines[["top", "right"]].set_visible(False)
    ax.set_ylim(0, 85)
    ax.tick_params(axis="x", rotation=15)

# ── Left: unannotated, so the viewer has to guess the takeaway ────────────────
axes[0].set_title("Without annotation\n(viewer guesses the message)", fontsize=10, fontweight="bold")

# ── Right: annotated — narrative is explicit ──────────────────────────────────
best_idx = list(program_means.index).index(program_means.idxmax())
axes[1].patches[best_idx].set_facecolor("#0369A1")

axes[1].annotate(
    # Take the program name from the data instead of hard-coding it
    f"{program_means.idxmax()} leads\nat {program_means.max():.1f}",
    xy=(best_idx, program_means.max()),
    xytext=(best_idx + 0.6, program_means.max() - 4),
    fontsize=9,
    color="#0369A1",
    fontweight="bold",
    arrowprops={"arrowstyle": "->", "color": "#0369A1", "lw": 1.4},
    ha="left",
)
axes[1].set_title("With annotation\n(message is explicit)", fontsize=10, fontweight="bold")

fig.suptitle("Annotation turns a picture into an argument", fontsize=10, color="#6B7280")
fig.tight_layout()
plt.show()
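Because the title and annotation carry the message, it can help to derive their wording from the data rather than typing it by hand, so the text stays correct if the numbers change. A minimal sketch, using hypothetical program means (not values from university_analytics.csv):

```python
import pandas as pd

# Hypothetical program means, for illustration only
program_means = pd.Series({"Program X": 58.2, "Program Y": 61.7, "Program Z": 59.9}).sort_values()

leader = program_means.idxmax()
runner_up = program_means.drop(leader).idxmax()
gap = program_means[leader] - program_means[runner_up]

# One plain-language sentence that can serve as the chart title
title = f"{leader} leads {runner_up} by {gap:.1f} points"
print(title)
```

The same string can be passed to `ax.set_title()` or reused in `ax.annotate()`, which keeps the chart and its stated message in step.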

Activity 6 - Annotate a Trend

Goal: Plot mean final_score by semester as a line chart. Add ax.annotate() to flag the semester with the highest mean and write a one-sentence insight as the chart title.

sem_means = df7.groupby("semester")["final_score"].mean()
# ... line chart + annotation ...

# TODO: line chart of final_score by semester with an annotation on the peak semester
...

7. The Datasaurus Dozen: Always Visualize

In 1973 Francis Anscombe constructed four datasets with nearly identical summary statistics — same mean, variance, and correlation — but radically different visual shapes. In 2017, Matejka and Fitzmaurice extended this to thirteen datasets, including a dinosaur, which gave the collection its name: the Datasaurus Dozen.

The lesson is not subtle: summary statistics alone hide the shape of your data. A mean and a standard deviation cannot distinguish a normal distribution from a bimodal one, a dinosaur, or a circle.
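Anscombe's original four datasets are small enough to check by hand. The sketch below hard-codes the published quartet values and compares means and correlations after rounding; the dictionary layout and variable names are choices made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Anscombe's quartet (1973): sets I to III share the same x values
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
quartet = {
    "I": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}

# Rounded mean of x, mean of y, and Pearson correlation for each set
summary = {}
for name, (x, y) in quartet.items():
    r = float(np.corrcoef(x, y)[0, 1])
    summary[name] = (round(float(x.mean()), 1), round(float(y.mean()), 1), round(r, 2))
    print(name, summary[name])

# The plots tell the four sets apart immediately
fig, axes = plt.subplots(1, 4, figsize=(12, 3), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes, quartet.items()):
    ax.scatter(x, y, s=18, color="#0369A1")
    ax.set_title(f"Set {name}")
fig.tight_layout()
```

All four rows of the printed summary agree after rounding, while the scatter plots show a linear trend, a curve, a line with one outlier, and a vertical stack with one outlier.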

Key Concept: Datasaurus Dozen

Matejka, J. & Fitzmaurice, G. (2017). Same stats, different graphs. Proceedings of CHI 2017. The original Anscombe’s Quartet dates from 1973. Both show the same principle: visualize before summarizing. A modern corollary for ML: never report only RMSE without looking at a residual plot.

# Demonstrate "similar stats, different shapes" with two student subsets
# whose means nearly match but whose distributions look nothing alike.

rng = np.random.default_rng(7)

# Group A: normally distributed scores (typical class)
n = 200
group_a = pd.DataFrame(
    {
        "group": "Group A (normal)",
        "midterm": rng.normal(68, 12, n).clip(20, 100),
        "final": rng.normal(70, 12, n).clip(20, 100),
    }
)

# Group B: bimodal — two subpopulations merged (e.g. two campuses with different resources)
group_b = pd.DataFrame(
    {
        "group": "Group B (bimodal)",
        "midterm": np.concatenate([rng.normal(55, 6, n // 2), rng.normal(82, 6, n // 2)]),
        "final": np.concatenate([rng.normal(57, 6, n // 2), rng.normal(84, 6, n // 2)]),
    }
)

combined = pd.concat([group_a, group_b], ignore_index=True)

# Summary stats: means within about a point of each other, stds a few points apart
print("Summary statistics:")
print(combined.groupby("group")[["midterm", "final"]].describe().round(1).to_string())
print()

fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=False)

for ax, (name, grp) in zip(axes, combined.groupby("group"), strict=False):
    ax.scatter(grp["midterm"], grp["final"], alpha=0.4, s=18, color="#0369A1")
    ax.set_xlabel("Midterm score")
    ax.set_ylabel("Final score")
    ax.set_title(name, fontweight="bold")
    ax.spines[["top", "right"]].set_visible(False)
    # Annotate means
    ax.axvline(grp["midterm"].mean(), color="#D97706", lw=1.2, ls="--", label=f"mean={grp['midterm'].mean():.1f}")
    ax.axhline(grp["final"].mean(), color="#D97706", lw=1.2, ls="--")
    ax.legend(fontsize=8)

fig.suptitle("Similar means, completely different structures", fontsize=10, color="#6B7280")
fig.tight_layout()
plt.show()

Summary statistics:
                  midterm                                            final                                          
                    count  mean   std   min   25%   50%   75%   max  count  mean   std   min   25%   50%   75%   max
group                                                                                                               
Group A (normal)    200.0  66.4  10.5  37.8  60.2  66.4  73.9  94.9  200.0  69.0  11.7  31.0  60.7  69.3  76.3  97.1
Group B (bimodal)   200.0  67.4  14.8  38.1  53.2  68.2  81.4  93.2  200.0  70.4  14.5  40.4  56.8  71.0  83.0  97.7

Common Mistake: Reporting only mean ± std for a bimodal distribution

Group B above has almost the same mean as Group A, and its standard deviation is only a few points larger. A table of statistics alone would suggest two broadly similar classes, one slightly more spread out. The scatter plot reveals that Group B is actually two separate subpopulations, a finding with direct intervention implications.

Rule: for any continuous variable you care about, always plot a distribution (histogram, KDE, strip plot, or box plot) before reporting a single number.
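A two-component mixture with equal weights, component standard deviation 6, and centres 27 points apart has an overall standard deviation of about sqrt(6² + 13.5²) ≈ 14.8, which is why a bimodal group tends to come out more spread than a single normal with standard deviation 12. For a demonstration in which the two groups agree exactly on mean and standard deviation, one option is to rescale both samples to the same moments. The sketch below does that with hypothetical targets (mean 68, standard deviation 14); `match_moments` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200


def match_moments(sample, mean, std):
    """Linearly rescale `sample` to the given mean and (population) std.

    A linear rescale changes location and spread but not the shape,
    so a bimodal sample stays bimodal.
    """
    z = (sample - sample.mean()) / sample.std()
    return z * std + mean


unimodal = match_moments(rng.normal(0, 1, n), mean=68.0, std=14.0)
bimodal_raw = np.concatenate([rng.normal(55, 6, n // 2), rng.normal(82, 6, n // 2)])
bimodal = match_moments(bimodal_raw, mean=68.0, std=14.0)

print(f"unimodal: mean={unimodal.mean():.1f}, std={unimodal.std():.1f}")
print(f"bimodal:  mean={bimodal.mean():.1f}, std={bimodal.std():.1f}")

# Share of students within 5 points of the mean: the "typical" student
near_mean_uni = np.mean(np.abs(unimodal - 68.0) < 5)
near_mean_bi = np.mean(np.abs(bimodal - 68.0) < 5)
print(f"within 5 points of the mean: unimodal={near_mean_uni:.0%}, bimodal={near_mean_bi:.0%}")
```

Even with identical means and standard deviations, far fewer students in the bimodal group sit near the mean, which is exactly the structure a histogram or KDE would expose and a summary table would not.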

Activity 7 - Unmask the Hidden Structure

Goal: From university_analytics.csv, compare the distribution of final_score for two programs side-by-side using overlapping KDE plots (or box plots). Before plotting, print the mean and standard deviation for each program. Then reflect: would the statistics alone have revealed the difference?

import seaborn as sns
# ... print stats, then plot KDE for each program ...

# TODO: compare final_score distributions for two programs: print stats, then KDE plot
...

Capstone: Rescue a Misleading Chart

Below is a chart with three problems from this notebook stacked together: a truncated y-axis, no axis labels, and matplotlib’s unbranded defaults. Fix all three, then add one caption sentence stating the chart’s single message.

# The chart to rescue. Run this first to see what is wrong with it.
fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(course_means.index, course_means.values, color=["#4477AA", "#EE6677", "#228833"])
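# Truncated axis: starts just below the smallest bar instead of at 0
ax.set_ylim(course_means.min() - 0.5, course_means.max() + 0.5)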
fig.tight_layout()

Capstone Exercise - Fix the Chart

Goal:

Fix the y-axis to start at 0
Add x-axis and y-axis labels
Apply the house style with configure_matplotlib_style()
Add one print statement with a one-sentence caption stating the chart’s single message

Hint: configure_matplotlib_style() only affects charts drawn after it runs, so call it before plt.subplots(), not after.

# TODO: rescue the chart
...

print("Caption: ...")

Caption: ...

Resource	Why it matters
Knaflic, C.N. (2015). Storytelling with Data. Wiley.	The most practical data visualisation book for analysts; the annotated chart examples in Chapter 5 are worth the price alone
Wilke, C.O. (2019). Fundamentals of Data Visualization. O’Reilly.	Free at clauswilke.com/dataviz; Chapter 17 on “proportional ink” and Chapter 23 on balancing the data and the context directly support Sections 1–2 of this notebook
Schwabish, J. (2021). Better Data Visualizations. Columbia University Press.	Covers annotation, layout, and the “sorta spaghetti” problem; strong on the gap between what analysts produce and what executives read
Cleveland, W.S. & McGill, R. (1984). Graphical perception. Journal of the American Statistical Association 79(387), 531–554.	The empirical study that established the encoding hierarchy used in Section 1
Matejka, J. & Fitzmaurice, G. (2017). Same stats, different graphs. CHI 2017.	The Datasaurus Dozen paper — free PDF on the ACM DL

Summary

Concept	Key rule
Chart selection	Let the question pick the chart; position beats angle beats area for accurate reading
Data-ink ratio	Every gridline, border, and redundant legend competes with your data
One message per chart	If a chart needs a third colour dimension and a second y-axis, split it into two charts
Table vs chart	Use a table when the reader needs exact numbers; use a chart when the pattern matters
House style	Apply the house theme once and leave it — consistency is itself a signal of quality
Preattentive attributes	Encode at most one categorical dimension preattentively; two or more cancel out
Annotation as narrative	The title and annotation carry the “so what?” — write them before you design the chart
Datasaurus	Always visualise before reporting summary statistics — same stats, different shapes