Part 6: Grammar of Graphics with Lets-Plot


DS-MLOps Python Foundations

Python 3.12+ | Author: Anthony Faustine

Before you begin

This notebook assumes you have completed Part 5 (05-matplotlib.ipynb). If you have not, start there: this notebook rebuilds several of its charts on purpose, so the contrast between the two ways of thinking about a plot is concrete rather than abstract.

Part 5 was imperative: you told matplotlib exactly which method to call, in which order, for every piece of the chart. This part is declarative: you describe the data and the mapping you want, and the library works out how to draw it. This style is called the grammar of graphics, and lets-plot implements it in Python the way ggplot2 implements it in R. Part 7 (07-data-storytelling.ipynb) covers what makes a chart good and applies this project’s house style to both libraries.

Topic | Why it matters
Declarative vs. imperative | The mental model shift that makes the rest of this notebook click
ggplot + aes + geom | The three pieces every lets-plot chart is built from
Mapping vs. setting | The single most common mistake in any grammar of graphics
Layering | Adding a statistical summary on top of raw data, declaratively
Faceting | One line to replace a whole manual subplot loop
Titles, labels, and scales | Communicating clearly: naming axes, renaming legends, controlling colours

Callout markers used throughout this notebook are explained on the book cover page.

By the end of Part 6 you will be able to:

# | Skill | Covered in
1 | Explain the difference between declarative and imperative plotting | Sec. 1
2 | Build a chart from ggplot(), aes(), and a geom_*() | Sec. 2
3 | Distinguish mapping a variable from setting a fixed value | Sec. 3
4 | Layer a statistical summary on top of raw data | Sec. 4
5 | Facet one chart into many panels instead of looping over subplots | Sec. 5
6 | Add titles, axis labels, and control colours with labs() and scale functions | Sec. 6

0. Why Grammar of Graphics?

You have already drawn scatter plots with Matplotlib. You called ax.scatter(x, y, c=color, s=size). If you wanted a different geom, you called a different method. If you wanted a trend line, you called yet another. The plot grew by accumulating function calls, each one doing something slightly different.

There is another way to think about it. A chart is not a list of drawing commands: it is a mapping from data columns to visual channels (position, colour, shape, size). State that mapping once, and the library figures out how to draw it. Add a layer (a trend line, a rug, a label) and you extend the mapping rather than calling a new drawing function. This is the Grammar of Graphics, introduced by Leland Wilkinson in 1999 and popularised by Hadley Wickham’s ggplot2 for R.

Lets-Plot (lets-plot.org) is JetBrains’ Python implementation of the same grammar. Its API mirrors ggplot2 so closely that R users can read Lets-Plot code without a translation guide. It renders to HTML in Jupyter and to PNG for reports, and it is also supported in JetBrains’ Datalore notebooks.

Alternatives that use the same grammar

Library | Language | Notes
ggplot2 (ggplot2.tidyverse.org) | R | The original; Lets-Plot mirrors it deliberately
plotnine (plotnine.org) | Python | ggplot2 port; similar API, Matplotlib backend
Lets-Plot (lets-plot.org) | Python / Kotlin | HTML-first output, fast, maintained by JetBrains
Vega-Altair (altair-viz.github.io) | Python | Different grammar (Vega-Lite), interactive

Already in your environment

uv add lets-plot          # for a standalone project

1. Declarative vs. Imperative

In Part 5, building a scatter plot meant calling ax.scatter() directly: you were the one deciding which function draws which shape. The grammar of graphics flips this around. You describe what the data means (this column is the x position, this column is the colour) and the library decides how to draw it. The same description works whether you add one point or one million, and stays valid even if you later change the chart type entirely.

from lets_plot import LetsPlot

LetsPlot.setup_html()

Rebuild the same dataset from Part 5: exam results across three courses and two semesters.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

courses = np.array(["Machine Learning", "Data Structures", "Statistics"])
semesters = np.array(["Fall 2024", "Spring 2025"])

n_per_group = 60
course_col = np.repeat(courses, n_per_group * len(semesters))
semester_col = np.tile(np.repeat(semesters, n_per_group), len(courses))

course_base = {"Machine Learning": 68, "Data Structures": 74, "Statistics": 71}
semester_bump = {"Fall 2024": 0, "Spring 2025": 4}

exam_score = np.array(
    [rng.normal(course_base[c] + semester_bump[s], 10) for c, s in zip(course_col, semester_col, strict=True)]
).clip(0, 100)
study_hours = rng.uniform(0, 25, size=len(course_col))

results = pd.DataFrame(
    {
        "course": course_col,
        "semester": semester_col,
        "exam_score": exam_score,
        "study_hours": study_hours,
    }
)
results.head()
             course   semester  exam_score  study_hours
0  Machine Learning  Fall 2024   71.047171     3.940291
1  Machine Learning  Fall 2024   57.600159     3.694584
2  Machine Learning  Fall 2024   75.504512    23.403187
3  Machine Learning  Fall 2024   77.405647    10.947601
4  Machine Learning  Fall 2024   48.489648     9.582996

Here is the Part 5 scatter plot (study hours vs. exam score) again, this time declaratively:

from lets_plot import aes, geom_point, ggplot

ggplot(results, aes(x="study_hours", y="exam_score")) + geom_point(alpha=0.4)

Key Concept: The Grammar of Graphics

Every lets-plot chart is built from the same three pieces, combined with +: ggplot(data, aes(...)) declares the dataset and which columns map to which visual property, and one or more geom_*() layers say what shape to draw with that mapping. Change the geom_*() and the same aes() mapping produces a completely different chart type, often without touching anything else.
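To make the idea of a mapping concrete, here is a toy sketch in plain Python. It is not how lets-plot is implemented; the names resolve_mapping and PALETTE, the sample rows, and the colours are all invented for illustration. It only shows that one declarative description (column to channel) is enough to derive per-row visual values and a legend.

```python
# Toy illustration only: a "plot" is a description, not a list of drawing calls.
rows = [
    {"study_hours": 3.9, "exam_score": 71.0, "course": "Machine Learning"},
    {"study_hours": 12.5, "exam_score": 80.2, "course": "Statistics"},
    {"study_hours": 8.1, "exam_score": 64.3, "course": "Machine Learning"},
]

# The declarative part: which column feeds which visual channel.
mapping = {"x": "study_hours", "y": "exam_score", "color": "course"}

PALETTE = ["#4477AA", "#EE6677", "#228833"]  # hypothetical default palette


def resolve_mapping(rows, mapping):
    """Turn a column-to-channel mapping into per-row visual values plus a legend."""
    # Discrete levels of the colour column, in order of first appearance.
    levels = list(dict.fromkeys(row[mapping["color"]] for row in rows))
    legend = dict(zip(levels, PALETTE))
    marks = [
        {
            "x": row[mapping["x"]],
            "y": row[mapping["y"]],
            "color": legend[row[mapping["color"]]],
        }
        for row in rows
    ]
    return marks, legend


marks, legend = resolve_mapping(rows, mapping)
print(legend)
```

Changing the chart type in this toy would mean drawing the same marks with a different shape; the mapping itself would not change.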

2. The Grammar: ggplot, aes, geom

aes() (short for “aesthetics”) maps DataFrame columns to visual properties: x, y, color, fill, size. The geom_*() you add decides what shape represents each row. Swapping geom_point() for geom_line() or geom_bar() is often the only change needed to turn one chart type into another:

from lets_plot import geom_bar

course_means = results.groupby("course", as_index=False)["exam_score"].mean()

ggplot(course_means, aes(x="course", y="exam_score")) + geom_bar(stat="identity", fill="#4477AA")

stat="identity" tells geom_bar() to plot the exam_score column exactly as given, rather than its default behaviour of counting rows per category. Compare this to Part 5’s bar chart: the data preparation (groupby().mean()) is identical; only the drawing step has changed.
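The difference between counting rows and plotting a value as given can be sketched without any plotting library. The records below are made up, and the code only mimics the two calculations; it does not call lets-plot.

```python
from collections import Counter
from statistics import fmean

# Made-up (course, exam_score) records for illustration.
records = [
    ("Machine Learning", 70.0),
    ("Machine Learning", 66.0),
    ("Statistics", 72.0),
    ("Data Structures", 75.0),
    ("Statistics", 68.0),
    ("Machine Learning", 74.0),
]

# What a counting stat computes: bar height = number of rows per category.
counts = Counter(course for course, _ in records)

# What you pre-compute yourself and hand over with stat="identity":
# bar height = the value already in the column (here, the mean score).
scores_by_course = {}
for course, score in records:
    scores_by_course.setdefault(course, []).append(score)
means = {course: fmean(scores) for course, scores in scores_by_course.items()}

print(counts)
print(means)
```

The first dictionary answers "how many students per course", the second "what was the average score per course"; the same bar geometry can draw either, which is why the stat has to be stated.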

position="dodge" places bars for each group side by side instead of stacking them, useful when you want to compare values across two grouping variables at once. The data needs a category column on x and a group column on fill:

course_semester_means = results.groupby(["course", "semester"], as_index=False)["exam_score"].mean()

(
    ggplot(course_semester_means, aes(x="course", y="exam_score", fill="semester"))
    + geom_bar(stat="identity", position="dodge")
)
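As a rough picture of what dodging does to bar positions, the sketch below computes bar centres for each (category, group) pair. The function name dodge_centers and the total width of 0.9 are assumptions made for this illustration, not values read from lets-plot.

```python
def dodge_centers(n_categories, n_groups, total_width=0.9):
    """Return {(category_index, group_index): bar_center_x} for dodged bars.

    total_width is the share of each category slot covered by bars
    (0.9 is an assumed value for this sketch).
    """
    bar_width = total_width / n_groups
    centers = {}
    for cat in range(n_categories):
        left_edge = cat - total_width / 2
        for grp in range(n_groups):
            # Each group gets its own slice of the category slot.
            centers[(cat, grp)] = left_edge + (grp + 0.5) * bar_width
    return centers


# Three courses on x, two semesters as fill groups.
centers = dodge_centers(n_categories=3, n_groups=2)
for key, x in sorted(centers.items()):
    print(key, round(x, 3))
```

With stacking, both semesters would share one centre per course; with dodging, each semester is shifted to its own side of that centre.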

Activity 1 - Histogram, Declaratively

Goal: Rebuild the Part 5 histogram of exam_score using geom_histogram(). Map x="exam_score" in aes(), and pass bins=20 to geom_histogram().
Solution:
from lets_plot import geom_histogram

ggplot(results, aes(x="exam_score")) + geom_histogram(bins=20, fill="#4477AA")
# TODO: build the histogram described above
...

3. Mapping vs. Setting

aes(color="course") maps the course column to colour: each course gets its own colour, chosen automatically, with a legend. color="#4477AA" sets every point to the exact same fixed colour, with no legend, because there is nothing left to distinguish. Confusing the two is the single most common mistake when learning any grammar of graphics, in Python or R:

# Mapping: course determines colour, one colour per course, with a legend
ggplot(results, aes(x="study_hours", y="exam_score", color="course")) + geom_point(alpha=0.6)
# Setting: every point is the same fixed colour, no legend needed
ggplot(results, aes(x="study_hours", y="exam_score")) + geom_point(color="#4477AA", alpha=0.6)

Common Mistake: Putting a fixed value inside aes()

aes(color="#4477AA") tries to map the literal string "#4477AA" as if it were a column name. Lets-Plot will either error or, worse, silently treat it as a constant category and draw a one-entry legend that says #4477AA. A fixed value is a setting and belongs as a plain keyword argument to the geom_*(), outside aes(). A column name that should vary per row is a mapping and belongs inside aes().

4. Layering: Raw Data and a Statistical Summary Together

Because every geom_*() is its own layer, you can stack a raw-data layer and a computed-summary layer on the same aes() mapping. geom_smooth() fits a trend line to the data it is given, with a shaded confidence band, entirely declaratively:

from lets_plot import geom_smooth

(
    ggplot(results, aes(x="study_hours", y="exam_score"))
    + geom_point(alpha=0.3, color="#4477AA")
    + geom_smooth(color="#EE6677", se=True)
)

Pro Tip: State the smoothing method explicitly

In Lets-Plot, geom_smooth() fits a straight line (method='lm') unless you ask for something else. This differs from ggplot2, which picks LOESS automatically for small datasets. Pass method='loess' when you want a locally weighted curve that follows bends in the data, and write method='lm' explicitly when a straight-line fit is the point of the chart. se=True/False controls the confidence band the same way for both methods:

# Linear (OLS): forces a straight line, shows the parametric fit
geom_smooth(method='lm', se=True)

# LOESS: follows local curves in the data
geom_smooth(method='loess', se=True)

Lets-Plot computes both fits itself, so neither needs an extra statistics package. Both display the 95% confidence band when se=True.
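The straight-line option, method='lm', is an ordinary least squares fit. A minimal sketch of that calculation in plain Python follows; it ignores the confidence band, uses made-up numbers, and the helper name ols_line is invented for this example.

```python
from statistics import fmean


def ols_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data lying exactly on y = 60 + 0.5 * x.
hours = [0.0, 5.0, 10.0, 15.0, 20.0]
scores = [60.0, 62.5, 65.0, 67.5, 70.0]

slope, intercept = ols_line(hours, scores)
print(slope, intercept)
```

A LOESS curve repeats a weighted version of this fit in a sliding neighbourhood around each x value, which is why it can bend where a single global line cannot.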

geom_density() is the smooth-curve equivalent of a histogram, useful when comparing several distributions that would otherwise overlap into an unreadable stack of bars:

from lets_plot import geom_density

ggplot(results, aes(x="exam_score", fill="course")) + geom_density(alpha=0.4)
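The smooth curve behind a density plot is a kernel density estimate: one small bump per observation, summed and normalised. The sketch below assumes a Gaussian kernel and a bandwidth picked by hand; the data and the helper name gaussian_kde are made up, and Lets-Plot chooses its own bandwidth by default.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on each sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


scores = [58.0, 63.0, 66.0, 70.0, 71.0, 74.0, 79.0, 85.0]  # made-up exam scores
density = gaussian_kde(scores, bandwidth=5.0)  # bandwidth chosen by hand for the sketch

# Evaluate on a grid and check the area under the curve with a Riemann sum.
step = 0.5
grid = [20.0 + i * step for i in range(241)]  # 20 .. 140
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

Because every curve integrates to one, densities of groups with different sizes stay comparable, which is what makes overlapping them with alpha readable.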

geom_boxplot() summarises a distribution as median, quartiles, and outliers per group, the declarative equivalent of Part 5’s sns.boxplot(). Because it is just another geom_*(), it composes with aes() and faceting exactly like any other layer:

from lets_plot import geom_boxplot

ggplot(results, aes(x="course", y="exam_score", fill="course")) + geom_boxplot(alpha=0.7)

Pro Tip: Swap geom_boxplot() for geom_violin() when shape matters

A boxplot gives you a five-number summary: the lower whisker end, first quartile, median, third quartile, and upper whisker end, with points beyond the whiskers drawn individually as outliers. A violin gives you the full density shape on both sides of the axis, which reveals skew or multiple peaks that the box hides. The aes() mapping stays identical: just change the geom_*().
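The numbers a Tukey-style boxplot is drawn from can be sketched with the standard library. The 1.5 × IQR whisker rule and the "inclusive" quantile method are assumptions of this sketch; the quantile rule Lets-Plot uses may give slightly different quartiles. The helper name box_stats and the data are invented.

```python
from statistics import quantiles


def box_stats(values, whisker_factor=1.5):
    """Tukey-style boxplot summary: quartiles, whisker ends, and outliers."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Whiskers stop at the most extreme points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Note that the upper whisker ends at 9 rather than at the maximum of 30, which is reported separately as an outlier.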

Pro Tip: Layers compose in the order you add them

geom_point() + geom_smooth() draws points first, then the trend line on top. Reversing the order draws the line first and the points on top of it. This matters most when a layer has a solid fill that would otherwise hide whatever was drawn before it.

Pro Tip: Add marginal distribution plots with ggmarginal()

Lets-Plot’s ggmarginal() adds histograms, density curves, or boxplots along the margins of a scatter plot, letting you see both the bivariate relationship and the individual distributions without a separate figure. It is one more layer added with +:

from lets_plot import geom_density, ggmarginal

p = (
    ggplot(results, aes(x="study_hours", y="exam_score", color="course"))
    + geom_point(alpha=0.4)
    + geom_smooth(method="lm", se=False)
)
p + ggmarginal("tr", layer=geom_density(alpha=0.3))

The first argument names the sides to draw on, using any combination of t, b, l, and r (top, bottom, left, right). layer takes the geom to draw there, such as geom_density(), geom_histogram(), or geom_boxplot(). Because the marginal layer is an ordinary geom, it picks up the plot-level color mapping, so each course gets its own density curve.

5. Faceting: One Plot, Many Panels

Part 5’s three-panel histogram needed a manual loop over axes.flat, plus sharey=True to keep the comparison fair. facet_wrap() does both in one line, and shares scales across panels by default, the opposite of matplotlib’s default and exactly what you want for a fair comparison:

from lets_plot import facet_wrap, geom_histogram, ggsize

(
    ggplot(results, aes(x="exam_score"))
    + geom_histogram(bins=15, fill="#4477AA")
    + facet_wrap(facets="course")
    + ggsize(700, 250)
)

Key Concept: Faceting replaces a loop with a declaration

facet_wrap(facets="course") splits the data by the course column and draws one panel per group, using the exact same aes() and geom_*() for every panel. There is no loop to write and no risk of accidentally giving one panel a different scale than the others, the bug from Part 5’s Common Mistake callout. Add a second facet variable with facet_grid() for a full grid of panels.
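Conceptually, faceting is a group-by followed by the same drawing recipe for each group, with the axis range taken from the whole dataset. The plain-Python sketch below mimics only that bookkeeping; the rows and the helper name facet_split are made up, and this is not the library's implementation.

```python
from collections import defaultdict

# Made-up rows for illustration.
rows = [
    {"course": "Statistics", "exam_score": 71.0},
    {"course": "Machine Learning", "exam_score": 55.0},
    {"course": "Statistics", "exam_score": 88.0},
    {"course": "Data Structures", "exam_score": 79.0},
    {"course": "Machine Learning", "exam_score": 64.0},
]


def facet_split(rows, facet_column, value_column):
    """Split rows into one panel per level and compute one shared axis range."""
    panels = defaultdict(list)
    for row in rows:
        panels[row[facet_column]].append(row[value_column])
    # Shared scale: the range comes from all rows, not from each panel separately.
    all_values = [row[value_column] for row in rows]
    shared_range = (min(all_values), max(all_values))
    return dict(panels), shared_range


panels, shared_range = facet_split(rows, "course", "exam_score")
for course, values in panels.items():
    print(course, values, "x-range:", shared_range)
```

Computing the range once, outside the per-panel loop, is what removes the chance of one panel silently getting its own scale.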

When you have two grouping variables, facet_grid() builds a full grid of panels: one row per level of one variable and one column per level of the other. For this dataset, rows are semesters and columns are courses:

from lets_plot import facet_grid

(
    ggplot(results, aes(x="exam_score"))
    + geom_histogram(bins=12, fill="#4477AA")
    + facet_grid(y="semester", x="course")
    + ggsize(850, 450)
)

Activity 2 - Facet by Semester

Goal: Build a scatter plot of study_hours vs. exam_score, coloured by course, faceted into one panel per semester.
Solution:
(
    ggplot(results, aes(x="study_hours", y="exam_score", color="course"))
    + geom_point(alpha=0.5)
    + facet_wrap(facets="semester")
)
# TODO: scatter plot, coloured by course, faceted by semester
...

6. Titles, Labels, and Scales

Every chart so far has used whatever axis labels and legend titles lets-plot generates by default: column names from the DataFrame, which are fine for exploration but not for sharing. labs() replaces all of them in one layer. scale_color_manual() and scale_fill_manual() give you exact control over which colour maps to which group. Passing pro_colors from ark.plot.theme uses the project’s brand palette, the same one modern_theme() applies globally in Part 7:

from lets_plot import labs, scale_color_manual

from ark.plot.theme import pro_colors

(
    ggplot(results, aes(x="study_hours", y="exam_score", color="course"))
    + geom_point(alpha=0.5)
    + geom_smooth(se=False)
    + labs(
        title="Study hours versus exam score",
        subtitle="Each point is one student; lines show the per-course trend",
        x="Weekly study hours",
        y="Exam score (0-100)",
        color="Course",
    )
    + scale_color_manual(values=pro_colors)
)

Key Concept: labs() renames anything auto-generated from a column name

labs(title=..., x=..., y=..., color=...) is one more + layer. The named argument matches the aesthetic: use color= when you have aes(color="course") and fill= when you have aes(fill="course"). pro_colors from ark.plot.theme is the project’s brand palette; passing it to scale_color_manual() connects this chart to the same colour system that modern_theme() applies globally in Part 7.
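A manual colour scale is, at heart, a lookup from category level to colour. The sketch below shows that lookup built either by position or by name; the helper name manual_scale and the colours are stand-ins chosen for this example, not the project's pro_colors.

```python
def manual_scale(levels, values):
    """Pair each category level with an explicit colour.

    `values` may be a list (matched to levels by position) or a dict
    (matched to levels by name).
    """
    if isinstance(values, dict):
        missing = [level for level in levels if level not in values]
        if missing:
            raise KeyError(f"no colour given for: {missing}")
        return {level: values[level] for level in levels}
    if len(values) < len(levels):
        raise ValueError("need at least one colour per level")
    return dict(zip(levels, values))


courses = ["Machine Learning", "Data Structures", "Statistics"]
palette = ["#4477AA", "#EE6677", "#228833"]  # stand-in colours for the sketch

by_position = manual_scale(courses, palette)
by_name = manual_scale(
    courses,
    {
        "Statistics": "#228833",
        "Machine Learning": "#4477AA",
        "Data Structures": "#EE6677",
    },
)
print(by_position == by_name)
```

Matching by name is the safer habit when the order of levels in the data might change, since a positional list silently reassigns colours if a category appears earlier or later.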

Activity 3 - Label the Density Chart

Goal: Take the geom_density() chart from Section 4 and add a proper title, axis labels, and legend title with labs().
Solution (replace each ... with your own text):
(
    ggplot(results, aes(x="exam_score", fill="course"))
    + geom_density(alpha=0.4)
    + labs(title=..., x=..., y=..., fill=...)
)
# TODO: add labs() to the density chart from Section 4
...

Capstone: The Three-Panel Report, Declaratively

Part 5’s capstone built a three-panel course report with a manual (1, 3) grid of Axes. Here, build the same idea (one histogram per course) as a single faceted chart instead of three separate panels assembled by hand.

Capstone Exercise - Faceted Course Report

Goal: Build one chart: a histogram of exam_score, faceted by course, with geom_vline() marking the overall mean score so each course’s panel can be compared against it.
Solution:
from lets_plot import geom_vline

overall_mean = results["exam_score"].mean()

(
    ggplot(results, aes(x="exam_score"))
    + geom_histogram(bins=15, fill="#4477AA")
    + geom_vline(xintercept=overall_mean, color="#EE6677", linetype="dashed")
    + facet_wrap(facets="course")
    + ggsize(700, 250)
)

Hint: geom_vline() takes a fixed xintercept, a setting, not a mapping, so it goes outside aes() just like the fixed colours in Sec. 3.

from lets_plot import geom_vline

overall_mean = results["exam_score"].mean()

# TODO: faceted histogram with a reference line at overall_mean
...

Further Reading

Resource | Why it matters
Wilkinson, L. (2005). The Grammar of Graphics, 2nd ed. Springer. | The theory behind the layered grammar; Lets-Plot, ggplot2, and Vega-Altair are all implementations of this framework
Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics 19(1), 3–28. | The ggplot2 paper; most directly explains geom_*, aes(), and stat_* concepts that Lets-Plot mirrors
Lets-Plot documentation | The primary API reference; the gallery is the fastest way to find the right geom_*
ggplot2 book (Wickham, 2016) | Free online; Lets-Plot’s API maps closely to ggplot2, so this book is directly useful

Summary

Concept | Key rule
Declarative vs. imperative | Describe the mapping, let the library decide how to draw it
ggplot(data, aes(...)) | Declares the dataset and which columns map to which visual property
geom_*() | Decides what shape represents each row; swap it to change chart type
Mapping | aes(color="course"), a column that varies per row, gets a legend
Setting | color="#4477AA", a fixed value, no legend
position="dodge" | Places bars for each group side by side instead of stacking them
Layering | Stack geom_*() calls with +; later layers draw on top of earlier ones
geom_smooth() / geom_density() | Statistical summaries, computed declaratively from raw data
geom_boxplot() / geom_violin() | Distribution per group: box for five-number summary, violin for full shape
facet_wrap() | One declaration replaces a manual subplot loop, with shared scales by default
facet_grid(x=..., y=...) | Full grid of panels for two grouping variables
labs() | Replaces any auto-generated axis label or legend title in one layer
scale_color_manual() | Maps groups to specific colours, overriding the default palette

Next: 07-data-storytelling.ipynb, covering what makes a chart good and applying this project’s house style to both matplotlib and lets-plot.