from lets_plot import LetsPlot
LetsPlot.setup_html()
Part 6: Grammar of Graphics with Lets-Plot
DS-MLOps Python Foundations
Python 3.12+ | Author: Anthony Faustine
Before you begin
This notebook assumes you have completed Part 5 (05-matplotlib.ipynb). If you have not, start there: this notebook rebuilds several of its charts on purpose, so the contrast between the two ways of thinking about a plot is concrete rather than abstract.
Part 5 was imperative: you told matplotlib exactly which method to call, in which order, for every piece of the chart. This part is declarative: you describe the data and the mapping you want, and the library works out how to draw it. This style is called the grammar of graphics, and lets-plot implements it in Python the way ggplot2 implements it in R. Part 7 (07-data-storytelling.ipynb) covers what makes a chart good and applies this project’s house style to both libraries.
Callout markers used throughout this notebook are explained on the book cover page.
0. Why Grammar of Graphics?
You have already drawn scatter plots with Matplotlib by calling ax.scatter(x, y, c=color, s=size). If you wanted a different kind of mark, you called a different method. If you wanted a trend line, you called yet another. The plot grew by accumulating function calls, each one doing something slightly different.
There is another way to think about it. A chart is not a list of drawing commands: it is a mapping from data columns to visual channels (position, colour, shape, size). State that mapping once, and the library figures out how to draw it. Add a layer (a trend line, a rug, a label) and you extend the mapping rather than calling a new drawing function. This is the Grammar of Graphics, introduced by Leland Wilkinson in 1999 and popularised by Hadley Wickham’s ggplot2 for R.
Lets-Plot (lets-plot.org) is JetBrains’ Python implementation of the same grammar. Its API mirrors ggplot2 so closely that R users can read Lets-Plot code without a translation guide. It renders to HTML in Jupyter, exports static images such as PNG for reports, and is supported in JetBrains’ Datalore notebooks.
Alternatives that use the same grammar
| Library | Language | Notes |
|---|---|---|
| ggplot2 (ggplot2.tidyverse.org) | R | The original; Lets-Plot mirrors it deliberately |
| plotnine (plotnine.org) | Python | ggplot2 port; similar API, Matplotlib backend |
| Lets-Plot (lets-plot.org) | Python / Kotlin | HTML-first output, fast, maintained by JetBrains |
| Vega-Altair (altair-viz.github.io) | Python | Also declarative, but built on the Vega-Lite grammar rather than ggplot2's; interactive |
Already in your environment
uv add lets-plot # for a standalone project
1. Declarative vs. Imperative
In Part 5, building a scatter plot meant calling ax.scatter() directly: you were the one deciding which function draws which shape. The grammar of graphics flips this around. You describe what the data means (this column is the x position, this column is the colour) and the library decides how to draw it. The same description works whether you add one point or one million, and stays valid even if you later change the chart type entirely.
Rebuild the same dataset from Part 5: exam results across three courses and two semesters.
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
courses = np.array(["Machine Learning", "Data Structures", "Statistics"])
semesters = np.array(["Fall 2024", "Spring 2025"])
n_per_group = 60
course_col = np.repeat(courses, n_per_group * len(semesters))
semester_col = np.tile(np.repeat(semesters, n_per_group), len(courses))
course_base = {"Machine Learning": 68, "Data Structures": 74, "Statistics": 71}
semester_bump = {"Fall 2024": 0, "Spring 2025": 4}
exam_score = np.array(
    [rng.normal(course_base[c] + semester_bump[s], 10) for c, s in zip(course_col, semester_col, strict=True)]
).clip(0, 100)
study_hours = rng.uniform(0, 25, size=len(course_col))
results = pd.DataFrame(
    {
        "course": course_col,
        "semester": semester_col,
        "exam_score": exam_score,
        "study_hours": study_hours,
    }
)
results.head()
| | course | semester | exam_score | study_hours |
|---|---|---|---|---|
| 0 | Machine Learning | Fall 2024 | 71.047171 | 3.940291 |
| 1 | Machine Learning | Fall 2024 | 57.600159 | 3.694584 |
| 2 | Machine Learning | Fall 2024 | 75.504512 | 23.403187 |
| 3 | Machine Learning | Fall 2024 | 77.405647 | 10.947601 |
| 4 | Machine Learning | Fall 2024 | 48.489648 | 9.582996 |
Here is the Part 5 scatter plot (study hours vs. exam score) again, this time declaratively:
from lets_plot import aes, geom_point, ggplot
ggplot(results, aes(x="study_hours", y="exam_score")) + geom_point(alpha=0.4)
Key Concept: The Grammar of Graphics
Every lets-plot chart is built from the same three pieces: ggplot(data, ...) declares the dataset, aes(...) inside it declares which columns map to which visual property, and one or more geom_*() layers, added with +, say what shape to draw with that mapping. Change the geom_*() and the same aes() mapping produces a completely different chart type, often without touching anything else.
2. The Grammar: ggplot, aes, geom
aes() (short for “aesthetics”) maps DataFrame columns to visual properties: x, y, color, fill, size. The geom_*() you add decides what shape represents each row. Swapping geom_point() for geom_line() or geom_bar() is often the only change needed to turn one chart type into another:
from lets_plot import geom_bar
course_means = results.groupby("course", as_index=False)["exam_score"].mean()
ggplot(course_means, aes(x="course", y="exam_score")) + geom_bar(stat="identity", fill="#4477AA")
stat="identity" tells geom_bar() to plot the exam_score column exactly as given, rather than its default behaviour of counting rows per category. Compare this to Part 5’s bar chart: the data preparation (groupby().mean()) is identical; only the drawing step changed.
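To make the difference between the two behaviours concrete, here is a pandas-only sketch of the two tables involved. The five-row `scores` frame is hypothetical, and the row-count table only approximates what the default counting stat computes internally.

```python
import pandas as pd

# Hypothetical mini dataset: two courses with different numbers of students.
scores = pd.DataFrame(
    {
        "course": ["ML", "ML", "Stats", "Stats", "Stats"],
        "exam_score": [60.0, 80.0, 70.0, 75.0, 65.0],
    }
)

# Roughly what the default geom_bar() plots: bar height = number of rows per course.
row_counts = scores.groupby("course", as_index=False).size()

# What you prepare yourself and plot with stat="identity": bar height = the value given.
course_means = scores.groupby("course", as_index=False)["exam_score"].mean()

print(row_counts)
print(course_means)
```

In this toy data both courses have the same mean but different row counts, so the two bar charts would look different even though they come from the same rows.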
position="dodge" places bars for each group side by side instead of stacking them, useful when you want to compare values across two grouping variables at once. The data needs a category column on x and a group column on fill:
course_semester_means = results.groupby(["course", "semester"], as_index=False)["exam_score"].mean()
(
    ggplot(course_semester_means, aes(x="course", y="exam_score", fill="semester"))
    + geom_bar(stat="identity", position="dodge")
)
Activity 1 - Histogram, Declaratively
Build a histogram of exam_score using geom_histogram(). Map x="exam_score" in aes(), and pass bins=20 to geom_histogram().
from lets_plot import geom_histogram
ggplot(results, aes(x="exam_score")) + geom_histogram(bins=20, fill="#4477AA")
# TODO: build the histogram described above
...
3. Mapping vs. Setting
aes(color="course") maps the course column to colour: each course gets its own colour, chosen automatically, with a legend. color="#4477AA" sets every point to the exact same fixed colour, with no legend, because there is nothing left to distinguish. Confusing the two is the single most common mistake when learning any grammar of graphics, in Python or R:
# Mapping: course determines colour, one colour per course, with a legend
ggplot(results, aes(x="study_hours", y="exam_score", color="course")) + geom_point(alpha=0.6)
# Setting: every point is the same fixed colour, no legend needed
ggplot(results, aes(x="study_hours", y="exam_score")) + geom_point(color="#4477AA", alpha=0.6)
Common Mistake: Putting a fixed value inside aes()
aes(color="#4477AA") tries to map the literal string "#4477AA" as if it were a column name. Lets-plot will either error or, worse, silently treat it as a constant category and draw a one-entry legend that says #4477AA. A fixed value is a setting and belongs as a plain keyword argument to the geom_*(), outside aes(). A column name that should vary per row is a mapping and belongs inside aes().
4. Layering: Raw Data and a Statistical Summary Together
Because every geom_*() is its own layer, you can stack a raw-data layer and a computed-summary layer on the same aes() mapping. geom_smooth() fits a trend line to the data it is given, with a shaded confidence band, entirely declaratively:
from lets_plot import geom_smooth
(
    ggplot(results, aes(x="study_hours", y="exam_score"))
    + geom_point(alpha=0.3, color="#4477AA")
    + geom_smooth(color="#EE6677", se=True)
)
Pro Tip: Choose between method='lm' and method='loess'
geom_smooth() supports a straight-line fit (method='lm') and a LOESS (locally weighted) fit (method='loess') that follows local curves in the data. The Lets-Plot API reference lists 'lm' as the default, unlike ggplot2, which picks LOESS for small datasets, so pass method explicitly whenever the shape of the fit matters. se=True/False controls the confidence band the same way for both methods:
# LOESS: follows local curves in the data
geom_smooth(method='loess', se=True)
# Linear (OLS): forces a straight line, shows the parametric fit
geom_smooth(method='lm', se=True)
Lets-Plot computes both fits itself, so neither method needs an extra Python package such as statsmodels. Both display a 95% confidence band by default when se=True.
geom_density() is the smooth-curve equivalent of a histogram, useful when comparing several distributions that would otherwise overlap into an unreadable stack of bars:
from lets_plot import geom_density
ggplot(results, aes(x="exam_score", fill="course")) + geom_density(alpha=0.4)
geom_boxplot() summarises a distribution as median, quartiles, and outliers per group, the declarative equivalent of Part 5’s sns.boxplot(). Because it is just another geom_*(), it composes with aes() and faceting exactly like any other layer:
from lets_plot import geom_boxplot
ggplot(results, aes(x="course", y="exam_score", fill="course")) + geom_boxplot(alpha=0.7)
Pro Tip: Swap geom_boxplot() for geom_violin() when shape matters
A boxplot gives you five numbers: the lower whisker end, first quartile, median, third quartile, and upper whisker end, with any points beyond the whiskers drawn individually as outliers. A violin gives you the full density shape on both sides of the axis, which reveals skew or multiple peaks that the box hides. The aes() mapping stays identical: just change the geom_*().
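As a worked example of those five numbers, the sketch below computes them by hand for ten hypothetical scores. It assumes the common Tukey convention (whiskers reach the most extreme points within 1.5 × IQR of the box) and linear quantile interpolation; the exact quantile method a plotting library uses may differ slightly.

```python
import statistics

# Hypothetical scores, already sorted, with one unusually high value.
scores = [52, 61, 64, 68, 70, 72, 75, 79, 83, 99]

# Quartiles with linear interpolation between neighbouring values.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# Tukey fences: anything beyond 1.5 * IQR from the box counts as an outlier.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [s for s in scores if lower_fence <= s <= upper_fence]
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

# Whiskers end at the most extreme points that are still inside the fences.
five_numbers = (min(inside), q1, median, q3, max(inside))
print(five_numbers)  # (52, 65.0, 71.0, 78.0, 83)
print(outliers)      # [99]
```

The score of 99 is drawn as a separate point, so the upper whisker stops at 83 rather than at the maximum.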
Pro Tip: Layers compose in the order you add them
geom_point() + geom_smooth() draws points first, then the trend line on top. Reversing the order draws the line first and the points on top of it. This matters most when a layer has a solid fill that would otherwise hide whatever was drawn before it.
Pro Tip: Add marginal distribution plots with ggmarginal()
Lets-Plot’s ggmarginal() adds histograms, density curves, or boxplots along the margins of a scatter plot, letting you see both the bivariate relationship and the individual distributions without a separate figure. It is added to the plot with + like any other layer:
from lets_plot import ggmarginal
p = (
    ggplot(results, aes(x="study_hours", y="exam_score", color="course"))
    + geom_point(alpha=0.4)
    + geom_smooth(method="lm", se=False)
)
p + ggmarginal("tr", layer=geom_density(alpha=0.3))
The first argument names the sides to draw on, any combination of "t" (top), "b" (bottom), "l" (left), and "r" (right). layer takes the geom to draw there, such as geom_density() (KDE curves), geom_histogram(), or geom_boxplot(). The marginal layer inherits the plot's aes() mapping, including color, so each course gets its own density curve.
5. Faceting: One Plot, Many Panels
Part 5’s three-panel histogram needed a manual loop over axes.flat, plus sharey=True to keep the comparison fair. facet_wrap() does both in one line, and shares scales across panels by default, the opposite of matplotlib’s default and exactly what you want for a fair comparison:
from lets_plot import facet_wrap, geom_histogram, ggsize
(
    ggplot(results, aes(x="exam_score"))
    + geom_histogram(bins=15, fill="#4477AA")
    + facet_wrap(facets="course")
    + ggsize(700, 250)
)
Key Concept: Faceting replaces a loop with a declaration
facet_wrap(facets="course") splits the data by the course column and draws one panel per group, using the exact same aes() and geom_*() for every panel. There is no loop to write and no risk of accidentally giving one panel a different scale than the others, the bug from Part 5’s Common Mistake callout. Add a second facet variable with facet_grid() for a full grid of panels.
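For intuition, here is a rough sketch of the bookkeeping that faceting takes off your hands, written with pandas and NumPy on a hypothetical seven-row `scores` frame. Lets-Plot does its own binning internally; this only mimics the idea of splitting by group while keeping one shared x scale.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with two courses.
scores = pd.DataFrame(
    {
        "course": ["Statistics"] * 4 + ["Data Structures"] * 3,
        "exam_score": [55.0, 62.0, 71.0, 88.0, 64.0, 77.0, 93.0],
    }
)

# Shared bin edges computed once from ALL rows, so every panel uses the same x scale.
edges = np.histogram_bin_edges(scores["exam_score"], bins=4)

# One "panel" per group: the loop that facet_wrap(facets="course") makes unnecessary.
panels = {
    course: np.histogram(group["exam_score"], bins=edges)[0]
    for course, group in scores.groupby("course")
}

for course, counts in panels.items():
    print(course, counts.tolist())
```

Computing the edges per group instead of once for all rows would be the manual equivalent of giving each panel its own scale.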
When you have two grouping variables, facet_grid() builds a full grid of panels: one row per level of one variable and one column per level of the other. For this dataset, rows are semesters and columns are courses:
from lets_plot import facet_grid
(
    ggplot(results, aes(x="exam_score"))
    + geom_histogram(bins=12, fill="#4477AA")
    + facet_grid(y="semester", x="course")
    + ggsize(850, 450)
)
Activity 2 - Facet by Semester
Build a scatter plot of study_hours vs. exam_score, coloured by course, faceted into one panel per semester.
ggplot(results, aes(x="study_hours", y="exam_score", color="course")) \
+ geom_point(alpha=0.5) \
+ facet_wrap(facets="semester")
# TODO: scatter plot, coloured by course, faceted by semester
...
6. Titles, Labels, and Scales
Every chart so far has used whatever axis labels and legend titles lets-plot generates by default: column names from the DataFrame, which are fine for exploration but not for sharing. labs() replaces all of them in one layer. scale_color_manual() and scale_fill_manual() give you exact control over which colour maps to which group. Passing pro_colors from ark.plot.theme uses the project’s brand palette, the same one modern_theme() applies globally in Part 7:
from lets_plot import labs, scale_color_manual
from ark.plot.theme import pro_colors
(
    ggplot(results, aes(x="study_hours", y="exam_score", color="course"))
    + geom_point(alpha=0.5)
    + geom_smooth(se=False)
    + labs(
        title="Study hours versus exam score",
        subtitle="Each point is one student; lines show the per-course trend",
        x="Weekly study hours",
        y="Exam score (0-100)",
        color="Course",
    )
    + scale_color_manual(values=pro_colors)
)
Key Concept: labs() renames anything auto-generated from a column name
labs(title=..., x=..., y=..., color=...) is one more + layer. The named argument matches the aesthetic: use color= when you have aes(color="course") and fill= when you have aes(fill="course"). pro_colors from ark.plot.theme is the project’s brand palette; passing it to scale_color_manual() connects this chart to the same colour system that modern_theme() applies globally in Part 7.
Activity 3 - Label the Density Chart
Take the geom_density() chart from Section 4 and add a proper title, axis labels, and legend title with labs().
ggplot(results, aes(x="exam_score", fill="course")) + geom_density(alpha=0.4) \
+ labs(title=..., x=..., y=..., fill=...)
# TODO: add labs() to the density chart from Section 4
...
Capstone: The Three-Panel Report, Declaratively
Part 5’s capstone built a three-panel course report with a manual (1, 3) grid of Axes. Here, build the same idea (one histogram per course) as a single faceted chart instead of three separate panels assembled by hand.
Capstone Exercise - Faceted Course Report
Build a histogram of exam_score, faceted by course, with geom_vline() marking the overall mean score so each course’s panel can be compared against it.
overall_mean = results["exam_score"].mean()
ggplot(results, aes(x="exam_score")) \
+ geom_histogram(bins=15, fill="#4477AA") \
+ geom_vline(xintercept=overall_mean, color="#EE6677", linetype="dashed") \
+ facet_wrap(facets="course") \
+ ggsize(700, 250)
Hint: geom_vline() takes a fixed xintercept, a setting, not a mapping, so it goes outside aes() just like the fixed colours in Sec. 3.
from lets_plot import geom_vline
overall_mean = results["exam_score"].mean()
# TODO: faceted histogram with a reference line at overall_mean
...
Further Reading
| Resource | Why it matters |
|---|---|
| Wilkinson, L. (2005). The Grammar of Graphics, 2nd ed. Springer. | The theory behind the layered grammar; Lets-Plot, ggplot2, and Vega-Altair are all implementations of this framework |
| Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics 19(1), 3–28. | The ggplot2 paper; most directly explains geom_*, aes(), and stat_* concepts that Lets-Plot mirrors |
| Lets-Plot documentation | The primary API reference; the gallery is the fastest way to find the right geom_* |
| ggplot2 book (Wickham, 2016) | Free online — Lets-Plot’s API maps closely to ggplot2, so this book is directly useful |
Summary
| Concept | Key rule |
|---|---|
| Declarative vs. imperative | Describe the mapping, let the library decide how to draw it |
| ggplot(data, aes(...)) | Declares the dataset and which columns map to which visual property |
| geom_*() | Decides what shape represents each row; swap it to change chart type |
| Mapping | aes(color="course"), a column that varies per row, gets a legend |
| Setting | color="#4477AA", a fixed value, no legend |
| position="dodge" | Places bars for each group side by side instead of stacking them |
| Layering | Stack geom_*() calls with +; later layers draw on top of earlier ones |
| geom_smooth() / geom_density() | Statistical summaries, computed declaratively from raw data |
| geom_boxplot() / geom_violin() | Distribution per group: box for five-number summary, violin for full shape |
| facet_wrap() | One declaration replaces a manual subplot loop, with shared scales by default |
| facet_grid(x=..., y=...) | Full grid of panels for two grouping variables |
| labs() | Replaces any auto-generated axis label or legend title in one layer |
| scale_color_manual() | Maps groups to specific colours, overriding the default palette |
Next: 07-data-storytelling.ipynb, covering what makes a chart good and applying this project’s house style to both matplotlib and lets-plot.