Part 1 covers Python’s core data vocabulary: variables, types, strings, the four collection types, the standard library’s extra collections, and operators. All examples come from a single realistic scenario: a university analytics platform that tracks student performance, course enrollment, and model experiment logs.
Part 2 (02-control-flow.ipynb) continues directly from this notebook with control flow and comprehensions. Read it right after this one to complete the language foundation.
Callout markers used throughout this notebook are explained on the book cover page.
Before You Begin
What is Python?
Python is a general-purpose programming language first released in 1991. A programming language is a set of rules for writing instructions a computer can execute. Unlike a spreadsheet, code lets you automate tasks, process millions of data points, and build models that learn from data.
Why Python for data science and AI?
Python was not built for data science. It became the de facto standard because of three compounding advantages:
1. Readable syntax. Python code reads closer to plain English than most mainstream languages, so a data scientist can focus on the algorithm rather than the syntax. A loop such as for score in scores: total += score needs no translation.
2. A world-class numerical ecosystem. The core scientific libraries are designed to be used from Python first:

| Library | What it does |
|---|---|
| NumPy | Fast multi-dimensional arrays; the foundation everything else builds on |
| scikit-learn | Classical ML: linear models, trees, SVMs, pipelines |
| PyTorch / JAX | Deep learning: neural networks trained on GPU |
| HuggingFace Transformers | Large language models and vision models |
Nearly every landmark AI model of the last decade (ResNet, BERT, GPT, Llama) has a widely used Python implementation. Reproducing or building on that research in practice means working in Python.
3. Interactive computing with Jupyter. Jupyter notebooks let you run one cell at a time, see results immediately, and iterate without a compile step. This matches how data exploration actually works: inspect the data, transform it, visualise, repeat.
What is a Jupyter notebook?
This file is a Jupyter notebook: a document that mixes formatted text (like this paragraph) with executable code. It consists of cells:
Markdown cells (like this one): formatted text, explanations, tables, equations.
Code cells (the grey boxes below): Python code. Press Shift + Enter to run a cell; its output appears directly below.
Always run cells from top to bottom. Later cells often use variables created in earlier ones. If something breaks, use Kernel → Restart & Run All to start fresh.
What we will build together
Every example uses the same scenario: a university analytics platform tracking student scores, course enrollments, and ML experiment logs. The same data structures recur across every section so you can focus on the Python concept, not a new domain each time.
By the end of Part 2 you will have the full language foundation needed to work with real datasets using NumPy and pandas.
Python vs other languages
The best way to appreciate Python’s readability is side-by-side comparison. Here is “print Hello” in three languages:
This gap widens as programs grow. A 300-line data pipeline in Python stays readable; the equivalent in Java or C++ is usually harder to navigate. That is why Python dominates exploratory, iterative work like data science and ML.
Pro Tip: Running notebooks in VS Code
You can run all notebooks in this book inside VS Code with the IDE Setup tutorial (Part 12) guiding you through the setup. VS Code gives you IntelliSense inside cells, a Variable Inspector, and integrated git, all without leaving the editor. If you prefer JupyterLab, skip ahead — every notebook works identically there.
Note: Learning Objectives
By the end of Part 1 you will be able to:
| # | Skill | Covered in |
|---|---|---|
| 1 | Annotate variables with type hints (list[float], X \| None) | Sec. 1 |
| 2 | Apply PEP 8 naming conventions (snake_case, PascalCase, UPPER_SNAKE) | Sec. 1.4 |
| 3 | Clean, parse, and format strings | Sec. 2 |
| 4 | Choose the right collection for any task | Sec. 3-7 |
| 5 | Use dict \| merge, TypedDict, and NamedTuple | Sec. 5, 4 |
| 6 | Apply the walrus operator := where it clarifies code | Sec. 8 |
Note on forward references: Sections 2-7 occasionally use for loops and class definitions before they are formally introduced. for loops are covered in Part 2 (02-control-flow.ipynb); classes are covered in Part 3 (03-python-patterns.ipynb). Whenever you see for item in collection: early, read it as “repeat this block once per item.” Full explanations follow in their dedicated sections.
1. Variables, Types & Type Hints
What is a variable?
A variable is a named container that stores a value in your program’s memory. Think of it as a labelled box:
name ──► "Alice Kamau"
gpa ──► 3.85
enrolled ──► True
You create a variable with the assignment operator =:
name = "Alice Kamau"  # create a box called 'name', put the value in it
⚠️ The = sign in Python means assign (store this value). It is NOT the mathematical equals sign. To check equality, use == (two equals signs).
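As a small illustration of the difference between = and ==, the sketch below assigns a value once and then compares it twice. The variable name and values are only illustrative.

```python
gpa = 3.85               # = assigns: store 3.85 under the name gpa

is_exact = gpa == 3.85   # == compares: produces a bool
is_perfect = gpa == 4.0

print(is_exact)    # True
print(is_perfect)  # False
```

Comparison never changes the variable: gpa still holds 3.85 after both checks.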
Python’s four core types
Every value has a type: a label describing what kind of data it is:
| Type | What it stores | Examples | Real-world use |
|---|---|---|---|
| int | Whole numbers | 42, 2024001, -7 | Student IDs, epoch counts, ranks |
| float | Decimal numbers | 3.85, 0.001, 92.3 | GPA, learning rate, accuracy |
| str | Text (any characters) | 'Alice', "CS301" | Names, labels, file paths |
| bool | True or False only | True, False | "Is enrolled?", "Did it converge?" |
Python figures out the type of every value automatically. You never need to declare it.
Why add type hints?
Without hints, Python happily lets you store the wrong type in a variable:
gpa = 3.85       # float ✓
gpa = "unknown"  # str - legal but wrong! breaks any later calculation
Type hints are optional annotations that make your intent explicit so that tools can catch mistakes like the one above:
gpa: float = 3.85  # hint says this must be a float
The syntax is name: type = value. Hints are not enforced at runtime: Python will not crash if you violate them, but the type checker ty (and most editors) will report an error as soon as it analyses a line that assigns the wrong type.
Python 3.9+: list[int], dict[str, float] (no imports needed)
Python 3.10+: float | None means "a float, or nothing" (replaces Optional[float])
Key Concept: Type Hints
A type hint annotates what type a variable should hold: name: str = 'Alice'. Hints are read by the type checker (ty) and your editor, not enforced at runtime. Annotate every variable, function parameter, and return value you write.
Start with the simplest possible case: create a few variables and print them. No type hints yet, just the core idea of “give a name to a value”:
# Your first Python variables: no type hints yet
# The = sign puts the value on the right into the name on the left
name = "Alice Kamau"  # text value (str)
score = 87.5          # decimal number (float)
rank = 1              # whole number (int)
enrolled = True       # True or False (bool)

# print() displays a value in the output area below this cell
print(name)
print(score)
print(rank)
print(enrolled)
Alice Kamau
87.5
1
True
Python knows the type of every value. type() reveals it, and isinstance() tests whether a value belongs to a given type. Run this cell to confirm:
# type() tells you what Python has inferred
print(type(name))      # <class 'str'>
print(type(score))     # <class 'float'>
print(type(rank))      # <class 'int'>
print(type(enrolled))  # <class 'bool'>

# Without hints, Python silently lets you overwrite a variable with the wrong type
rank = "first"  # rank was an int, now it's a str: Python allows it
print(f"rank is now a {type(rank).__name__}")  # str!
<class 'str'>
<class 'float'>
<class 'int'>
<class 'bool'>
rank is now a str
That last reassignment (rank = 'first') would silently break any code that later tries to do arithmetic with rank. Type hints prevent this by making your intent explicit. Now see the same variables with proper annotations:
# --- Student enrollment record ---
student_id: int = 2024001
full_name: str = "Maria Garcia"
gpa: float = 3.85
is_enrolled: bool = True
scholarship_amount: float | None = None  # union type: float or None (Python 3.10+)

print(f"Student : {full_name} (ID: {student_id})")
print(f"GPA : {gpa} Enrolled: {is_enrolled}")
print(f"Scholar.: {scholarship_amount}")
Run this to see Python’s runtime type information. isinstance() is preferred over type() because it handles class hierarchies. bool is a subclass of int, so isinstance(True, int) returns True:
# isinstance() is preferred over type() for checks: handles subclasses
print(f"type(gpa) -> {type(gpa)}")
print(f"isinstance(gpa, float) -> {isinstance(gpa, float)}")
print(f"isinstance(gpa, int | float) -> {isinstance(gpa, int | float)}")
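The claim that bool is a subclass of int can be checked directly. This is a minimal sketch with an illustrative variable name; it contrasts isinstance(), which follows the class hierarchy, with an exact type() comparison.

```python
enrolled: bool = True

is_bool = isinstance(enrolled, bool)
is_int = isinstance(enrolled, int)   # bool is a subclass of int
exact_int = type(enrolled) is int    # type() compares the exact class only

print(is_bool)      # True
print(is_int)       # True
print(exact_int)    # False
print(True + True)  # 2: booleans behave like 1 and 0 in arithmetic
```

This is why a check such as isinstance(x, int) also accepts True and False.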
Python also has a built-in complex number type, used in signal processing and Fourier analysis:
# complex numbers: real + imaginary parts
frequency: complex = 3 + 2j  # j is the imaginary unit
print(f"complex : {frequency}")
print(f"real part : {frequency.real}")
print(f"imag part : {frequency.imag}")
print(f"magnitude : {abs(frequency):.3f}")  # |z| = sqrt(real² + imag²)
complex : (3+2j)
real part : 3.0
imag part : 2.0
magnitude : 3.606
Pro Tip: f-string debugging with =
Python 3.8+ added f'{var=}', which prints the variable name and its value in one shot. This is quicker to type than print(f'var = {var}') and far more useful during exploration.
# f'{var=}': name + value, invaluable for debugging
loss: float = 0.4231
epoch: int = 12
learning_rate: float = 0.001

print(f"{loss=}")           # loss=0.4231
print(f"{epoch=}")          # epoch=12
print(f"{learning_rate=}")  # learning_rate=0.001
print(f"{loss:.4f}")        # 0.4231 (formatted, no name)
print(f"{loss * 0.9 = }")   # loss * 0.9 = 0.38078999999999996 (expressions too)
loss=0.4231
epoch=12
learning_rate=0.001
0.4231
loss * 0.9 = 0.38078999999999996
Activity 1 - Annotate a Dataset Row
Goal: Replace each … with the correct type from the table above (int, float, str, bool, or float | None).
How to decide: look at the value on the right of = and ask: “Is it a whole number? A decimal? Text? True/False? Could it be missing?”
Expected: after filling in the hints, your editor should show no type errors.
# TODO: replace each ... with the correct type annotation
course_code: ... = "CS301"
credits: ... = 3
pass_rate: ... = 0.87
instructor: ... = "Dr. Nkosi"
lab_room: ... = None  # lab not yet assigned
is_core_course: ... = True

# When you are done, print each variable with f'{var=}'
print(f"{course_code=}")
print(f"{credits=}")
PEP 8 (Python Enhancement Proposal 8) is the official Python style guide, co-authored by Python's creator Guido van Rossum. Most Python projects follow it; the linter ruff checks many of its rules automatically (ruff check .).
Python defines four naming styles. Each signals a specific role in the language:
snake_case
All lowercase, words joined by underscores. The default style for everything that is not a class or a constant: variables, functions, method names, and module file names.
PascalCase
Every word starts with a capital letter; no underscores. Reserved exclusively for class names, NamedTuples, and TypedDicts: anything that defines a new type.
UPPER_SNAKE_CASE
All uppercase, words separated by underscores. Use only for module-level constants: values set once, never reassigned. The style signals to every reader: "do not change this."
_leading_underscore
A single underscore prefix signals that a name is private or internal: an implementation detail not meant to be called from outside the module or class. Python does not enforce this; it is a convention your team respects.
snake_case for variables & functions | PascalCase for classes & types | UPPER_SNAKE for constants | _leading for internals. The computer ignores these conventions. Your teammates will not. Run ruff check . to catch violations automatically.
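The four styles can be seen side by side in a short sketch. All names below are hypothetical, and the block uses def and class ahead of their formal introduction in Parts 2 and 3; read them simply as "define a function" and "define a type".

```python
MAX_RETRIES: int = 3  # UPPER_SNAKE_CASE: module-level constant


class ExperimentLog:  # PascalCase: defines a new type
    """Placeholder type used only to show the naming style."""


def _normalise_name(raw_name: str) -> str:  # leading underscore: internal helper
    """Trim whitespace and capitalise each word."""
    return raw_name.strip().title()


student_name: str = _normalise_name("  alice kamau ")  # snake_case: variable
print(student_name)  # Alice Kamau
```

Nothing stops other code from calling _normalise_name; the underscore only tells readers that it is an internal detail.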
Common Mistake: Mixing styles
StudentGPA = 3.85 looks like a class (PascalCase), not a variable. LOAD_DATA = lambda: … looks like a constant, not a function. Misleading names cause bugs that are hard to find. Be consistent.
# snake_case: variables and functions
max_epochs: int = 100
learning_rate: float = 0.001
model_accuracy: float = 0.945
is_converged: bool = False  # bool names read like a yes/no question
student_gpa_scores: list[float] = [3.95, 3.45, 3.88]

# UPPER_SNAKE_CASE: module-level constants
MAX_BATCH_SIZE: int = 32
DATASET_PATH: str = "data/students.csv"

# Avoid: cryptic abbreviations
# lr = 0.001  # unclear: is this learning rate? loss ratio?
# ma = 0.945  # unclear
# b = 32      # unclear

# ruff catches naming violations:
# ruff check tutorials/ --> E741 Ambiguous variable name: 'l'

print(f"Accuracy: {model_accuracy:.1%}")  # .1% formats as a percentage
print(f"Converged: {is_converged}")
print(f"Dataset: {DATASET_PATH}")
A string is any piece of text: a student name, a course code, a log message, a file path. Create one by wrapping text in matching quotes:
name = 'Alice Kamau'         # single quotes
course = "Machine Learning"  # double quotes - both work identically
Strings are used constantly in data science: reading CSV column headers, cleaning field values, building file paths, and formatting model output. Python provides dozens of built-in methods, no imports needed.
Key Concept: Strings are Immutable Sequences
A str is an ordered, immutable sequence of Unicode characters. Every string method returns a new string. The original is never changed. In data science you use strings to parse CSV rows, clean field values, build file paths, and format model output. Mastering the handful of methods below covers 95% of string work you will encounter.
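Immutability is easy to confirm in a few lines. The sketch below uses a hypothetical course code: a method call returns a new string, and an attempt to change a character in place raises TypeError.

```python
course_code: str = "cs301"

# String methods return a new string; the original is left untouched
upper_code: str = course_code.upper()
print(course_code)  # cs301
print(upper_code)   # CS301

# In-place modification is not allowed
try:
    course_code[0] = "C"  # type: ignore[index]
except TypeError as exc:
    print(f"Immutable: {exc}")
```

To "change" a string, assign the result of the method back to a name, for example course_code = course_code.upper().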
# f-strings: the standard for formatting output in Python 3.6+
name: str = "Alice Kamau"
score: float = 87.5
rank: int = 3

print(f"Student : {name}")
print(f"Score : {score:.1f}%")  # one decimal place
print(f"Score : {score:.0f}%")  # rounded to integer
print(f"Rank : #{rank:02d}")    # zero-padded two digits
print(f"Pass? : {'Yes' if score >= 70 else 'No'}")
Alignment specifiers ({student:<8}, {s:5.1f}) format values into fixed-width columns. This cell uses a for loop for display; for loops are covered properly in Part 2. Read for student, s in [...]: as "for each (student, score) pair in the list, do this":
# Alignment: useful for building readable reports
for student, s in [("Alice", 92.1), ("Bob", 74.8), ("Carol", 88.5)]:
    bar = "#" * int(s // 10)
    print(f"{student:<8} {s:5.1f} {bar}")
Alice 92.1 #########
Bob 74.8 #######
Carol 88.5 ########
Pro Tip: Recognising Older Formatting Styles
You will encounter two older styles in legacy code and tutorials. Know them so you can read them, but write f-strings.
print("Accuracy: %d%%" % 92)      ← %-formatting (Python 2 era, still valid)
print("Accuracy: {}".format(92))  ← .format() (Python 3.0+, more flexible than %)
print(f"Accuracy: {acc}")         ← f-strings (Python 3.6+, fastest and most readable, use this)
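To make the comparison concrete, the sketch below builds the same message with all three styles. The variable acc and its value are illustrative.

```python
acc: int = 92

old_percent: str = "Accuracy: %d%%" % acc       # %-formatting: %% is a literal percent sign
old_format: str = "Accuracy: {}%".format(acc)   # str.format()
modern: str = f"Accuracy: {acc}%"               # f-string

print(old_percent)  # Accuracy: 92%
print(old_format)   # Accuracy: 92%
print(modern)       # Accuracy: 92%
```

All three produce the same text; the f-string keeps the value next to the place where it appears, which is why it is the easiest to read.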
Cleaning & Parsing
Real-world data always arrives dirty: extra spaces, inconsistent delimiters, mixed case. strip() + split() is the most common two-step clean-up in any data pipeline:
# Cleaning and parsing: the most common string operations in data work
raw_row: str = " Alice Kamau , 2024001 , 3.95 , Computer Science "

# strip() removes leading and trailing whitespace
cleaned: str = raw_row.strip()

# split() on a delimiter returns a list; strip each part too
parts: list[str] = [p.strip() for p in cleaned.split(",")]
name, sid, gpa_str, major = parts

print(f"Name : {name!r}")
print(f"ID : {sid}")
print(f"GPA : {float(gpa_str):.2f}")
print(f"Major : {major}")
Name : 'Alice Kamau'
ID : 2024001
GPA : 3.95
Major : Computer Science
join() is the inverse of split(). It reassembles a list of strings into one string with a chosen separator. replace() and case methods normalise individual field values:
# join() is the inverse of split(): reassemble with a new delimiter
tsv_row: str = "\t".join(parts)
print(f"TSV : {tsv_row!r}")

# replace(): swap delimiters or fix typos
print(cleaned.replace(",", " |"))

# Case methods
tag: str = " machine_learning "
print(tag.strip().replace("_", " ").title())
Test membership, find positions, and count occurrences, all without writing a loop:
# Searching strings: common in log parsing and feature extraction
log: str = "[ERROR] epoch 42: validation loss exceeded threshold (loss=1.234)"

print(f"starts with [ERROR] : {log.startswith('[ERROR]')}")
print(f"ends with threshold : {log.endswith('threshold')}")
print(f'contains "loss" : {"loss" in log}')
print(f'find "epoch" : index {log.find("epoch")}')
print(f'count of "e" : {log.count("e")}')
starts with [ERROR] : True
ends with threshold : False
contains "loss" : True
find "epoch" : index 8
count of "e" : 6
String slicing (s[start:stop]) extracts a substring by position, using the same syntax as list slicing. rpartition(sep) splits at the last occurrence of sep, returning (before, sep, after), which makes it a clean way to separate a filename from its extension:
log: str = "[ERROR] epoch 42: validation loss exceeded threshold (loss=1.234)"

# Extract structured data from a log line
epoch_part: str = log.split("epoch ")[1].split(":")[0]
print(f"Epoch number : {epoch_part}")

# Slicing: same rules as lists
prefix: str = log[:7]  # '[ERROR]'
body: str = log[8:]    # skip the prefix and the space after it
print(f"Prefix : {prefix!r}")
print(f"Body : {body!r}")

# rpartition(): split at the LAST occurrence of a separator
filename: str = "model_experiment_run_42.parquet"
stem, _, ext = filename.rpartition(".")
print(f"stem={stem!r} ext={ext!r}")
Epoch number : 42
Prefix : '[ERROR]'
Body : 'epoch 42: validation loss exceeded threshold (loss=1.234)'
stem='model_experiment_run_42' ext='parquet'
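Two edge cases of rpartition() are worth seeing once: a filename with several dots, and one with no dot at all. The filenames below are hypothetical.

```python
# Several dots: rpartition() splits at the LAST one, partition() at the FIRST
archive: str = "archive.tar.gz"
last_split = archive.rpartition(".")
first_split = archive.partition(".")
print(last_split)   # ('archive.tar', '.', 'gz')
print(first_split)  # ('archive', '.', 'tar.gz')

# No separator present: rpartition() puts the whole string in the LAST slot
readme: str = "README"
no_dot = readme.rpartition(".")
print(no_dot)       # ('', '', 'README')
```

Because of the second case, check that the middle element is non-empty before treating the last element as an extension.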
Activity 2 - Parse a Messy Log Line
Goal: Extract the model name, epoch, and loss value from the raw log string below into typed variables.
When to use a list:
- Order matters: items have a defined first and last position
- You need to add, remove, or change elements after creation
- You are collecting results in a loop: training losses, processed records, file paths
Key operations at a glance:
| Operation | Syntax | Notes |
|---|---|---|
| Index | a[i] | 0-based; negative counts from end |
| Slice | a[start:stop:step] | Returns new list; stop is exclusive |
| Append | a.append(x) | Add one item to the end |
| Extend | a.extend(iterable) | Add all items from another sequence |
| Insert | a.insert(i, x) | Insert before index i |
| Remove | a.remove(x) | Remove first occurrence of value x |
| Pop | a.pop(i) | Remove & return item at index i (default: last) |
| Delete | del a[i] / del a[i:j] | Remove item or slice, returns nothing |
| Clear | a.clear() | Remove all items (same as del a[:]) |
| Membership | x in a | Returns True / False |
| Length | len(a) | Number of items |
| Sort | a.sort() / sorted(a) | In-place vs new list |
| Count | a.count(x) | Occurrences of x |
| Find | a.index(x) | Position of first x |
| Copy | a.copy() | Shallow independent copy |
Key Concept: Ordered & Mutable
A list maintains insertion order and supports in-place modification. Annotate as list[int] (Python 3.9+, no import needed). Full reference: docs.python.org: 5.1 More on Lists
Common Mistake: Assignment Is Not a Copy
b = a makes b point to the same list. Mutating b also changes a. Use b = a.copy() or b = a[:] for an independent copy.
# Quiz scores for a cohort of students
quiz_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]

# Indexing: 0-based; negative index counts from the end
print(f"First : {quiz_scores[0]}")
print(f"Last : {quiz_scores[-1]}")
print(f"[1:4] : {quiz_scores[1:4]}")
print(f"[::2] : {quiz_scores[::2]}")  # every other element

# Aggregates
n: int = len(quiz_scores)
mean: float = sum(quiz_scores) / n
print(f"n={n} min={min(quiz_scores)} max={max(quiz_scores)} mean={mean:.1f}")
= copies the reference, not the data. Both names then point to the same list in memory. Confirm the difference between a reference and an independent copy:
# Copy vs reference: a critical distinction
quiz_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]
backup: list[float] = quiz_scores.copy()  # independent copy
ref: list[float] = quiz_scores            # same object!

quiz_scores[0] = 99.0
print("After quiz_scores[0] = 99.0:")
print(f" quiz_scores[0] : {quiz_scores[0]}")
print(f" ref[0] : {ref[0]}")        # also changed: same object
print(f" backup[0] : {backup[0]}")  # unchanged: independent copy
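The table above calls .copy() a shallow copy. The sketch below, using hypothetical nested scores, shows what that means: the outer list is new, but the inner lists are still shared. copy.deepcopy() from the standard library copies every level.

```python
import copy

# One inner list of scores per student (hypothetical data)
cohort_scores: list[list[float]] = [[78.0, 85.5], [92.0, 88.5]]

shallow = cohort_scores.copy()        # new outer list, same inner lists
deep = copy.deepcopy(cohort_scores)   # new outer list AND new inner lists

cohort_scores[0][0] = 99.0            # mutate an inner list

print(shallow[0][0])  # 99.0: the inner list is shared
print(deep[0][0])     # 78.0: fully independent
```

For flat lists of numbers or strings, .copy() is enough; reach for deepcopy() only when the list contains other mutable objects.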
Mutability means a value can be changed after it is created. A list is mutable: you can add, remove, or replace any element at any time, without creating a new list. This is unlike strings and tuples, which are immutable: once created, their contents cannot change.
| Type | Mutable? | What it means |
|---|---|---|
| list | Yes | Change any element, add or remove items freely: scores[0] = 99 |
| str | No | Methods like .upper() return a new string; the original is untouched |
| tuple | No | Elements are fixed at creation and cannot be reassigned |
Because lists are mutable, the methods below modify the original list in place rather than building a new one. Most of them return None; pop() is the exception and returns the item it removed.
scores: list[float] = [85.0, 92.0, 78.0, 65.0, 88.0]

# -- Adding items --
scores.append(95.0)          # add one item to the end [85, 92, 78, 65, 88, 95]
scores.insert(1, 90.0)       # insert 90.0 before index 1
scores.extend([81.5, 76.0])  # add all items from another list

# -- Removing items --
scores.remove(65.0)     # remove first occurrence of 65.0 (raises ValueError if absent)
last = scores.pop()     # remove and return last item
second = scores.pop(1)  # remove and return item at index 1
del scores[0]           # remove item at index 0 (no return value)
# del scores[1:3]       # delete a slice: removes multiple items at once

# -- Membership test --
print(f"95.0 in scores : {95.0 in scores}")  # True / False
print(f"999.0 in scores : {999.0 in scores}")
print(f"scores : {scores}")
print(f"popped : last={last}, second={second}")
A stack is a Last-In, First-Out (LIFO) structure: the last item appended is the first one popped. Lists implement this naturally with append() + pop().
clear() removes all items from the list in place (equivalent to del a[:]).
# List as a LIFO stack: useful for depth-first search, undo history, backtracking
call_stack: list[str] = []

# Push
call_stack.append("load_data")
call_stack.append("clean_data")
call_stack.append("train_model")
print(f"Stack (top is last) : {call_stack}")

# Pop (LIFO order)
while call_stack:
    task = call_stack.pop()
    print(f" Processing: {task}")
print(f"Stack after popping : {call_stack}")

# clear(): empty a list in place (the name 'call_stack' still exists)
call_stack.extend(["task_a", "task_b", "task_c"])
call_stack.clear()
print(f"After clear() : {call_stack}")  # []
Stack (top is last) : ['load_data', 'clean_data', 'train_model']
Processing: train_model
Processing: clean_data
Processing: load_data
Stack after popping : []
After clear() : []
sorted() returns a new sorted list; .sort() modifies the list in place and returns None. Assigning the result of .sort() is a common silent bug:
# sorted() returns a new list; .sort() modifies in place
ascending: list[float] = sorted(scores)
descending: list[float] = sorted(scores, reverse=True)
print(f"asc : {ascending}")
print(f"desc : {descending}")

# Search
print(f"count of 85.0 : {scores.count(85.0)}")
print(f"index of 92.0 : {scores.index(92.0)}")
asc : [78.0, 81.5, 88.0, 92.0, 95.0]
desc : [95.0, 92.0, 88.0, 81.5, 78.0]
count of 85.0 : 0
index of 92.0 : 0
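The silent bug mentioned above, assigning the result of .sort(), looks like this in practice. The list below is a hypothetical stand-in for any list of scores.

```python
exam_scores: list[float] = [88.0, 78.0, 95.0]

# Bug: .sort() sorts in place and returns None, so 'result' is not a list
result = exam_scores.sort()
print(result)       # None
print(exam_scores)  # [78.0, 88.0, 95.0]: the original list was sorted

# Fix: use sorted() when you want a new list back
ranked: list[float] = sorted(exam_scores, reverse=True)
print(ranked)       # [95.0, 88.0, 78.0]
```

No error is raised at the assignment; the failure only appears later, when code tries to index or loop over None.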
Activity 3 - Summarise a Score List
Goal: Given the raw scores below, produce a cleaned, sorted list and a summary string.
4. Collections: Tuple & NamedTuple
A tuple is an ordered, immutable sequence, similar to a list, but its contents are fixed at creation. You cannot add, remove, or change any element.
Immutable means locked. Once you write coords = (1.29, 36.82), those two numbers cannot be replaced. This is intentional: immutability makes tuples safe to use as dictionary keys, pass between functions, and share across threads without risk of accidental modification.
When to use a tuple:
- The number of elements is fixed by design (a coordinate pair is always 2 values)
- Returning multiple values from a function (Python packs them into a tuple)
- You need a hashable key for a dict or set (lists cannot be dict keys)
- Signalling to a reader that this data must not change
Key operations at a glance:
| Operation | Syntax | Notes |
|---|---|---|
| Index | t[i] | Same as list; negative index counts from end |
| Slice | t[start:stop:step] | Returns a new tuple |
| Unpack | a, b, c = t | Assign each element to a name |
| Extended unpack | first, *rest = t | *rest collects remaining into a list |
| Swap | a, b = b, a | Pythonic; no temporary variable needed |
| Length | len(t) | Number of elements |
| Membership | x in t | True / False |
| Count | t.count(x) | Number of occurrences of x |
| Find | t.index(x) | Index of first occurrence of x |
| Concatenate | t1 + t2 | Returns a new, longer tuple |
Key Concept: Ordered & Immutable
Use a tuple for data that must not change: coordinate pairs, database rows, function return values. Annotate the type of each position: tuple[str, int, float].
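Two of the tuple use cases named in this section, multiple return values and hashable dict keys, can be sketched together. The function score_range and the timetable data are hypothetical, and def is used ahead of its formal introduction.

```python
def score_range(values: list[float]) -> tuple[float, float]:
    """Return the lowest and highest value as a (low, high) tuple."""
    return min(values), max(values)  # the comma packs both values into a tuple


low, high = score_range([78.0, 92.0, 85.5])  # unpack the returned tuple
print(low, high)  # 78.0 92.0

# Tuples are hashable, so they can serve as dict keys
room_by_slot: dict[tuple[str, int], str] = {("Mon", 9): "Lab 3A", ("Tue", 14): "Hall B"}
print(room_by_slot[("Mon", 9)])  # Lab 3A

# Lists are mutable and therefore unhashable: using one as a key fails
try:
    bad_lookup = {["Mon", 9]: "Lab 3A"}  # type: ignore[dict-item]
except TypeError as exc:
    print(f"Cannot use a list as a key: {exc}")
```

The compound key ("Mon", 9) avoids having to glue day and hour into a single string.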
typing.NamedTuple adds field names and type hints, giving you a lightweight, typed, self-documenting record that needs no more memory per instance than a plain tuple.
Class syntax note: NamedTuple uses the class keyword. Full class mechanics are covered in Part 3. For now, read class Foo(NamedTuple): as "define a named-tuple type called Foo with these fields."
# Tuple: annotate with the exact types of each position
record: tuple[str, int, float] = ("Alice Kamau", 2024001, 3.95)

# Unpack all elements at once
name, student_id, gpa = record
print(f"{name=} {student_id=} {gpa=}")

# Extended unpacking with *
first, *middle, last = (82.0, 91.5, 74.0, 88.0, 95.5)
print(f"{first=} {middle=} {last=}")
Python’s swap idiom packs two values into a tuple and immediately unpacks them in the opposite order, no temporary variable needed. Tuples also enforce immutability at runtime:
# Pythonic variable swap: no temp variable needed
x, y = "train", "val"
x, y = y, x
print(f"After swap: {x=} {y=}")

# Immutability: tuples cannot be changed after creation
record: tuple[str, int, float] = ("Alice Kamau", 2024001, 3.95)
try:
    record[0] = "Bob"  # type: ignore[index]
except TypeError as exc:
    print(f"Immutable: {exc}")
After swap: x='val' y='train'
Immutable: 'tuple' object does not support item assignment
NamedTuple: Named, Typed Fields
NamedTuple gives a plain tuple field names and type annotations. It uses class syntax (see the note in the section header). For now, read this as “create a named tuple type with these typed fields”:
from typing import NamedTuple


class StudentRecord(NamedTuple):
    """Typed, immutable student record."""

    name: str
    student_id: int
    gpa: float
    major: str = "Undeclared"  # field with default value
Create instances by calling the class like a function. __repr__ is generated automatically: field names appear in the output:
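A minimal sketch of instance creation follows. The StudentRecord definition is repeated so the block runs on its own; the values for alice are taken from the printed output later in this section, while bob and his values are hypothetical and only there to show the default field.

```python
from typing import NamedTuple


class StudentRecord(NamedTuple):
    """Typed, immutable student record."""

    name: str
    student_id: int
    gpa: float
    major: str = "Undeclared"  # field with default value


# Positional arguments, in field order
alice = StudentRecord("Alice Kamau", 2024001, 3.95, "Computer Science")

# Keyword arguments; 'major' is omitted, so the default is used (hypothetical student)
bob = StudentRecord(name="Bob Otieno", student_id=2024002, gpa=3.45)

print(alice)      # the generated __repr__ includes the field names
print(bob.major)  # Undeclared
```

Keyword arguments are longer to type but make each value's meaning obvious at the call site.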
Access fields by name for readability or by index for tuple-compatible tools. _replace() returns a new record with selected fields updated. The original is immutable and unchanged:
# Access by name (readable) or by index (tuple-compatible)
print(f"{alice.name}, GPA: {alice.gpa}")
print(f"By index alice[2]: {alice[2]}")

# _replace() creates a new record with selected fields changed
alice_updated = alice._replace(gpa=3.97)
print(f"Updated: {alice_updated}")

# NamedTuples unpack just like plain tuples
name, sid, gpa, major = alice
print(f"Unpacked: {name}, {major}")
Alice Kamau, GPA: 3.95
By index alice[2]: 3.95
Updated: StudentRecord(name='Alice Kamau', student_id=2024001, gpa=3.97, major='Computer Science')
Unpacked: Alice Kamau, Computer Science
5. Collections: Dict
A dictionary (dict) maps unique keys to values. Think of it as a lookup table: given a key, you get back its associated value in O(1) time on average, which in practice means effectively instantly, regardless of how many entries the dict contains.
Unlike a list (where you access items by numeric position), a dict lets you access data by a meaningful label:
student = {'name': 'Alice', 'gpa': 3.95, 'enrolled': True}
student['gpa']      # 3.95 - by label, not by position
student.get('age')  # None - safe access, no KeyError
When to use a dict:
- Access by name: student record, model config, API response payload
- Counting occurrences: {'cat': 3, 'dog': 1, 'bird': 2}
- Grouping: {course_id: [student, student, ...]}
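The counting and grouping patterns listed above can be sketched with two short loops. The labels and enrollment pairs are hypothetical; .get() and .setdefault() are standard dict methods.

```python
# Counting occurrences: .get(key, 0) supplies a starting count for unseen keys
labels: list[str] = ["cat", "dog", "cat", "bird", "cat", "bird"]
counts: dict[str, int] = {}
for label in labels:
    counts[label] = counts.get(label, 0) + 1
print(counts)  # {'cat': 3, 'dog': 1, 'bird': 2}

# Grouping: .setdefault(key, []) creates the list the first time a key appears
enrollments: list[tuple[str, str]] = [("CS301", "Alice"), ("MA201", "Bob"), ("CS301", "Carol")]
roster: dict[str, list[str]] = {}
for course_id, student in enrollments:
    roster.setdefault(course_id, []).append(student)
print(roster)  # {'CS301': ['Alice', 'Carol'], 'MA201': ['Bob']}
```

Both loops visit each item once, and each dict lookup is constant time on average, so the total work grows linearly with the input.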
Python 3.7+ dicts preserve insertion order: you get keys back in the order you added them.
Key operations at a glance:
| Operation | Syntax | Notes |
|---|---|---|
| Access | d[key] | Raises KeyError if key is missing |
| Safe access | d.get(key, default) | Returns default (or None) if key missing |
| Add / update | d[key] = value | Creates key if absent; overwrites if present |
| Bulk update | d.update(other) | Merge another dict or iterable of pairs |
| Remove | d.pop(key) | Remove and return value; KeyError if absent |
| Remove (safe) | d.pop(key, default) | Returns default instead of raising |
| Delete | del d[key] | Remove key in place; no return value |
| Clear | d.clear() | Remove all pairs; dict remains (now empty) |
| Membership | key in d | Checks keys only, O(1) |
| Keys | d.keys() | Live view of all keys |
| Values | d.values() | Live view of all values |
| Pairs | d.items() | Live view of (key, value) tuples, used in for loops |
| Length | len(d) | Number of key-value pairs |
| Merge (3.9+) | a \| b | New merged dict; right side wins on conflicts |
| Merge in-place | a \|= b | Update a with b in place |
| Copy | d.copy() | Shallow independent copy |
Key Concept: Key-Value Map
A dict maps unique, hashable keys to values. Insertion order is preserved (Python 3.7+). Use dict[str, float] to annotate key and value types.
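The table above describes keys(), values(), and items() as live views. A brief sketch with a hypothetical course dict shows what "live" means: the view reflects later changes, while a list built from it is a frozen snapshot.

```python
course: dict[str, object] = {"code": "CS301", "credits": 3}

keys_view = course.keys()            # live view, not a copy
snapshot = list(course.keys())       # independent list taken right now

course["enrollment"] = 42            # add a key afterwards

print(list(keys_view))  # ['code', 'credits', 'enrollment']: the view sees the new key
print(snapshot)         # ['code', 'credits']: the snapshot does not
```

Take a snapshot with list() when you need to add or remove keys while looping, because changing a dict's size during iteration over a view raises RuntimeError.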
TypedDict (Python 3.8+) defines a typed schema for a dict, essential for model configs and API payloads where every key and its type must be known.
# Course record as a dict
course: dict[str, object] = {
    "code": "CS301",
    "title": "Machine Learning",
    "credits": 3,
    "enrollment": 42,
    "pass_rate": 0.87,
}

# Access: [] raises KeyError on missing key; .get() returns a default
print(course["title"])
print(course.get("lab_room", "TBA"))

# Membership checks keys
print(f'"pass_rate" in course : {"pass_rate" in course}')
print(f'"semester" in course : {"semester" in course}')
Machine Learning
TBA
"pass_rate" in course : True
"semester" in course : False
Modifying a Dict
Dicts are mutable: you can add, change, and remove keys after creation. .pop() removes a key and returns its value. .items() gives (key, value) pairs for iteration (for loops are covered in Part 2):
# Add / update / remove
course["lab_room"] = "Lab 3A"
course.update({"enrollment": 45, "semester": "Fall 2024"})
semester = course.pop("semester")  # remove and return

# Iterate over all key-value pairs
for key, value in course.items():
    print(f" {key:<12} : {value}")
a | b creates a new merged dict; the right-hand side wins on key conflicts. a |= b merges b into a in place. This replaces the older {**a, **b} pattern:
# Python 3.9+ dict merge operator | and |=
# Replaces the older {**a, **b} pattern with a cleaner spelling
default_config: dict[str, object] = {
    "learning_rate": 0.001,
    "epochs": 10,
    "batch_size": 32,
    "optimizer": "adam",
}
run_overrides: dict[str, object] = {
    "epochs": 50,      # override
    "batch_size": 64,  # override
    "dropout": 0.2,    # new key
}

# | creates a NEW merged dict; right side wins on key conflicts
run_config = default_config | run_overrides
print("Merged run config:")
for k, v in run_config.items():
    print(f" {k:<16}: {v}")

# |= updates the dict in place
default_config |= {"weight_decay": 1e-4}
print(f"\ndefault_config after |=: {default_config}")
TypedDict defines which keys a dict must have and the type of each value. It uses class syntax (see section header note). At runtime it is a plain dict with zero overhead. The schema is enforced only by the type checker:
from typing import TypedDict


class ModelConfig(TypedDict):
    learning_rate: float
    epochs: int
    batch_size: int
    optimizer: str


class ExperimentResult(TypedDict):
    run_id: str
    accuracy: float
    val_loss: float
Annotate a variable with your TypedDict class. The type checker flags wrong key names or value types. type(config) at runtime confirms it is simply a dict:
# TypedDict is a plain dict at runtime: no overhead
# ty checks that keys and value types match the schema
config: ModelConfig = {
    "learning_rate": 0.001,
    "epochs": 50,
    "batch_size": 32,
    "optimizer": "adam",
}
result: ExperimentResult = {
    "run_id": "exp-2024-001",
    "accuracy": 0.923,
    "val_loss": 0.218,
}

print(f"Config : {config}")
print(f"Result : {result}")
print(f"Accuracy: {result['accuracy']:.1%}")
print(f"type(config): {type(config)}")
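Because the schema is checked only by the type checker, a dict that violates it still runs. The sketch below repeats the ModelConfig definition so it is self-contained; the wrong value "fast" is a deliberate, hypothetical mistake.

```python
from typing import TypedDict


class ModelConfig(TypedDict):
    learning_rate: float
    epochs: int
    batch_size: int
    optimizer: str


# A type checker would flag learning_rate here; Python itself does not
bad_config: ModelConfig = {
    "learning_rate": "fast",  # type: ignore  # wrong type on purpose
    "epochs": 50,
    "batch_size": 32,
    "optimizer": "adam",
}

print(type(bad_config).__name__)     # dict
print(bad_config["learning_rate"])   # fast: nothing stopped the wrong value
```

If you need validation while the program runs, for example on data read from a file, you have to add explicit checks; TypedDict alone will not reject bad input.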
A set is an unordered collection of unique values. Duplicates are discarded automatically. You never need to deduplicate manually.
Two properties make sets special:
Uniqueness: every value appears at most once, always
O(1) membership testing: on average, x in my_set takes roughly the same time whether the set has 10 or 10,000,000 items. The equivalent x in my_list slows down linearly with the length of the list.
When to use a set:
- Removing duplicates from a list: unique = set(my_list)
- Fast membership check: if label in valid_labels:
- Data pipeline integrity: find overlap or difference between train/val/test IDs
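The difference in lookup cost can be observed with the standard timeit module. This is a rough sketch, not a benchmark: the collection size, the repeat count, and the names ids_list and ids_set are arbitrary choices, and the absolute timings depend on your machine.

```python
import timeit

n = 100_000
ids_list: list[int] = list(range(n))
ids_set: set[int] = set(ids_list)
target = n - 1  # worst case for the list: the last element

# Run each membership test 200 times and report the total elapsed time
list_time = timeit.timeit(lambda: target in ids_list, number=200)
set_time = timeit.timeit(lambda: target in ids_set, number=200)

print(f"list lookup: {list_time:.4f} s")
print(f"set lookup : {set_time:.4f} s")
```

On typical hardware the set lookup should be faster by several orders of magnitude, and the gap widens as n grows.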
Key operations at a glance:
| Operation | Syntax / Method | Notes |
|---|---|---|
| Create | {1, 2, 3} or set(iterable) | {} creates a dict, use set() for empty |
| Add | s.add(x) | No effect if x already present |
| Remove | s.remove(x) | Raises KeyError if x absent |
| Remove (safe) | s.discard(x) | No error if x absent |
| Pop | s.pop() | Remove and return an arbitrary element |
| Clear | s.clear() | Remove all elements |
| Membership | x in s | O(1), instant regardless of set size |
| Length | len(s) | Number of elements |
| Union | s \| t or s.union(t) | All elements from both sets |
| Intersection | s & t or s.intersection(t) | Elements present in both |
| Difference | s - t or s.difference(t) | In s but not in t |
| Symmetric diff | s ^ t or s.symmetric_difference(t) | In one but not both |
| Subset | s <= t or s.issubset(t) | Every element of s is in t |
| Superset | s >= t or s.issuperset(t) | Every element of t is in s |
| Disjoint | s.isdisjoint(t) | No elements in common |
| Immutable copy | frozenset(s) | Immutable set, can be used as a dict key |
Key Concept: Unique Values & O(1) Lookup
A set never stores duplicates and tests membership in constant time. Annotate as set[str]. For an immutable, hashable set that can be used as a dict key, use frozenset.
Common Mistake: {} Is a Dict, Not a Set
empty = {} creates an empty dict. empty = set() creates an empty set. This trips up nearly every Python learner once. Now you know.
# Sets remove duplicates on creation
raw_labels: list[str] = ["cat", "dog", "cat", "bird", "dog", "cat"]
unique_labels: set[str] = set(raw_labels)
print(f"raw    : {raw_labels}")
print(f"unique : {sorted(unique_labels)}")

# O(1) membership test: much faster than list for large collections
valid_formats: set[str] = {"parquet", "csv", "json", "feather"}
print(f"parquet valid : {'parquet' in valid_formats}")
print(f"xlsx valid    : {'xlsx' in valid_formats}")

# Mutation
valid_formats.add("orc")
valid_formats.discard("feather")  # safe: no error if element is absent
print(f"formats : {sorted(valid_formats)}")
Confirm the {} gotcha by running this cell. The type output makes it unmistakable:
# GOTCHA: {} creates a dict, not a set: always use set() for an empty set
empty_dict = {}
empty_set = set()
print(f"type({{}})    : {type(empty_dict)}")
print(f"type(set()) : {type(empty_set)}")
Sets support mathematical operations directly with operators. These are invaluable for data-pipeline integrity checks such as detecting train/validation leakage:
# Set algebra: very common in data pipeline checks
train_ids: set[int] = {101, 102, 103, 104, 105, 106, 107, 108}
val_ids: set[int] = {107, 108, 109, 110}

print(f"Union        : {sorted(train_ids | val_ids)}")
print(f"Intersection : {sorted(train_ids & val_ids)}")
print(f"Difference   : {sorted(train_ids - val_ids)}")
print(f"Sym. diff    : {sorted(train_ids ^ val_ids)}")

# Practical: check for data leakage between splits
leakage: set[int] = train_ids & val_ids
if leakage:
    print(f"\nWARNING: {len(leakage)} IDs in both train and val : data leakage! {leakage}")
else:
    print("\nNo data leakage between splits.")
Union : [101, 102, 103, 104, 105, 106, 107, 108, 109, 110]
Intersection : [107, 108]
Difference : [101, 102, 103, 104, 105, 106]
Sym. diff : [101, 102, 103, 104, 105, 106, 109, 110]
WARNING: 2 IDs in both train and val : data leakage! {107, 108}
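The subset, superset, disjoint, and frozenset rows of the operations table are not exercised by the cells above. A short sketch of them follows, using made-up ID sets (train_split, val_split, and split_sizes are illustrative names, not part of the platform's data):

```python
train_split: set[int] = {101, 102, 103, 104}
val_split: set[int] = {105, 106}
all_split: set[int] = train_split | val_split

print(train_split <= all_split)           # subset: every train ID is in the union
print(all_split >= val_split)             # superset: the union contains every val ID
print(train_split.isdisjoint(val_split))  # True here means no leakage

# frozenset is hashable, so it can serve as a dict key
split_sizes: dict[frozenset[int], int] = {
    frozenset(train_split): len(train_split),
    frozenset(val_split): len(val_split),
}

# Element order does not matter when looking up a frozenset key
print(split_sizes[frozenset({104, 103, 102, 101})])
```

isdisjoint is a convenient one-call alternative to computing the intersection when you only need a yes/no leakage answer.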
7. Standard Library Collections
Python ships with a large collection of ready-to-use modules called the standard library, available without any pip install. The collections module contains specialised containers that solve common data patterns more cleanly than plain list and dict.
Key Concept: Specialised Containers from collections
Three tools from the standard library cover the most common data-science patterns beyond the built-in types:
Counter: count occurrences; perfect for label frequencies and class imbalance checks
defaultdict: group items without writing if key not in d: d[key] = []
deque: O(1) append and pop from both ends; ideal for sliding windows in time series
from collections import Counter

# Class imbalance check using Counter
predicted_labels: list[str] = [
    "pass", "pass", "fail", "pass", "pass", "fail",
    "pass", "pass", "pass", "fail", "pass", "pass",
]
counts: Counter[str] = Counter(predicted_labels)

print(f"All counts : {counts}")
print(f'"pass"     : {counts["pass"]}')
print(f"Unknown    : {counts['unknown']}")  # returns 0, not KeyError
print(f"Top 2      : {counts.most_common(2)}")
All counts : Counter({'pass': 9, 'fail': 3})
"pass" : 9
Unknown : 0
Top 2 : [('pass', 9), ('fail', 3)]
Build a class-distribution report and combine counters from multiple batches using Counter arithmetic: + adds counts key by key, and - subtracts them, dropping any key whose result is zero or negative:
# Class distribution report
total: int = sum(counts.values())
for label, n in counts.most_common():
    print(f"  {label:<8}: {n:2d}/{total} ({n / total:.1%})")

# Counter arithmetic: combine counts from multiple batches
batch_a: Counter[str] = Counter(["pass", "pass", "fail"])
batch_b: Counter[str] = Counter(["fail", "fail", "pass"])
combined = batch_a + batch_b
print(f"\nCombined batches: {combined}")
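The cell above only demonstrates +. As a small sketch of subtraction, assume two hypothetical tallies, expected and observed (the counts are invented for illustration). The - operator keeps only positive results, while the subtract() method keeps zero and negative counts:

```python
from collections import Counter

expected: Counter[str] = Counter({"pass": 5, "fail": 2})
observed: Counter[str] = Counter({"pass": 3, "fail": 2, "late": 1})

# Operator form: zero and negative results are dropped
missing = expected - observed
print(missing)  # Counter({'pass': 2})

# Method form: modifies a copy in place and keeps zero/negative counts
kept = Counter(expected)
kept.subtract(observed)
print(kept["fail"], kept["late"])  # 0 -1
```

Use the operator when you want a clean "what is left over" report, and subtract() when negative balances carry meaning.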
Goal: Use Counter to produce a class distribution report from the labels list below.
labels = ['A','B','A','C','B','A','A','B','C','A','B','A']
# Expected output
A : 6/12 (50.0%) [##############################]
B : 4/12 (33.3%) [####################          ]
C : 2/12 (16.7%) [##########                    ]
Hint: Build the bar with '#' * int(pct * 60), so that the largest class (50%) fills the 30-character bar shown in the expected output.
from collections import Counter

labels: list[str] = ["A", "B", "A", "C", "B", "A", "A", "B", "C", "A", "B", "A"]

# TODO: print a class distribution report
counts: Counter[str] = Counter(labels)
total: int = sum(counts.values())
for label, n in counts.most_common():
    pct = n / total
    bar = ...  # TODO: build the bar string
    print(f"{label} : {n}/{total} ({pct:.1%}) [{bar:<30}]")
defaultdict: Zero-Setup Grouping
defaultdict(factory) calls factory() to create a new default value whenever a missing key is accessed, eliminating the if key not in d: d[key] = [] boilerplate. defaultdict(list) is the standard pattern for grouping (uses a for loop, covered in Part 2):
from collections import defaultdict

students: list[dict[str, object]] = [
    {"name": "Alice", "major": "CS", "gpa": 3.95},
    {"name": "Bob", "major": "Math", "gpa": 3.45},
    {"name": "Carol", "major": "CS", "gpa": 3.88},
    {"name": "Dan", "major": "Math", "gpa": 3.72},
    {"name": "Eve", "major": "CS", "gpa": 3.60},
]

# Group students by major: no 'if key not in d: d[key] = []' needed
by_major: defaultdict[str, list[str]] = defaultdict(list)
for s in students:
    by_major[str(s["major"])].append(str(s["name"]))

print("Students by major:")
for major, names in sorted(by_major.items()):
    print(f"  {major}: {names}")
Students by major:
CS: ['Alice', 'Carol', 'Eve']
Math: ['Bob', 'Dan']
The same pattern works for numeric accumulation. defaultdict(float) starts every new key at 0.0, making sum-per-group pipelines one-liners:
# Accumulate GPA sums per major
gpa_total: defaultdict[str, float] = defaultdict(float)
gpa_count: defaultdict[str, int] = defaultdict(int)
for s in students:
    key = str(s["major"])
    gpa_total[key] += float(s["gpa"])  # type: ignore[arg-type]
    gpa_count[key] += 1

print("Average GPA by major:")
for major in sorted(gpa_total):
    print(f"  {major}: {gpa_total[major] / gpa_count[major]:.2f}")
Average GPA by major:
CS: 3.81
Math: 3.58
deque: Fixed-Size Rolling Buffer
deque(maxlen=N) discards the oldest element automatically when the buffer is full. This is the standard tool for rolling statistics over time-series streams:
from collections import deque

# Rolling mean with maxlen: oldest element auto-discards when full
temperature_readings: list[float] = [
    36.5, 36.7, 37.1, 37.8, 38.2, 38.0, 37.5, 37.2, 36.9, 36.6,
]
WINDOW: int = 3
window: deque[float] = deque(maxlen=WINDOW)
rolling_means: list[float] = []

for reading in temperature_readings:
    window.append(reading)
    if len(window) == WINDOW:
        rolling_means.append(round(sum(window) / WINDOW, 2))

print(f"Readings      : {temperature_readings}")
print(f"Rolling mean-3: {rolling_means}")
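To see the two deque behaviours in isolation, here is a small sketch with invented values: first the O(1) operations at both ends, then the automatic eviction that maxlen provides. The names recent_runs and last_three are illustrative only.

```python
from collections import deque

# Both ends support O(1) insertion and removal
recent_runs: deque[str] = deque(["run-2", "run-3"])
recent_runs.append("run-4")      # add on the right
recent_runs.appendleft("run-1")  # add on the left
print(list(recent_runs))

oldest = recent_runs.popleft()   # remove from the left
newest = recent_runs.pop()       # remove from the right
print(oldest, newest, list(recent_runs))

# With maxlen, appending to a full deque silently drops the oldest item
last_three: deque[int] = deque(maxlen=3)
for step in range(5):
    last_three.append(step)
print(list(last_three))  # [2, 3, 4]
```

A plain list can do the same with insert(0, x) and pop(0), but those shift every element and cost O(n) per call.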
The statistics module computes common descriptive statistics on plain Python lists, no NumPy required. Use it for quick sanity checks and lightweight scripts. For large arrays, NumPy is faster (covered in Part 4).
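As a quick illustration of the kind of sanity check described here, the sketch below calls three functions from the statistics module on a small, made-up list of scores (sample_scores is an illustrative name):

```python
import statistics

sample_scores: list[float] = [82.0, 91.5, 74.0, 88.0, 95.5, 64.0, 79.0]

mean_score = statistics.mean(sample_scores)
median_score = statistics.median(sample_scores)
spread = statistics.stdev(sample_scores)  # sample standard deviation

print(f"mean  : {mean_score:.2f}")
print(f"median: {median_score:.2f}")
print(f"stdev : {spread:.2f}")
```

statistics.stdev uses the sample (n - 1) formula; statistics.pstdev is the population version.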
An operator is a symbol that performs a computation on one or two values. You already know arithmetic operators from mathematics. Python adds several more:
| Category | Operators | Example |
|---|---|---|
| Arithmetic | + - * / // % ** | 7 // 2 → 3 |
| Comparison | == != < > <= >= | score >= 70 → True |
| Logical | and or not | a and b |
| Assignment expression | := (walrus) | if (n := len(data)) > 10: |
Three operator families matter most in data science work: arithmetic, comparison and logical, and the walrus operator := (Python 3.8+).
# Weighted grade calculation
midterm: float = 82.0
final: float = 91.0
project: float = 88.0
weighted_grade = midterm * 0.30 + final * 0.50 + project * 0.20
print(f"Weighted grade: {weighted_grade:.1f}")

# Augmented assignment modifies in place
loss: float = 1.0
for _ in range(6):
    loss *= 0.75
print(f"Loss after 6 decay steps: {loss:.4f}")
Weighted grade: 87.7
Loss after 6 decay steps: 0.1780
/ and // are different operators. This is one of the most common Python gotchas: // floors toward negative infinity, not toward zero:
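A minimal sketch of the difference, using small integers chosen for illustration. Note how the negative case separates flooring from truncation:

```python
true_div = 7 / 2            # 3.5 : true division always returns a float
floor_pos = 7 // 2          # 3   : floor division
floor_neg = -7 // 2         # -4  : floors toward negative infinity
truncated = int(-7 / 2)     # -3  : int() truncates toward zero instead
remainder = -7 % 2          # 1   : remainder takes the sign of the divisor

print(f"{true_div=}  {floor_pos=}  {floor_neg=}  {truncated=}  {remainder=}")

# // and % are defined so that this identity always holds
print((-7 // 2) * 2 + (-7 % 2) == -7)
```

If you need truncation toward zero for negative numbers, use int(a / b) or math.trunc rather than //.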
Comparison operators return bool. Logical operators combine conditions and use short-circuit evaluation: the right side is not evaluated if the left side already determines the result:
# Comparison and logical operators
score: float = 84.5
attendance: int = 90

passes = score >= 70
qualifies = score >= 80 and attendance >= 85  # both must be true
at_risk = score < 60 or attendance < 70       # either triggers
not_pass = not passes
print(f"{passes=}  {qualifies=}  {at_risk=}  {not_pass=}")
Short-circuit evaluation prevents errors such as dividing by zero when a list is empty or missing. Use is/is not to check object identity (same object in memory) and ==/!= to check value equality:
# Short-circuit evaluation: right side is NOT evaluated if left decides outcome
scores: list[float] | None = [82.0, 91.5, 74.0]
mean = scores and sum(scores) / len(scores)  # safe: skips the division if scores is None or empty
print(f"mean (safe): {mean}")

# Identity (is) vs equality (==)
a: list[int] = [1, 2, 3]
b: list[int] = [1, 2, 3]
c: list[int] = a
print(f"a == b : {a == b}")  # True : same values
print(f"a is b : {a is b}")  # False: different objects
print(f"a is c : {a is c}")  # True : same object
mean (safe): 82.5
a == b : True
a is b : False
a is c : True
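One caveat with the x and expr idiom: when x is falsy, the expression evaluates to x itself, so the "mean" becomes None or an empty list rather than a number. A hedged sketch of a more explicit guard follows; safe_mean is a hypothetical helper written for this illustration, not a standard-library function.

```python
from typing import Optional


def safe_mean(values: Optional[list[float]]) -> Optional[float]:
    """Return the mean, or None when there is nothing to average."""
    if not values:  # covers both None and an empty list
        return None
    return sum(values) / len(values)


print(safe_mean([82.0, 91.5, 74.0]))  # 82.5
print(safe_mean([]))                  # None
print(safe_mean(None))                # None

# The bare `and` idiom hands back the falsy left operand unchanged;
# the division on the right is never evaluated
leftover = [] and 1 / 0
print(leftover)  # []
```

Returning None in both empty cases gives callers a single value to check for.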
Walrus Operator := (Python 3.8+)
:= is an assignment expression: unlike = (a statement), it both assigns a value and evaluates to that value. This lets you assign inside a condition, avoiding computing the same value twice:
# Walrus operator := (Python 3.8+)
# Assigns AND returns a value in the same expression.
exam_scores: list[float] = [82.0, 91.5, 74.0, 88.0, 95.5, 64.0, 79.0]

# Without walrus: must store mean manually before using in condition
m = sum(exam_scores) / len(exam_scores)
if m < 80:
    print(f"[without walrus] Cohort average is low: {m:.1f}")

# With walrus: compute once, test, and use: all in one expression
if (m := sum(exam_scores) / len(exam_scores)) < 80:
    print(f"[with walrus] Cohort average is low: {m:.1f}")
:= is especially useful inside while loops that consume a stream and inside comprehensions that need to reuse a computed intermediate value (both covered in Part 2):
# Walrus in a while loop: consume a stream until None sentinel
data_stream = iter([10.0, 20.0, 30.0, None, 40.0])
while (value := next(data_stream, None)) is not None:
    print(f"  Read: {value}")

# Walrus in a comprehension: strip once, reuse stripped value
raw_names: list[str] = [" Alice ", "", " Bob ", " ", " Carol "]
clean_names: list[str] = [stripped for name in raw_names if (stripped := name.strip())]
print(f"Clean names: {clean_names}")