Part 1: Language Core (Data & Types)


DS-MLOps Python Foundations

Python 3.12+ | Author: Anthony Faustine

Part 1 covers Python’s core data vocabulary: variables, types, strings, the four collection types, the standard library’s extra collections, and operators. All examples come from a single realistic scenario: a university analytics platform that tracks student performance, course enrollment, and model experiment logs.

Part 2 (02-control-flow.ipynb) continues directly from this notebook with control flow and comprehensions. Read it right after this one to complete the language foundation.

Callout markers used throughout this notebook are explained on the book cover page.

Before You Begin

What is Python?

Python is a general-purpose programming language first released in 1991. A programming language is a set of rules for writing instructions a computer can execute. Unlike a spreadsheet, code lets you automate tasks, process millions of data points, and build models that learn from data.

Why Python for data science and AI?

Python was not built for data science. It became the de facto standard because of three compounding advantages:

1. Readable syntax. Python code reads closer to plain English than most mainstream languages, so a data scientist can focus on the algorithm rather than the syntax. A loop such as for score in scores: total += score needs no translation.

2. A world-class numerical ecosystem. The core scientific libraries all expose Python as their primary interface:

Library What it does
NumPy Fast multi-dimensional arrays; the foundation everything else builds on
pandas Tabular data: load, clean, reshape, merge DataFrames
matplotlib / seaborn Visualisation: line charts, heatmaps, histograms
scikit-learn Classical ML: linear models, trees, SVMs, pipelines
PyTorch / JAX Deep learning: neural networks trained on GPU
HuggingFace Transformers Large language models and vision models

Many landmark models of the last decade (ResNet, BERT, GPT, Llama) have reference or widely used open-source implementations in Python. Reproducing or building on that research in practice requires Python.

3. Interactive computing with Jupyter. Jupyter notebooks let you run one cell at a time, see results immediately, and iterate without a compile step. This matches how data exploration actually works: inspect the data, transform it, visualise, repeat.

What is a Jupyter notebook?

This file is a Jupyter notebook: a document that mixes formatted text (like this paragraph) with executable code. It consists of cells:

  • Markdown cells (like this one): formatted text, explanations, tables, equations.
  • Code cells (the grey boxes below): Python code. Press Shift + Enter to run a cell; its output appears directly below.

Always run cells from top to bottom. Later cells often use variables created in earlier ones. If something breaks, use Kernel → Restart & Run All to start fresh.

What we will build together

Every example uses the same scenario: a university analytics platform tracking student scores, course enrollments, and ML experiment logs. The same data structures recur across every section so you can focus on the Python concept, not a new domain each time.

By the end of Part 2 you will have the full language foundation needed to work with real datasets using NumPy and pandas.

Python vs other languages

The best way to appreciate Python’s readability is side-by-side comparison. Here is “print Hello” in three languages:

Java: 5 lines of ceremony for one instruction:

public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello");
    }
}

C++: requires a header and an entry-point function:

#include <iostream>
int main() { std::cout << "Hello\n"; }

Python: one line that reads like English:

print("Hello")

This gap widens as programs grow. A 300-line data pipeline in Python stays readable; the equivalent in Java or C++ is typically longer and harder to navigate. That is why Python dominates exploratory, iterative work like data science and ML.

Pro Tip: Running notebooks in VS Code

You can run every notebook in this book inside VS Code; the IDE Setup tutorial (Part 12) walks you through the setup. VS Code gives you IntelliSense inside cells, a Variable Inspector, and integrated git, all without leaving the editor. If you prefer JupyterLab, skip ahead: every notebook works identically there.

By the end of Part 1 you will be able to:

# Skill Covered in
1 Annotate variables with type hints (list[float], X | None) Sec. 1
2 Apply PEP 8 naming conventions (snake_case, PascalCase, UPPER_SNAKE) Sec. 1.4
3 Clean, parse, and format strings Sec. 2
4 Choose the right collection for any task Sec. 3-7
5 Use dict | merge, TypedDict, and NamedTuple Sec. 5, 4
6 Apply the walrus operator := where it clarifies code Sec. 8

Note on forward references: Sections 2-7 occasionally use for loops and class definitions before they are formally introduced. for loops are covered in Part 2 (02-control-flow.ipynb); classes are covered in Part 3 (03-python-patterns.ipynb). Whenever you see for item in collection: early, read it as “repeat this block once per item.” Full explanations follow in their dedicated sections.

1. Variables, Types & Type Hints

What is a variable?

A variable is a named container that stores a value in your program’s memory. Think of it as a labelled box:

name     ──►  "Alice Kamau"
gpa      ──►  3.85
enrolled ──►  True

You create a variable with the assignment operator =:

name = "Alice Kamau"   # create a box called 'name', put the value in it

⚠️ The = sign in Python means assign (store this value). It is NOT the mathematical equals sign. To check equality, use == (two equals signs).
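The difference between assignment and comparison can be checked directly. A minimal sketch, reusing the gpa value from the diagram above (the names is_same and is_other are illustrative):

```python
gpa = 3.85             # assignment: store 3.85 under the name gpa
is_same = gpa == 3.85  # comparison: evaluates to a bool, gpa is not modified
is_other = gpa == 4.0

print(is_same)   # True
print(is_other)  # False
print(gpa)       # 3.85 (unchanged by the comparisons)
```

A single = never produces True or False; a double == never changes the variable.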

Python’s four core types

Every value has a type: a label describing what kind of data it is:

Type What it stores Examples Real-world use
int Whole numbers 42, 2024001, -7 Student IDs, epoch counts, ranks
float Decimal numbers 3.85, 0.001, 92.3 GPA, learning rate, accuracy
str Text (any characters) 'Alice', "CS301" Names, labels, file paths
bool True or False only True, False “Is enrolled?”, “Did it converge?”

Python figures out the type of every value automatically. You never need to declare it.

Why add type hints?

Without hints, Python happily lets you store the wrong type in a variable:

gpa = 3.85       # float ✓
gpa = "unknown"  # str  - legal but wrong! breaks any later calculation

Type hints are optional annotations that make your intent explicit so that tools can catch mistakes like the one above:

gpa: float = 3.85   # hint says this must be a float

The syntax is name: type = value. Hints are not enforced at runtime: Python will not crash if you violate them. A static type checker such as ty, run from the command line or inside your editor, reports the mismatch before the code ever executes.

Python 3.9+: list[int], dict[str, float] (no imports needed)
Python 3.10+: float | None means “a float, or nothing” (replaces Optional[float])

Key Concept: Type Hints

A type hint annotates what type a variable should hold: name: str = 'Alice'. Hints are read by the type checker (ty) and your editor, not enforced at runtime. Annotate every variable, function parameter, and return value you write.

Start with the simplest possible case: create a few variables and print them. No type hints yet, just the core idea of “give a name to a value”:

# Your first Python variables: no type hints yet
# The = sign puts the value on the right into the name on the left
name = "Alice Kamau"  # text value (str)
score = 87.5  # decimal number (float)
rank = 1  # whole number (int)
enrolled = True  # True or False (bool)

# print() displays a value in the output area below this cell
print(name)
print(score)
print(rank)
print(enrolled)
Alice Kamau
87.5
1
True

Python knows the type of every value. type() reveals it, and isinstance() tests whether a value belongs to a given type. Run this cell to confirm:

# type() tells you what Python has inferred
print(type(name))  # <class 'str'>
print(type(score))  # <class 'float'>
print(type(rank))  # <class 'int'>
print(type(enrolled))  # <class 'bool'>

# Python lets you overwrite a variable with a value of a different type, silently
rank = "first"  # rank was an int, now it's a str: Python allows it
print(f"rank is now a {type(rank).__name__}")  # str!
<class 'str'>
<class 'float'>
<class 'int'>
<class 'bool'>
rank is now a str

That last reassignment (rank = 'first') would silently break any code that later tries to do arithmetic with rank. Type hints guard against this by making your intent explicit, so a type checker can flag the mismatch before the code runs. Now see the same variables with proper annotations:

# --- Student enrollment record ---
student_id: int = 2024001
full_name: str = "Maria Garcia"
gpa: float = 3.85
is_enrolled: bool = True
scholarship_amount: float | None = None  # union type: float or None (Python 3.10+)

print(f"Student : {full_name} (ID: {student_id})")
print(f"GPA     : {gpa}  Enrolled: {is_enrolled}")
print(f"Scholar.: {scholarship_amount}")
Student : Maria Garcia (ID: 2024001)
GPA     : 3.85  Enrolled: True
Scholar.: None

Run this to see Python’s runtime type information. isinstance() is preferred over type() because it handles class hierarchies. bool is a subclass of int, so isinstance(True, int) returns True:

# isinstance() is preferred over type() for checks: handles subclasses
print(f"type(gpa)                      -> {type(gpa)}")
print(f"isinstance(gpa, float)         -> {isinstance(gpa, float)}")
print(f"isinstance(gpa, int | float)  -> {isinstance(gpa, int | float)}")
type(gpa)                      -> <class 'float'>
isinstance(gpa, float)         -> True
isinstance(gpa, int | float)  -> True
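The bool-is-a-subclass-of-int point can be confirmed directly. A small sketch (the variable name flag is illustrative):

```python
flag: bool = True

print(isinstance(flag, int))  # True: bool is a subclass of int
print(type(flag) is int)      # False: the exact type is bool, not int
print(type(flag) is bool)     # True
print(flag + flag)            # 2: True behaves like 1 in arithmetic
```

This is why an exact type() comparison and an isinstance() check can disagree for the same value.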

Python also has a built-in complex number type, used in signal processing and Fourier analysis:

# complex numbers: real + imaginary parts
frequency: complex = 3 + 2j  # j is the imaginary unit
print(f"complex   : {frequency}")
print(f"real part : {frequency.real}")
print(f"imag part : {frequency.imag}")
print(f"magnitude : {abs(frequency):.3f}")  # |z| = sqrt(real² + imag²)
complex   : (3+2j)
real part : 3.0
imag part : 2.0
magnitude : 3.606

Pro Tip: f-string debugging with =

Python 3.8+ added f'{var=}', which prints the variable name and its value in one shot. This is faster to write than print(f'var = {var}') and far more useful during exploration.

# f'{var=}': name + value, invaluable for debugging
loss: float = 0.4231
epoch: int = 12
learning_rate: float = 0.001

print(f"{loss=}")  # loss=0.4231
print(f"{epoch=}")  # epoch=12
print(f"{learning_rate=}")  # learning_rate=0.001
print(f"{loss:.4f}")  # 0.4231  (formatted, no name)
print(f"{loss * 0.9 = }")  # loss * 0.9 = 0.38078999999999996 (expressions too)
loss=0.4231
epoch=12
learning_rate=0.001
0.4231
loss * 0.9 = 0.38078999999999996

Activity 1 - Annotate a Dataset Row

Goal: Replace each ... with the correct type from the table above (int, float, str, bool, or float | None).

How to decide: look at the value on the right of = and ask: “Is it a whole number? A decimal? Text? True/False? Could it be missing?”

Expected: after filling in the hints, your editor should show no type errors.

# TODO: replace each ... with the correct type annotation
course_code: ... = "CS301"
credits: ... = 3
pass_rate: ... = 0.87
instructor: ... = "Dr. Nkosi"
lab_room: ... = None  # lab not yet assigned
is_core_course: ... = True

# When you are done, print each variable with f'{var=}'
print(f"{course_code=}")
print(f"{credits=}")
course_code='CS301'
credits=3

1.4 Naming Conventions (PEP 8)

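PEP 8 (Python Enhancement Proposal 8) is the official style guide for Python code, co-authored by Python’s creator Guido van Rossum together with Barry Warsaw and Alyssa Coghlan. Most serious Python projects follow it, and the linter ruff can check much of it automatically (ruff check .). Ruff’s naming checks live in its pep8-naming (N) rule group, which may need to be enabled in your configuration.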

PEP 8 describes several naming styles; the four below cover everyday code. Each signals a specific role in the language:

snake_case

All lowercase, words joined by underscores. The default style for everything that is not a class or a constant: variables, functions, method names, and module file names.

student_gpa    = 3.85     # variable
learning_rate  = 0.001    # variable
def load_data(): ...      # function name
# module file: data_loader.py

PascalCase (also called UpperCamelCase)

Every word starts with a capital letter; no underscores. Reserved exclusively for class names, NamedTuples, and TypedDicts – anything that defines a new type.

class StudentRecord: ...      # class
class ModelConfig: ...        # class
class ExperimentRow(TypedDict): ...   # TypedDict

UPPER_SNAKE_CASE

All uppercase, words separated by underscores. Use only for module-level constants: values set once, never reassigned. The style signals to every reader: “do not change this.”

MAX_EPOCHS       = 100
BASE_LEARNING_RATE = 0.001
DATASET_PATH     = 'data/students.csv'

_leading_underscore

A single underscore prefix signals that a name is private / internal – an implementation detail not meant to be called from outside the module or class. Python does not enforce this; it is a convention your team respects.

def _validate_scores(scores): ...  # internal helper
_cache: dict[str, float] = {}      # internal state

Key Concept: PEP 8 naming

snake_case for variables & functions  |  PascalCase for classes & types  |  UPPER_SNAKE for constants  |  _leading for internals.
The computer ignores these conventions. Your teammates will not. Run ruff check . to catch violations automatically.

Common Mistake: Mixing styles

StudentGPA = 3.85 looks like a class (PascalCase), not a variable.
LOAD_DATA = lambda: … looks like a constant, not a function.
Misleading names cause bugs that are hard to find. Be consistent.

# snake_case: variables and functions
max_epochs: int = 100
learning_rate: float = 0.001
model_accuracy: float = 0.945
is_converged: bool = False  # bool names read like a yes/no question
student_gpa_scores: list[float] = [3.95, 3.45, 3.88]

# UPPER_SNAKE_CASE: module-level constants
MAX_BATCH_SIZE: int = 32
DATASET_PATH: str = "data/students.csv"

# Avoid: cryptic abbreviations
# lr   = 0.001    # unclear: is this learning rate? loss ratio?
# ma   = 0.945    # unclear
# b    = 32       # unclear

# ruff catches naming violations:
#   ruff check tutorials/  -->  E741 Ambiguous variable name: 'l'

print(f"Accuracy: {model_accuracy:.1%}")  # .1% formats as a percentage
print(f"Converged: {is_converged}")
print(f"Dataset: {DATASET_PATH}")
Accuracy: 94.5%
Converged: False
Dataset: data/students.csv

2. Strings & String Methods

A string is any piece of text: a student name, a course code, a log message, a file path. Create one by wrapping text in matching quotes:

name   = 'Alice Kamau'       # single quotes
course = "Machine Learning"  # double quotes - both work identically

Strings are used constantly in data science: reading CSV column headers, cleaning field values, building file paths, and formatting model output. Python provides dozens of built-in methods, no imports needed.

Key Concept: Strings are Immutable Sequences

A str is an ordered, immutable sequence of Unicode characters. Every string method returns a new string; the original is never changed. The handful of methods below covers most of the string work you will encounter in data pipelines.
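Immutability is easy to verify. In this minimal sketch (the course and shouted names are illustrative), a method call leaves the original string untouched and item assignment is rejected:

```python
course: str = "machine learning"
shouted: str = course.upper()  # returns a new string

print(course)   # machine learning  (original untouched)
print(shouted)  # MACHINE LEARNING

try:
    course[0] = "M"  # item assignment is not supported on str
except TypeError as exc:
    error_message = str(exc)
    print(f"Immutable: {exc}")
```

To "change" a string, build a new one and rebind the name, for example course = course.title().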

# f-strings: the standard for formatting output in Python 3.6+
name: str = "Alice Kamau"
score: float = 87.5
rank: int = 3

print(f"Student : {name}")
print(f"Score   : {score:.1f}%")  # one decimal place
print(f"Score   : {score:.0f}%")  # rounded to integer
print(f"Rank    : #{rank:02d}")  # zero-padded two digits
print(f"Pass?   : {'Yes' if score >= 70 else 'No'}")
Student : Alice Kamau
Score   : 87.5%
Score   : 88%
Rank    : #03
Pass?   : Yes

Alignment specifiers ({student:<8}, {s:5.1f}) format values into fixed-width columns. This cell uses a for loop for display; for loops are covered properly in Part 2. Read for student, s in [...]: as “for each (student, score) pair in the list, do this”:

# Alignment: useful for building readable reports
for student, s in [("Alice", 92.1), ("Bob", 74.8), ("Carol", 88.5)]:
    bar = "#" * int(s // 10)
    print(f"{student:<8} {s:5.1f}  {bar}")
Alice     92.1  #########
Bob       74.8  #######
Carol     88.5  ########

Pro Tip: Recognising Older Formatting Styles

You will encounter two older styles in legacy code and tutorials. Know them so you can read them, but write f-strings.

print("Accuracy: %d%%" % 92)    ← %-formatting (Python 2 era, still valid)
print("Accuracy: {}".format(92))    ← .format() (Python 3.0+, more flexible than %)
print(f"Accuracy: {acc}")    ← f-strings (Python 3.6+, fastest and most readable, use this)

Cleaning & Parsing

Real-world data often arrives dirty: extra spaces, inconsistent delimiters, mixed case. strip() + split() is the most common two-step clean-up in any data pipeline:

# Cleaning and parsing: the most common string operations in data work
raw_row: str = "  Alice Kamau , 2024001 , 3.95 , Computer Science  "

# strip() removes leading and trailing whitespace
cleaned: str = raw_row.strip()

# split() on a delimiter returns a list; strip each part too
parts: list[str] = [p.strip() for p in cleaned.split(",")]
name, sid, gpa_str, major = parts

print(f"Name  : {name!r}")
print(f"ID    : {sid}")
print(f"GPA   : {float(gpa_str):.2f}")
print(f"Major : {major}")
Name  : 'Alice Kamau'
ID    : 2024001
GPA   : 3.95
Major : Computer Science

join() is the inverse of split(). It reassembles a list of strings into one string with a chosen separator. replace() and case methods normalise individual field values:

# join() is the inverse of split(): reassemble with a new delimiter
tsv_row: str = "\t".join(parts)
print(f"TSV   : {tsv_row!r}")

# replace(): swap delimiters or fix typos
print(cleaned.replace(",", " |"))

# Case methods
tag: str = "  machine_learning  "
print(tag.strip().replace("_", " ").title())
TSV   : 'Alice Kamau\t2024001\t3.95\tComputer Science'
Alice Kamau  | 2024001  | 3.95  | Computer Science
Machine Learning

Searching & Slicing

Test membership, find positions, and count occurrences, all without writing a loop:

# Searching strings: common in log parsing and feature extraction
log: str = "[ERROR] epoch 42: validation loss exceeded threshold (loss=1.234)"

print(f"starts with [ERROR]  : {log.startswith('[ERROR]')}")
print(f"ends with threshold  : {log.endswith('threshold')}")
print(f'contains "loss"      : {"loss" in log}')
print(f'find "epoch"         : index {log.find("epoch")}')
print(f'count of "e"         : {log.count("e")}')
starts with [ERROR]  : True
ends with threshold  : False
contains "loss"      : True
find "epoch"         : index 8
count of "e"         : 6

String slicing (s[start:stop]) extracts a substring by position, using the same syntax as list slicing. rpartition(sep) splits at the last occurrence of sep and returns (before, sep, after), a concise way to separate a filename from its extension:

log: str = "[ERROR] epoch 42: validation loss exceeded threshold (loss=1.234)"

# Extract structured data from a log line
epoch_part: str = log.split("epoch ")[1].split(":")[0]
print(f"Epoch number : {epoch_part}")

# Slicing: same rules as lists
prefix: str = log[:7]  # '[ERROR]'
body: str = log[8:]  # skip '[ERROR]' (7 characters) plus the space after it
print(f"Prefix : {prefix!r}")
print(f"Body   : {body!r}")

# rpartition(): split at the LAST occurrence of a separator
filename: str = "model_experiment_run_42.parquet"
stem, _, ext = filename.rpartition(".")
print(f"stem={stem!r}  ext={ext!r}")
Epoch number : 42
Prefix : '[ERROR]'
Body   : 'epoch 42: validation loss exceeded threshold (loss=1.234)'
stem='model_experiment_run_42'  ext='parquet'

Activity 2 - Parse a Messy Log Line

Goal: Extract the model name, epoch, and loss value from the raw log string below into typed variables.

raw = '  [INFO]  model=RandomForest | epoch=  5 | train_loss=0.3421 | val_loss=0.3812  '

# Expected
model_name  = 'RandomForest'
epoch       = 5
train_loss  = 0.3421
val_loss    = 0.3812

Hint: Use strip(), split('|'), and split('=').

raw: str = "  [INFO]  model=RandomForest | epoch=  5 | train_loss=0.3421 | val_loss=0.3812  "

# TODO: parse raw into the variables below
model_name: str = ...
epoch: int = ...
train_loss: float = ...
val_loss: float = ...

# Verify
print(f"{model_name=}  {epoch=}  {train_loss=}  {val_loss=}")
model_name=Ellipsis  epoch=Ellipsis  train_loss=Ellipsis  val_loss=Ellipsis

3. Collections: List

A list is Python’s most versatile built-in container: an ordered, mutable sequence of items of any type.

scores: list[float] = [78.0, 85.5, 92.0]   # floats
names: list[str] = ['Alice', 'Bob']        # strings
mixed: list = [42, 'label', True]          # any types (avoid in practice)

When to use a list:

  • Order matters: items have a defined first and last position.
  • You need to add, remove, or change elements after creation.
  • You are collecting results in a loop: training losses, processed records, file paths.

Key operations at a glance:

Operation Syntax Notes
Index a[i] 0-based; negative counts from end
Slice a[start:stop:step] Returns new list; stop is exclusive
Append a.append(x) Add one item to the end
Extend a.extend(iterable) Add all items from another sequence
Insert a.insert(i, x) Insert before index i
Remove a.remove(x) Remove first occurrence of value x
Pop a.pop(i) Remove & return item at index i (default: last)
Delete del a[i] / del a[i:j] Remove item or slice, returns nothing
Clear a.clear() Remove all items (same as del a[:])
Membership x in a Returns True / False
Length len(a) Number of items
Sort a.sort() / sorted(a) In-place vs new list
Count a.count(x) Occurrences of x
Find a.index(x) Position of first x (raises ValueError if absent)
Copy a.copy() Shallow independent copy

Key Concept: Ordered & Mutable

A list maintains insertion order and supports in-place modification. Annotate as list[int] (Python 3.9+, no import needed).
Full reference: docs.python.org: 5.1 More on Lists

Common Mistake: Assignment Is Not a Copy

b = a makes b point to the same list. Mutating b also changes a.
Use b = a.copy() or b = a[:] for an independent copy.

# Quiz scores for a cohort of students
quiz_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]

# Indexing: 0-based; negative index counts from the end
print(f"First  : {quiz_scores[0]}")
print(f"Last   : {quiz_scores[-1]}")
print(f"[1:4]  : {quiz_scores[1:4]}")
print(f"[::2]  : {quiz_scores[::2]}")  # every other element

# Aggregates
n: int = len(quiz_scores)
mean: float = sum(quiz_scores) / n
print(f"n={n}  min={min(quiz_scores)}  max={max(quiz_scores)}  mean={mean:.1f}")
First  : 78.0
Last   : 81.0
[1:4]  : [85.5, 92.0, 88.5]
[::2]  : [78.0, 92.0, 95.0, 81.0]
n=7  min=67.0  max=95.0  mean=83.9

Slicing

A slice extracts a sub-list without modifying the original. The syntax is a[start : stop : step]:

Part Default Meaning
start 0 Index to begin from (inclusive)
stop len(a) Index to stop at (exclusive: this element is NOT included)
step 1 How many positions to advance each time

a = [10, 20, 30, 40, 50]
a[1:4]    # [20, 30, 40]   - stop=4 is excluded
a[:3]     # [10, 20, 30]   - start defaults to 0
a[2:]     # [30, 40, 50]   - stop defaults to end
a[::2]    # [10, 30, 50]   - every second element
a[::-1]   # [50, 40, 30, 20, 10] - reversed
a[:]      # [10, 20, 30, 40, 50] - full copy (shallow)

Slicing never raises an IndexError. Out-of-range start/stop are clamped silently.
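The clamping behaviour can be contrasted with plain indexing, which is not clamped. A minimal sketch reusing the list a from the example above:

```python
a = [10, 20, 30, 40, 50]

print(a[2:100])   # [30, 40, 50]  stop is clamped to len(a)
print(a[-100:2])  # [10, 20]      start is clamped to 0
print(a[10:20])   # []            fully out of range gives an empty list

try:
    a[10]  # plain indexing is NOT clamped
except IndexError as exc:
    index_error = str(exc)
    print(f"IndexError: {exc}")
```

An empty result from a slice is therefore not proof that the indices were valid; check len(a) when that matters.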

quiz_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]

# Basic slices
print(f"First 3     : {quiz_scores[:3]}")  # [78.0, 85.5, 92.0]
print(f"Last 3      : {quiz_scores[-3:]}")  # [95.0, 67.0, 81.0]
print(f"Middle      : {quiz_scores[2:5]}")  # [92.0, 88.5, 95.0]

# Step
print(f"Every 2nd   : {quiz_scores[::2]}")  # [78.0, 92.0, 95.0, 81.0]
print(f"Reversed    : {quiz_scores[::-1]}")

# Shallow copy via slice
copy_via_slice: list[float] = quiz_scores[:]
copy_via_slice[0] = 0.0
print(f"Original[0] : {quiz_scores[0]}")  # unchanged: 78.0
First 3     : [78.0, 85.5, 92.0]
Last 3      : [95.0, 67.0, 81.0]
Middle      : [92.0, 88.5, 95.0]
Every 2nd   : [78.0, 92.0, 95.0, 81.0]
Reversed    : [81.0, 67.0, 95.0, 88.5, 92.0, 85.5, 78.0]
Original[0] : 78.0

Assignment with = copies the reference, not the data. Both names then point to the same list in memory. Confirm the difference between a reference and an independent copy:

# Copy vs reference: a critical distinction
quiz_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]

backup: list[float] = quiz_scores.copy()  # independent copy
ref: list[float] = quiz_scores  # same object!
quiz_scores[0] = 99.0

print("After quiz_scores[0] = 99.0:")
print(f"  quiz_scores[0] : {quiz_scores[0]}")
print(f"  ref[0]         : {ref[0]}")  # also changed: same object
print(f"  backup[0]      : {backup[0]}")  # unchanged: independent copy
After quiz_scores[0] = 99.0:
  quiz_scores[0] : 99.0
  ref[0]         : 99.0
  backup[0]      : 78.0
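The copy made by .copy() or a[:] is shallow: the outer list is new, but nested lists are still shared. A minimal sketch with hypothetical nested data (one inner list of scores per student), using copy.deepcopy from the standard library for a fully independent copy:

```python
import copy

# Hypothetical nested data: one inner list of scores per student
cohort: list[list[float]] = [[78.0, 85.5], [92.0, 88.5]]

shallow = cohort.copy()       # new outer list, same inner lists
deep = copy.deepcopy(cohort)  # new outer list AND new inner lists

cohort[0][0] = 0.0            # mutate an inner list
cohort.append([60.0])         # mutate the outer list

print(shallow)  # [[0.0, 85.5], [92.0, 88.5]]   inner change is visible, append is not
print(deep)     # [[78.0, 85.5], [92.0, 88.5]]  fully independent
```

For flat lists of numbers or strings, a shallow copy is enough; reach for deepcopy only when the elements themselves are mutable.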

Modifying Lists

Mutability means a value can be changed after it is created. A list is mutable: you can add, remove, or replace any element at any time, without creating a new list. This is unlike strings and tuples, which are immutable: once created, their contents cannot change.

Type Mutable? What it means
list Yes Change any element, add or remove items freely: scores[0] = 99
str No Methods like .upper() return a new string; the original is untouched
tuple No Elements are fixed at creation and cannot be reassigned

Because lists are mutable, most of the methods below modify the original list in place and return None, not a new list. The exception is pop(), which also returns the item it removes.
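Returning None is easy to trip over when the result of an in-place method is assigned to a name. A minimal sketch, using a made-up losses list:

```python
losses: list[float] = [0.9, 0.4, 0.6]

result = losses.append(0.3)  # mutates losses, returns None
print(result)                # None
print(losses)                # [0.9, 0.4, 0.6, 0.3]

wrong = losses.sort()                 # common bug: wrong is None, losses is now sorted
right = sorted(losses, reverse=True)  # sorted() returns a new list instead

print(wrong)   # None
print(losses)  # [0.3, 0.4, 0.6, 0.9]
print(right)   # [0.9, 0.6, 0.4, 0.3]
```

If a variable unexpectedly holds None after a list operation, an assignment like wrong = losses.sort() is a likely cause.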

scores: list[float] = [85.0, 92.0, 78.0, 65.0, 88.0]

# -- Adding items --
scores.append(95.0)  # add one item to the end       [85, 92, 78, 65, 88, 95]
scores.insert(1, 90.0)  # insert 90.0 before index 1
scores.extend([81.5, 76.0])  # add all items from another list

# -- Removing items --
scores.remove(65.0)  # remove first occurrence of 65.0 (raises ValueError if absent)
last = scores.pop()  # remove and return last item
second = scores.pop(1)  # remove and return item at index 1
del scores[0]  # remove item at index 0 (no return value)
# del scores[1:3]             # delete a slice: removes multiple items at once

# -- Membership test --
print(f"95.0 in scores   : {95.0 in scores}")  # True / False
print(f"999.0 in scores  : {999.0 in scores}")

print(f"scores : {scores}")
print(f"popped : last={last}, second={second}")
95.0 in scores   : True
999.0 in scores  : False
scores : [92.0, 78.0, 88.0, 95.0, 81.5]
popped : last=76.0, second=90.0

List as a Stack & clear()

A stack is a Last-In, First-Out (LIFO) structure: the last item appended is the first one popped. Lists implement this naturally with append() + pop().

clear() removes all items from the list in place (equivalent to del a[:]).

# List as a LIFO stack: useful for depth-first search, undo history, backtracking
call_stack: list[str] = []

# Push
call_stack.append("load_data")
call_stack.append("clean_data")
call_stack.append("train_model")
print(f"Stack (top is last) : {call_stack}")

# Pop (LIFO order)
while call_stack:
    task = call_stack.pop()
    print(f"  Processing: {task}")

print(f"Stack after popping : {call_stack}")

# clear(): empty a list in place (the name 'call_stack' still exists)
call_stack.extend(["task_a", "task_b", "task_c"])
call_stack.clear()
print(f"After clear()       : {call_stack}")  # []
Stack (top is last) : ['load_data', 'clean_data', 'train_model']
  Processing: train_model
  Processing: clean_data
  Processing: load_data
Stack after popping : []
After clear()       : []

sorted() returns a new sorted list; .sort() modifies the list in place and returns None. Assigning the result of .sort() is a common silent bug:

# sorted() returns a new list; .sort() modifies in place
ascending: list[float] = sorted(scores)
descending: list[float] = sorted(scores, reverse=True)
print(f"asc    : {ascending}")
print(f"desc   : {descending}")

# Search
print(f"count of 85.0 : {scores.count(85.0)}")
print(f"index of 92.0 : {scores.index(92.0)}")
asc    : [78.0, 81.5, 88.0, 92.0, 95.0]
desc   : [95.0, 92.0, 88.0, 81.5, 78.0]
count of 85.0 : 0
index of 92.0 : 0

Activity 3 - Summarise a Score List

Goal: Given the raw scores below, produce a cleaned, sorted list and a summary string.

raw = [91.0, None, 74.5, 88.0, None, 63.0, 95.5, 80.0]

# Expected output
clean = [63.0, 74.5, 80.0, 88.0, 91.0, 95.5]   # sorted, None removed
summary = 'n=6  min=63.0  max=95.5  mean=82.0'

Hint: Filter with a list comprehension, then use sorted().

raw: list[float | None] = [91.0, None, 74.5, 88.0, None, 63.0, 95.5, 80.0]

# TODO: build clean (filtered + sorted) and print summary
clean: list[float] = ...
print(f"clean   : {clean}")
print(f"n={len(clean)}  min={min(clean)}  max={max(clean)}  mean={sum(clean) / len(clean):.1f}")
clean   : Ellipsis
TypeError: object of type 'ellipsis' has no len()

4. Collections: Tuple & NamedTuple

A tuple is an ordered, immutable sequence, similar to a list, but its contents are fixed at creation. You cannot add, remove, or change any element.

Immutable means locked. Once you write coords = (1.29, 36.82), those two numbers cannot be replaced. This is intentional: immutability makes tuples safe to pass between functions and share across threads without risk of accidental modification, and a tuple whose elements are all hashable can serve as a dictionary key.

When to use a tuple:

  • The number of elements is fixed by design (a coordinate pair is always 2 values).
  • Returning multiple values from a function (Python packs them into a tuple).
  • You need a hashable key for a dict or set (lists cannot be dict keys).
  • Signalling to a reader that this data must not change.
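Two of these uses can be sketched briefly: a tuple as a dict key, and a function returning several values at once. The coordinates, campus names, and the score_range helper are hypothetical, chosen only for illustration:

```python
# Hypothetical campus lookup keyed by (latitude, longitude)
campus_by_coords: dict[tuple[float, float], str] = {
    (-1.28, 36.82): "Main Campus",
    (-4.04, 39.67): "Coast Campus",
}
print(campus_by_coords[(-1.28, 36.82)])  # Main Campus

try:
    bad = {[-1.28, 36.82]: "Main Campus"}  # a list is mutable, so it is not hashable
except TypeError as exc:
    key_error = str(exc)
    print(f"TypeError: {exc}")


def score_range(scores: list[float]) -> tuple[float, float]:
    """Return (lowest, highest); Python packs the two values into a tuple."""
    return min(scores), max(scores)


low, high = score_range([78.0, 92.0, 67.0])
print(low, high)  # 67.0 92.0
```

The caller unpacks the returned tuple straight into two names, which keeps the call site readable.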

Key operations at a glance:

Operation Syntax Notes
Index t[i] Same as list; negative index counts from end
Slice t[start:stop:step] Returns a new tuple
Unpack a, b, c = t Assign each element to a name
Extended unpack first, *rest = t *rest collects remaining into a list
Swap a, b = b, a Pythonic; no temporary variable needed
Length len(t) Number of elements
Membership x in t True / False
Count t.count(x) Number of occurrences of x
Find t.index(x) Index of first occurrence of x
Concatenate t1 + t2 Returns a new, longer tuple

Key Concept: Ordered & Immutable

Use a tuple for data that must not change: coordinate pairs, database rows, function return values. Annotate the type of each position: tuple[str, int, float].

typing.NamedTuple adds field names and type hints, giving you a lightweight, typed, self-documenting record that is still an ordinary tuple underneath, with the same per-instance memory footprint as a plain tuple.

Class syntax note: NamedTuple uses the class keyword. Full class mechanics are covered in Part 3. For now, read class Foo(NamedTuple): as “define a named-tuple type called Foo with these fields.”

# Tuple: annotate with the exact types of each position
record: tuple[str, int, float] = ("Alice Kamau", 2024001, 3.95)

# Unpack all elements at once
name, student_id, gpa = record
print(f"{name=}  {student_id=}  {gpa=}")

# Extended unpacking with *
first, *middle, last = (82.0, 91.5, 74.0, 88.0, 95.5)
print(f"{first=}  {middle=}  {last=}")
name='Alice Kamau'  student_id=2024001  gpa=3.95
first=82.0  middle=[91.5, 74.0, 88.0]  last=95.5

Python’s swap idiom packs two values into a tuple and immediately unpacks them in the opposite order, no temporary variable needed. Tuples also enforce immutability at runtime:

# Pythonic variable swap: no temp variable needed
x, y = "train", "val"
x, y = y, x
print(f"After swap: {x=}  {y=}")

# Immutability: tuples cannot be changed after creation
record: tuple[str, int, float] = ("Alice Kamau", 2024001, 3.95)
try:
    record[0] = "Bob"  # type: ignore[index]
except TypeError as exc:
    print(f"Immutable: {exc}")
After swap: x='val'  y='train'
Immutable: 'tuple' object does not support item assignment
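One caveat: immutability applies to the tuple's own slots, not to the objects stored in them. A minimal sketch with a hypothetical (name, scores) record shows that a list inside a tuple can still change, and that such a tuple cannot be hashed:

```python
# Hypothetical record: (name, list of quiz scores)
entry: tuple[str, list[float]] = ("Alice Kamau", [92.0, 88.5])

entry[1].append(95.0)  # allowed: the tuple still holds the same list object
print(entry)           # ('Alice Kamau', [92.0, 88.5, 95.0])

try:
    hash(entry)        # fails: a tuple is hashable only if all its items are
except TypeError as exc:
    hash_error = str(exc)
    print(f"TypeError: {exc}")

# Storing the scores as a tuple makes the whole record hashable
frozen: tuple[str, tuple[float, ...]] = ("Alice Kamau", (92.0, 88.5))
print(hash(frozen) == hash(("Alice Kamau", (92.0, 88.5))))  # True
```

When a record must be usable as a dict key or set member, keep every element immutable as well.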

NamedTuple: Named, Typed Fields

NamedTuple gives a plain tuple field names and type annotations. It uses class syntax (see the note in the section header). For now, read this as “create a named tuple type with these typed fields”:

from typing import NamedTuple


class StudentRecord(NamedTuple):
    """Typed, immutable student record."""

    name: str
    student_id: int
    gpa: float
    major: str = "Undeclared"  # field with default value

Create instances by calling the class like a function. __repr__ is generated automatically, so field names appear in the output:

alice = StudentRecord("Alice Kamau", 2024001, 3.95, "Computer Science")
bob = StudentRecord("Bob Mwangi", 2024002, 3.45)  # uses default major

print(alice)
print(bob)
StudentRecord(name='Alice Kamau', student_id=2024001, gpa=3.95, major='Computer Science')
StudentRecord(name='Bob Mwangi', student_id=2024002, gpa=3.45, major='Undeclared')

Access fields by name for readability or by index for tuple-compatible tools. _replace() returns a new record with selected fields updated. The original is immutable and unchanged:

# Access by name (readable) or by index (tuple-compatible)
print(f"{alice.name}, GPA: {alice.gpa}")
print(f"By index alice[2]: {alice[2]}")

# _replace() creates a new record with selected fields changed
alice_updated = alice._replace(gpa=3.97)
print(f"Updated: {alice_updated}")

# NamedTuples unpack just like plain tuples
name, sid, gpa, major = alice
print(f"Unpacked: {name}, {major}")
Alice Kamau, GPA: 3.95
By index alice[2]: 3.95
Updated: StudentRecord(name='Alice Kamau', student_id=2024001, gpa=3.97, major='Computer Science')
Unpacked: Alice Kamau, Computer Science
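
As a quick check of the immutability claim, the sketch below redefines the same StudentRecord fields so it runs on its own. It shows that _replace() leaves the original record untouched and that assigning to a field fails at runtime. The exact error message varies between Python versions, so it is printed rather than assumed.

```python
from typing import NamedTuple


class StudentRecord(NamedTuple):
    name: str
    student_id: int
    gpa: float
    major: str = "Undeclared"


alice = StudentRecord("Alice Kamau", 2024001, 3.95, "Computer Science")
alice_updated = alice._replace(gpa=3.97)

# The original record still holds the old GPA
print(alice.gpa, alice_updated.gpa)  # 3.95 3.97

# Direct field assignment is rejected at runtime
try:
    alice.gpa = 4.0  # type: ignore
except AttributeError as exc:
    print(f"Immutable: {exc}")

# A NamedTuple is still a tuple, so tuple tools keep working
print(isinstance(alice, tuple), len(alice))  # True 4
```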

5. Collections: Dict

A dictionary (dict) maps unique keys to values. Think of it as a lookup table: given a key, you get back its associated value in O(1) time on average, so the lookup cost does not grow with the number of entries in the dict.

Unlike a list (where you access items by numeric position), a dict lets you access data by a meaningful label:

student = {'name': 'Alice', 'gpa': 3.95, 'enrolled': True}
student['gpa']       # 3.95  - by label, not by position
student.get('age')   # None  - safe access, no KeyError

When to use a dict:
  • Access by name: student record, model config, API response payload
  • Counting occurrences: {'cat': 3, 'dog': 1, 'bird': 2}
  • Grouping: {course_id: [student, student, ...]}

Python 3.7+ dicts preserve insertion order: you get keys back in the order you added them.
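
The counting use case and the insertion-order guarantee can be seen together in a minimal sketch. The labels list and the label_counts name are illustrative, not part of the analytics scenario; the for loop itself is covered in Part 2.

```python
# Counting occurrences with a plain dict and .get()
labels = ["cat", "dog", "cat", "bird", "cat"]
label_counts: dict[str, int] = {}
for label in labels:
    # .get() supplies 0 the first time a label is seen
    label_counts[label] = label_counts.get(label, 0) + 1

print(label_counts)  # {'cat': 3, 'dog': 1, 'bird': 1}

# Keys come back in insertion order (the first time each label appeared)
print(list(label_counts))  # ['cat', 'dog', 'bird']
```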

Key operations at a glance:

| Operation | Syntax | Notes |
| --- | --- | --- |
| Access | d[key] | Raises KeyError if key is missing |
| Safe access | d.get(key, default) | Returns default (or None) if key missing |
| Add / update | d[key] = value | Creates key if absent; overwrites if present |
| Bulk update | d.update(other) | Merge another dict or iterable of pairs |
| Remove | d.pop(key) | Remove and return value; KeyError if absent |
| Remove (safe) | d.pop(key, default) | Returns default instead of raising |
| Delete | del d[key] | Remove key in place; no return value |
| Clear | d.clear() | Remove all pairs; dict remains (now empty) |
| Membership | key in d | Checks keys only, O(1) |
| Keys | d.keys() | Live view of all keys |
| Values | d.values() | Live view of all values |
| Pairs | d.items() | Live view of (key, value) tuples, used in for loops |
| Length | len(d) | Number of key-value pairs |
| Merge (3.9+) | a \| b | New merged dict; right side wins on conflicts |
| Merge in-place | a \|= b | Update a with b in place |
| Copy | d.copy() | Shallow independent copy |
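
Two phrases in the table deserve a concrete look: "live view" and "shallow copy". The sketch below uses a small illustrative dict (the names scores and snapshot are assumptions for this example) to show what each one means in practice.

```python
scores: dict[str, list[float]] = {"alice": [82.0], "bob": [74.0]}

# keys() is a live view: it reflects later changes to the dict
names = scores.keys()
scores["carol"] = [91.5]
print(list(names))  # ['alice', 'bob', 'carol']

# pop() with a default never raises
print(scores.pop("dan", None))  # None

# copy() is shallow: the new dict is independent, but the inner lists are shared
snapshot = scores.copy()
snapshot["eve"] = [88.0]        # adds a key to snapshot only
snapshot["alice"].append(95.0)  # mutates the list both dicts point to
print("eve" in scores)  # False
print(scores["alice"])  # [82.0, 95.0]
```

When the values are themselves mutable and you need full independence, copy.deepcopy() from the standard library is the usual tool.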

Key Concept: Key-Value Map

A dict maps unique, hashable keys to values. Insertion order is preserved (Python 3.7+). Use dict[str, float] to annotate key and value types.

TypedDict (Python 3.8+) defines a typed schema for a dict, essential for model configs and API payloads where every key and its type must be known.

# Course record as a dict
course: dict[str, object] = {
    "code": "CS301",
    "title": "Machine Learning",
    "credits": 3,
    "enrollment": 42,
    "pass_rate": 0.87,
}

# Access: [] raises KeyError on missing key; .get() returns a default
print(course["title"])
print(course.get("lab_room", "TBA"))

# Membership checks keys
print(f'"pass_rate" in course  : {"pass_rate" in course}')
print(f'"semester" in course   : {"semester" in course}')
Machine Learning
TBA
"pass_rate" in course  : True
"semester" in course   : False

Modifying a Dict

Dicts are mutable: you can add, change, and remove keys after creation. .pop() removes a key and returns its value. .items() gives (key, value) pairs for iteration (for loops are covered in Part 2):

# Add / update / remove
course["lab_room"] = "Lab 3A"
course.update({"enrollment": 45, "semester": "Fall 2024"})
semester = course.pop("semester")  # remove and return

# Iterate over all key-value pairs
for key, value in course.items():
    print(f"  {key:<12} : {value}")
  code         : CS301
  title        : Machine Learning
  credits      : 3
  enrollment   : 45
  pass_rate    : 0.87
  lab_room     : Lab 3A

Dict Merge (Python 3.9+)

a | b creates a new merged dict; the right-hand side wins on key conflicts. a |= b merges b into a in place. This replaces the older {**a, **b} pattern:

# Python 3.9+ dict merge operator | and |=
# Replaces the older {**a, **b} pattern: cleaner and easier to read

default_config: dict[str, object] = {
    "learning_rate": 0.001,
    "epochs": 10,
    "batch_size": 32,
    "optimizer": "adam",
}

run_overrides: dict[str, object] = {
    "epochs": 50,  # override
    "batch_size": 64,  # override
    "dropout": 0.2,  # new key
}

# | creates a NEW merged dict; right side wins on key conflicts
run_config = default_config | run_overrides
print("Merged run config:")
for k, v in run_config.items():
    print(f"  {k:<16}: {v}")

# |= updates the dict in place
default_config |= {"weight_decay": 1e-4}
print(f"\ndefault_config after |=: {default_config}")
Merged run config:
  learning_rate   : 0.001
  epochs          : 50
  batch_size      : 64
  optimizer       : adam
  dropout         : 0.2

default_config after |=: {'learning_rate': 0.001, 'epochs': 10, 'batch_size': 32, 'optimizer': 'adam', 'weight_decay': 0.0001}
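
For comparison, here is a minimal sketch that puts the | operator next to the older unpacking pattern on two small illustrative dicts. Both produce the same result and neither modifies its inputs.

```python
defaults = {"epochs": 10, "batch_size": 32}
overrides = {"epochs": 50, "dropout": 0.2}

merged_operator = defaults | overrides        # Python 3.9+
merged_unpacking = {**defaults, **overrides}  # also works on older versions

print(merged_operator == merged_unpacking)  # True
print(merged_operator)  # {'epochs': 50, 'batch_size': 32, 'dropout': 0.2}

# The inputs are untouched
print(defaults)  # {'epochs': 10, 'batch_size': 32}
```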

TypedDict: Typed Schema for a Dict

TypedDict defines which keys a dict must have and the type of each value. It uses class syntax (see the class syntax note in the tuple section). At runtime it is a plain dict with zero overhead. The schema is enforced only by the type checker:

from typing import TypedDict


class ModelConfig(TypedDict):
    learning_rate: float
    epochs: int
    batch_size: int
    optimizer: str


class ExperimentResult(TypedDict):
    run_id: str
    accuracy: float
    val_loss: float

Annotate a variable with your TypedDict class. The type checker flags wrong key names or value types. type(config) at runtime confirms it is simply a dict:

# TypedDict is a plain dict at runtime: no overhead
# ty checks that keys and value types match the schema
config: ModelConfig = {
    "learning_rate": 0.001,
    "epochs": 50,
    "batch_size": 32,
    "optimizer": "adam",
}

result: ExperimentResult = {
    "run_id": "exp-2024-001",
    "accuracy": 0.923,
    "val_loss": 0.218,
}

print(f"Config : {config}")
print(f"Result : {result}")
print(f"Accuracy: {result['accuracy']:.1%}")
print(f"type(config): {type(config)}")
Config : {'learning_rate': 0.001, 'epochs': 50, 'batch_size': 32, 'optimizer': 'adam'}
Result : {'run_id': 'exp-2024-001', 'accuracy': 0.923, 'val_loss': 0.218}
Accuracy: 92.3%
type(config): <class 'dict'>
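
Because the schema is checked only statically, nothing stops bad data at runtime. The sketch below uses a trimmed two-field ModelConfig (an assumption for brevity) and deliberately breaks the schema twice. A type checker should flag both lines, yet Python runs them without complaint.

```python
from typing import TypedDict


class ModelConfig(TypedDict):
    learning_rate: float
    epochs: int


# Wrong value type: flagged by a type checker, accepted by Python
bad_config: ModelConfig = {
    "learning_rate": "fast",  # type: ignore
    "epochs": 50,
}

# Unknown key: also flagged statically, also accepted at runtime
bad_config["optimiser"] = "adam"  # type: ignore

print(bad_config)
print(type(bad_config))  # <class 'dict'>
```

If you need runtime validation, you have to add it yourself or use a validation library. TypedDict alone will not do it.
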
Activity 4 - Merge Experiment Configs

Goal: Use the | operator to produce a final run config where overrides wins on conflicts, then add a run_id key.
base = {'lr': 0.01, 'epochs': 5, 'optimizer': 'sgd'}
overrides = {'lr': 0.001, 'epochs': 20}

# Expected
final = {'lr': 0.001, 'epochs': 20, 'optimizer': 'sgd', 'run_id': 'run-001'}
base: dict[str, object] = {"lr": 0.01, "epochs": 5, "optimizer": "sgd"}
overrides: dict[str, object] = {"lr": 0.001, "epochs": 20}

# TODO: merge and add run_id
final: dict[str, object] = ...
print(f"final: {final}")
final: Ellipsis

6. Collections: Set

A set is an unordered collection of unique values. Duplicates are discarded automatically. You never need to deduplicate manually.

Two properties make sets special:

  1. Uniqueness: every value appears at most once, always
  2. O(1) membership testing: x in my_set takes the same time whether the set has 10 or 10,000,000 items. The equivalent x in my_list slows down linearly.

When to use a set:
  • Removing duplicates from a list: unique = set(my_list)
  • Fast membership check: if label in valid_labels:
  • Data pipeline integrity: find overlap or difference between train/val/test IDs
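
To make the membership-speed difference tangible, the sketch below times the same lookup against a list and a set using the standard library's timeit. The collection size and repeat count are arbitrary choices, and the absolute timings will differ from machine to machine; only the gap between the two matters.

```python
import timeit

n = 100_000
ids_list = list(range(n))
ids_set = set(ids_list)
missing = -1  # worst case for the list: every element is compared

# Run each membership test 100 times and report the total time
list_time = timeit.timeit(lambda: missing in ids_list, number=100)
set_time = timeit.timeit(lambda: missing in ids_set, number=100)

print(f"list: {list_time:.4f}s   set: {set_time:.6f}s")
```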

Key operations at a glance:

| Operation | Syntax / Method | Notes |
| --- | --- | --- |
| Create | {1, 2, 3} or set(iterable) | {} creates a dict, use set() for empty |
| Add | s.add(x) | No effect if x already present |
| Remove | s.remove(x) | Raises KeyError if x absent |
| Remove (safe) | s.discard(x) | No error if x absent |
| Pop | s.pop() | Remove and return an arbitrary element |
| Clear | s.clear() | Remove all elements |
| Membership | x in s | O(1), instant regardless of set size |
| Length | len(s) | Number of elements |
| Union | s \| t or s.union(t) | All elements from both sets |
| Intersection | s & t or s.intersection(t) | Elements present in both |
| Difference | s - t or s.difference(t) | In s but not in t |
| Symmetric diff | s ^ t or s.symmetric_difference(t) | In one but not both |
| Subset | s <= t or s.issubset(t) | Every element of s is in t |
| Superset | s >= t or s.issuperset(t) | Every element of t is in s |
| Disjoint | s.isdisjoint(t) | No elements in common |
| Immutable copy | frozenset(s) | Immutable set, can be used as a dict key |
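
The subset, superset, disjoint, and frozenset rows are easiest to remember with a concrete case. The sketch below checks required columns against available ones and uses a frozenset as a dict key; the column and course names are illustrative.

```python
required_columns: set[str] = {"student_id", "gpa"}
available_columns: set[str] = {"student_id", "name", "gpa", "major"}

print(required_columns <= available_columns)   # True: subset
print(available_columns >= required_columns)   # True: superset
print(required_columns.isdisjoint({"email"}))  # True: nothing shared

# A frozenset is hashable, so it can serve as a dict key
pair_counts: dict[frozenset[str], int] = {}
pair_counts[frozenset({"CS301", "MATH201"})] = 12

# Element order is irrelevant: both spellings are the same key
print(pair_counts[frozenset({"MATH201", "CS301"})])  # 12
```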

Key Concept: Unique Values & O(1) Lookup

A set never stores duplicates and tests membership in constant time. Annotate as set[str]. For an immutable, hashable set that can be used as a dict key, use frozenset.

Common Mistake: {} Is a Dict, Not a Set

empty = {} creates an empty dict.
empty = set() creates an empty set.
This trips up nearly every Python learner once. Now you know.

# Sets remove duplicates on creation
raw_labels: list[str] = ["cat", "dog", "cat", "bird", "dog", "cat"]
unique_labels: set[str] = set(raw_labels)
print(f"raw    : {raw_labels}")
print(f"unique : {sorted(unique_labels)}")

# O(1) membership test: much faster than list for large collections
valid_formats: set[str] = {"parquet", "csv", "json", "feather"}
print(f"parquet valid : {'parquet' in valid_formats}")
print(f"xlsx valid    : {'xlsx' in valid_formats}")

# Mutation
valid_formats.add("orc")
valid_formats.discard("feather")  # safe: no error if element is absent
print(f"formats : {sorted(valid_formats)}")
raw    : ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']
unique : ['bird', 'cat', 'dog']
parquet valid : True
xlsx valid    : False
formats : ['csv', 'json', 'orc', 'parquet']

Confirm the {} gotcha by running this cell. The type output makes it unmistakable:

# GOTCHA: {} creates a dict, not a set: always use set() for an empty set
empty_dict = {}
empty_set = set()
print(f"type({{}})   : {type(empty_dict)}")
print(f"type(set()) : {type(empty_set)}")
type({})   : <class 'dict'>
type(set()) : <class 'set'>

Set Algebra

Sets support mathematical operations directly with operators. These are invaluable for data-pipeline integrity checks such as detecting train/validation leakage:

# Set algebra: very common in data pipeline checks
train_ids: set[int] = {101, 102, 103, 104, 105, 106, 107, 108}
val_ids: set[int] = {107, 108, 109, 110}

print(f"Union        : {sorted(train_ids | val_ids)}")
print(f"Intersection : {sorted(train_ids & val_ids)}")
print(f"Difference   : {sorted(train_ids - val_ids)}")
print(f"Sym. diff    : {sorted(train_ids ^ val_ids)}")

# Practical: check for data leakage between splits
leakage: set[int] = train_ids & val_ids
if leakage:
    print(f"\nWARNING: {len(leakage)} IDs in both train and val : data leakage! {leakage}")
else:
    print("\nNo data leakage between splits.")
Union        : [101, 102, 103, 104, 105, 106, 107, 108, 109, 110]
Intersection : [107, 108]
Difference   : [101, 102, 103, 104, 105, 106]
Sym. diff    : [101, 102, 103, 104, 105, 106, 109, 110]

WARNING: 2 IDs in both train and val : data leakage! {107, 108}
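
The same intersection check extends to three or more splits. The sketch below is one possible way to do it, assuming the splits are held in a dict keyed by split name (the IDs and variable names are made up for illustration). itertools.combinations yields every pair of splits exactly once.

```python
from itertools import combinations

splits: dict[str, set[int]] = {
    "train": {101, 102, 103, 104},
    "val": {104, 105},
    "test": {106, 107},
}

# Compare every pair of splits and record any shared IDs
overlaps: dict[tuple[str, str], set[int]] = {}
for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
    shared = ids_a & ids_b
    if shared:
        overlaps[(name_a, name_b)] = shared

print(overlaps)  # {('train', 'val'): {104}}
```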

7. Standard Library Collections

Python ships with a large collection of ready-to-use modules called the standard library: available without any pip install. The collections module contains specialised containers that solve common data patterns more cleanly than plain list and dict.

Key Concept: Specialised Containers from collections

Three tools from the standard library cover the most common data-science patterns beyond the built-in types:
  • Counter: count occurrences; perfect for label frequencies and class imbalance checks
  • defaultdict: group items without writing if key not in d: d[key] = []
  • deque: O(1) append and pop from both ends; ideal for sliding windows in time series
from collections import Counter

# Class imbalance check using Counter
predicted_labels: list[str] = [
    "pass",
    "pass",
    "fail",
    "pass",
    "pass",
    "fail",
    "pass",
    "pass",
    "pass",
    "fail",
    "pass",
    "pass",
]

counts: Counter[str] = Counter(predicted_labels)
print(f"All counts : {counts}")
print(f'"pass"     : {counts["pass"]}')
print(f"Unknown    : {counts['unknown']}")  # returns 0, not KeyError
print(f"Top 2      : {counts.most_common(2)}")
All counts : Counter({'pass': 9, 'fail': 3})
"pass"     : 9
Unknown    : 0
Top 2      : [('pass', 9), ('fail', 3)]

Build a class-distribution report and combine counters from multiple batches using Counter arithmetic: + adds counts, and - subtracts them, keeping only the results that are positive:

# Class distribution report
total: int = sum(counts.values())
for label, n in counts.most_common():
    print(f"  {label:<8}: {n:2d}/{total} ({n / total:.1%})")

# Counter arithmetic: combine counts from multiple batches
batch_a: Counter[str] = Counter(["pass", "pass", "fail"])
batch_b: Counter[str] = Counter(["fail", "fail", "pass"])
combined = batch_a + batch_b
print(f"\nCombined batches: {combined}")
  pass    :  9/12 (75.0%)
  fail    :  3/12 (25.0%)

Combined batches: Counter({'pass': 3, 'fail': 3})
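
The cell above only exercises +. As a minimal sketch of subtraction, with two made-up counters, the example below contrasts the - operator with the in-place subtract() method, which keeps zero and negative counts.

```python
from collections import Counter

expected = Counter({"pass": 5, "fail": 2})
observed = Counter({"pass": 3, "fail": 4})

# - keeps only positive results: 'fail' (2 - 4 = -2) is dropped
print(expected - observed)  # Counter({'pass': 2})

# subtract() works in place and keeps zero and negative counts
diff = expected.copy()
diff.subtract(observed)
print(diff)  # Counter({'pass': 2, 'fail': -2})
```
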
Activity 5 - Label Frequency Report

Goal: Use Counter to produce a class distribution report from the labels list below.
labels = ['A','B','A','C','B','A','A','B','C','A','B','A']

# Expected output
A : 6/12 (50.0%)  [###############               ]
B : 4/12 (33.3%)  [##########                    ]
C : 2/12 (16.7%)  [#####                         ]

Hint: Build the bar with '#' * int(pct * 30).

from collections import Counter

labels: list[str] = ["A", "B", "A", "C", "B", "A", "A", "B", "C", "A", "B", "A"]

# TODO: print a class distribution report
counts: Counter[str] = Counter(labels)
total: int = sum(counts.values())

for label, n in counts.most_common():
    pct = n / total
    bar = ...  # TODO: build the bar string
    print(f"{label} : {n}/{total} ({pct:.1%})  [{bar:<30}]")

defaultdict: Zero-Setup Grouping

defaultdict(factory) calls factory() to create a new default value whenever a missing key is accessed, eliminating the if key not in d: d[key] = [] boilerplate. defaultdict(list) is the standard pattern for grouping (uses a for loop, covered in Part 2):

from collections import defaultdict

students: list[dict[str, object]] = [
    {"name": "Alice", "major": "CS", "gpa": 3.95},
    {"name": "Bob", "major": "Math", "gpa": 3.45},
    {"name": "Carol", "major": "CS", "gpa": 3.88},
    {"name": "Dan", "major": "Math", "gpa": 3.72},
    {"name": "Eve", "major": "CS", "gpa": 3.60},
]

# Group students by major: no 'if key not in d: d[key] = []' needed
by_major: defaultdict[str, list[str]] = defaultdict(list)
for s in students:
    by_major[str(s["major"])].append(str(s["name"]))

print("Students by major:")
for major, names in sorted(by_major.items()):
    print(f"  {major}: {names}")
Students by major:
  CS: ['Alice', 'Carol', 'Eve']
  Math: ['Bob', 'Dan']

The same pattern works for numeric accumulation. defaultdict(float) starts every new key at 0.0 and defaultdict(int) at 0, so per-group sums and counts need no initialisation step:

# Accumulate GPA sums per major
gpa_total: defaultdict[str, float] = defaultdict(float)
gpa_count: defaultdict[str, int] = defaultdict(int)

for s in students:
    key = str(s["major"])
    gpa_total[key] += float(s["gpa"])  # type: ignore[arg-type]
    gpa_count[key] += 1

print("Average GPA by major:")
for major in sorted(gpa_total):
    print(f"  {major}: {gpa_total[major] / gpa_count[major]:.2f}")
Average GPA by major:
  CS: 3.81
  Math: 3.58
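
One side effect of "calls factory() whenever a missing key is accessed" is easy to overlook: merely reading a missing key with [] inserts it. The sketch below shows this on a small illustrative defaultdict, along with two ways to look without inserting.

```python
from collections import defaultdict

by_major: defaultdict[str, list[str]] = defaultdict(list)
by_major["CS"].append("Alice")

# Reading a missing key with [] calls list() and stores the result
_ = by_major["Physics"]
print(dict(by_major))  # {'CS': ['Alice'], 'Physics': []}

# .get() and `in` do not trigger the factory
print(by_major.get("Math"))  # None
print("Math" in by_major)    # False
```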

deque: Fixed-Size Rolling Buffer

deque(maxlen=N) discards the oldest element automatically when the buffer is full. This is the standard tool for rolling statistics over time-series streams:

from collections import deque

# Rolling mean with maxlen: oldest element auto-discards when full
temperature_readings: list[float] = [
    36.5,
    36.7,
    37.1,
    37.8,
    38.2,
    38.0,
    37.5,
    37.2,
    36.9,
    36.6,
]
WINDOW: int = 3

window: deque[float] = deque(maxlen=WINDOW)
rolling_means: list[float] = []

for reading in temperature_readings:
    window.append(reading)
    if len(window) == WINDOW:
        rolling_means.append(round(sum(window) / WINDOW, 2))

print(f"Readings      : {temperature_readings}")
print(f"Rolling mean-3: {rolling_means}")
Readings      : [36.5, 36.7, 37.1, 37.8, 38.2, 38.0, 37.5, 37.2, 36.9, 36.6]
Rolling mean-3: [36.77, 37.2, 37.7, 38.0, 37.9, 37.57, 37.2, 36.9]
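
To see the automatic eviction on its own, the sketch below prints the buffer after each append, using the first four readings from the cell above.

```python
from collections import deque

window: deque[float] = deque(maxlen=3)
history: list[list[float]] = []

for reading in [36.5, 36.7, 37.1, 37.8]:
    window.append(reading)
    history.append(list(window))  # snapshot of the buffer after each append
    print(list(window))

# [36.5]
# [36.5, 36.7]
# [36.5, 36.7, 37.1]
# [36.7, 37.1, 37.8]   <- 36.5 was dropped to make room
```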

A deque doubles as a FIFO queue. appendleft() and popleft() are O(1), far faster than list.insert(0, ...) and list.pop(0), which are O(n):

from collections import deque

# deque as a task queue: O(1) popleft vs list's O(n)
pipeline: deque[str] = deque(["load_data", "clean", "feature_eng", "train", "evaluate"])
pipeline.appendleft("validate_schema")  # high-priority step prepended

print("Pipeline execution order:")
while pipeline:
    step = pipeline.popleft()
    print(f"  -> {step}")
Pipeline execution order:
  -> validate_schema
  -> load_data
  -> clean
  -> feature_eng
  -> train
  -> evaluate

statistics: Descriptive Stats Without NumPy

The statistics module computes common descriptive statistics on plain Python lists, no NumPy required. Use it for quick sanity checks and lightweight scripts. For large arrays, NumPy is faster (covered in Part 4).

import statistics

exam_scores: list[float] = [72.0, 85.0, 91.0, 68.0, 88.0, 77.0, 94.0, 63.0]

mean = statistics.mean(exam_scores)
median = statistics.median(exam_scores)
stdev = statistics.stdev(exam_scores)  # sample std deviation
pstdev = statistics.pstdev(exam_scores)  # population std deviation
var = statistics.variance(exam_scores)

print(f"n      = {len(exam_scores)}")
print(f"mean   = {mean:.2f}")
print(f"median = {median:.2f}")
print(f"stdev  = {stdev:.2f}   (sample)")
print(f"var    = {var:.2f}")
n      = 8
mean   = 79.75
median = 81.00
stdev  = 11.41   (sample)
var    = 130.21
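
The cell above computes pstdev but never prints it. As a minimal sketch of the difference, the example below recomputes both by hand: the sample version divides the squared deviations by n - 1, the population version by n.

```python
import math
import statistics

exam_scores = [72.0, 85.0, 91.0, 68.0, 88.0, 77.0, 94.0, 63.0]
n = len(exam_scores)
mean = sum(exam_scores) / n
squared_dev = sum((x - mean) ** 2 for x in exam_scores)

sample_sd = math.sqrt(squared_dev / (n - 1))  # matches statistics.stdev
population_sd = math.sqrt(squared_dev / n)    # matches statistics.pstdev

print(f"sample     = {sample_sd:.2f}")      # 11.41
print(f"population = {population_sd:.2f}")  # 10.67
```

Use the sample version when the scores are a sample drawn from a larger group, and the population version when they are the whole group.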

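Use statistics.NormalDist to model the scores as a normal distribution. The z-score measures how many standard deviations a value lies from the mean, and cdf() estimates its percentile rank (an approximation that assumes the scores are roughly normal):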

from statistics import NormalDist

dist = NormalDist(mu=mean, sigma=stdev)

for score in [63.0, 77.0, 94.0]:
    z = (score - mean) / stdev
    pct = dist.cdf(score) * 100  # percentile rank
    print(f"  score={score:5.1f}  z={z:+.2f}  percentile={pct:.1f}%")
  score= 63.0  z=-1.47  percentile=7.1%
  score= 77.0  z=-0.24  percentile=40.5%
  score= 94.0  z=+1.25  percentile=89.4%
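
NormalDist also offers zscore() (Python 3.9+), which matches the manual calculation above, and inv_cdf(), which goes the other way from a percentile to a score. A short sketch on the same exam scores:

```python
from statistics import NormalDist, mean, stdev

exam_scores = [72.0, 85.0, 91.0, 68.0, 88.0, 77.0, 94.0, 63.0]
mu = mean(exam_scores)
sigma = stdev(exam_scores)
dist = NormalDist(mu=mu, sigma=sigma)

# zscore() matches the manual (score - mean) / stdev
z_manual = (94.0 - mu) / sigma
z_method = dist.zscore(94.0)
print(f"{z_manual:+.2f}  {z_method:+.2f}")  # +1.25  +1.25

# inv_cdf(): the score that sits at the 90th percentile of the fitted normal
cutoff_90 = dist.inv_cdf(0.90)
print(f"90th percentile cutoff: {cutoff_90:.1f}")
```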

8. Operators

An operator is a symbol that performs a computation on one or two values. You already know arithmetic operators from mathematics. Python adds several more:

| Category | Operators | Example |
| --- | --- | --- |
| Arithmetic | + - * / // % ** | 7 // 2 → 3 |
| Comparison | == != < > <= >= | score >= 70 → True |
| Logical | and or not | a and b |
| Assignment expression | := (walrus) | if (n := len(data)) > 10: |

Three operator families matter most in data science work: arithmetic, comparison + logical, and the walrus := (Python 3.8+).

# Weighted grade calculation
midterm: float = 82.0
final: float = 91.0
project: float = 88.0

weighted_grade = midterm * 0.30 + final * 0.50 + project * 0.20
print(f"Weighted grade: {weighted_grade:.1f}")

# Augmented assignment: loss *= 0.75 is shorthand for loss = loss * 0.75
loss: float = 1.0
for _ in range(6):
    loss *= 0.75
print(f"Loss after 6 decay steps: {loss:.4f}")
Weighted grade: 87.7
Loss after 6 decay steps: 0.1780

/ and // are different operators. This is one of the most common Python gotchas: // floors toward negative infinity, not toward zero:

# Division: / is always true division; // is floor (rounds toward -inf)
print(f"7 / 2  = {7 / 2}")  # 3.5 : always float
print(f"7 // 2 = {7 // 2}")  # 3   : floor, not truncate
print(f"7 % 2  = {7 % 2}")  # 1   : remainder
print(f"2**10  = {2**10}")  # 1024: exponentiation
print(f"-7//2  = {-7 // 2}")  # -4  : floors TOWARD negative infinity
7 / 2  = 3.5
7 // 2 = 3
7 % 2  = 1
2**10  = 1024
-7//2  = -4
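
Flooring and truncation only differ for negative results, which is why the gotcha is easy to miss. The sketch below puts the two side by side and shows the identity that ties // and % together.

```python
import math

a, b = -7, 2

print(a // b)             # -4: floor division rounds down
print(math.trunc(a / b))  # -3: truncation rounds toward zero
print(a % b)              # 1: the remainder takes the sign of the divisor

# The two operators always satisfy this identity
print((a // b) * b + (a % b) == a)  # True

# divmod() returns quotient and remainder in one call
print(divmod(a, b))  # (-4, 1)
```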

Comparison & Logical Operators

Comparison operators return bool. Logical operators combine conditions and use short-circuit evaluation: the right side is not evaluated if the left side already determines the result:

# Comparison and logical operators
score: float = 84.5
attendance: int = 90

passes = score >= 70
qualifies = score >= 80 and attendance >= 85  # both must be true
at_risk = score < 60 or attendance < 70  # either triggers
not_pass = not passes

print(f"{passes=}  {qualifies=}  {at_risk=}  {not_pass=}")
passes=True  qualifies=True  at_risk=False  not_pass=False

Short-circuit evaluation prevents errors such as dividing by zero when a list is empty or missing. Use is/is not to check object identity (same object in memory) and ==/!= to check value equality:

# Short-circuit evaluation: right side is NOT evaluated if left decides outcome
scores: list[float] | None = [82.0, 91.5, 74.0]
mean = scores and sum(scores) / len(scores)  # safe: skips the division if scores is None or empty
print(f"mean (safe): {mean}")

# Identity (is) vs equality (==)
a: list[int] = [1, 2, 3]
b: list[int] = [1, 2, 3]
c: list[int] = a
print(f"a == b : {a == b}")  # True : same values
print(f"a is b : {a is b}")  # False: different objects
print(f"a is c : {a is c}")  # True : same object
mean (safe): 82.5
a == b : True
a is b : False
a is c : True
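
Short-circuiting is easiest to observe when the operands have a visible side effect. The sketch below uses a hypothetical helper, check(), that records its name each time it runs (function definitions are covered in Part 3), so the calls list reveals which operands were actually evaluated.

```python
calls: list[str] = []


def check(name: str, result: bool) -> bool:
    """Record that this check ran, then return its result."""
    calls.append(name)
    return result


# `and` stops at the first falsy operand: the second check never runs
and_result = check("has_scores", False) and check("mean_ok", True)
print(and_result, calls)  # False ['has_scores']

calls.clear()

# `or` stops at the first truthy operand: the second check never runs
or_result = check("is_admin", True) or check("is_owner", True)
print(or_result, calls)  # True ['is_admin']
```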

Walrus Operator := (Python 3.8+)

:= is an assignment expression: unlike = (a statement), it both assigns a value and evaluates to that value. This lets you assign inside a condition, avoiding computing the same value twice:

# Walrus operator := (Python 3.8+)
# Assigns AND returns a value in the same expression.
exam_scores: list[float] = [82.0, 91.5, 74.0, 88.0, 95.5, 64.0, 79.0]

# Without walrus: must store mean manually before using in condition
m = sum(exam_scores) / len(exam_scores)
if m < 80:
    print(f"[without walrus] Cohort average is low: {m:.1f}")

# With walrus: compute once, test, and use: all in one expression
if (m := sum(exam_scores) / len(exam_scores)) < 80:
    print(f"[with walrus]    Cohort average is low: {m:.1f}")

# Neither branch prints for this data: the mean is 82.0, which is not below 80

:= is especially useful inside while loops that consume a stream and inside comprehensions that need to reuse a computed intermediate value (both covered in Part 2):

# Walrus in a while loop: consume a stream until None sentinel
data_stream = iter([10.0, 20.0, 30.0, None, 40.0])
while (value := next(data_stream, None)) is not None:
    print(f"  Read: {value}")

# Walrus in a comprehension: strip once, reuse stripped value
raw_names: list[str] = ["  Alice  ", "", "  Bob  ", " ", "  Carol  "]
clean_names: list[str] = [stripped for name in raw_names if (stripped := name.strip())]
print(f"Clean names: {clean_names}")
  Read: 10.0
  Read: 20.0
  Read: 30.0
Clean names: ['Alice', 'Bob', 'Carol']

Further Reading

Resource Why it matters
Python Data Model The official spec for __dunder__ methods and how Python objects work under the hood
VanderPlas, J. (2016). Python Data Science Handbook. O’Reilly. Free at jakevdp.github.io/PythonDataScienceHandbook — the NumPy and pandas chapters build directly on this one
Ramalho, L. (2022). Fluent Python, 2nd ed. O’Reilly. Chapter 2 (sequences) and Chapter 3 (dicts and sets) go deeper than any tutorial; the book treats Python as a first-class design language
PEP 572 — Assignment Expressions Background and rationale for the walrus operator (:=) introduced in Python 3.8

Summary

| Concept | Key rule |
| --- | --- |
| Type hints | x: int, list[float], dict[str, int], X \| None, checked by ty but not enforced at runtime |
| f-strings | f'{var=}' for debugging; f'{val:.2f}' for formatting |
| Strings | .strip(), .split(), .join(), .replace() cover most data cleaning |
| list | Ordered, mutable; use .copy() not = when you need independence |
| tuple / NamedTuple | Immutable records; unpack with a, b = t or a, *rest, b = t |
| dict / TypedDict | Key-value; merge with \|; typed schema with TypedDict |
| set | Unique values, O(1) membership; \| union, & intersection, - difference |
| Counter | Frequency counts; .most_common(n) |
| defaultdict | Group items without KeyError; defaultdict(list) |
| deque | Sliding windows; maxlen= auto-drops oldest |
| Walrus := | Assign inside a condition to avoid re-computing |

Next: 02-control-flow.ipynb, covering if/elif/else, match/case, for, while, and comprehensions.