Part 16: Git and GitHub for Data Science

DS-MLOps Dev Tools

Python 3.12+ | Author: Anthony Faustine

Before you begin

This chapter assumes you have completed Part 13 through Part 15. The grade-predictor project should have typed, ruff-clean code. This chapter versions that project with git and connects it to GitHub.

If you have used git before, this chapter fills in the DS-specific gaps: what to keep out of git, how to name commits so they serve as a searchable log, and how to structure branches for ML experiments.

Topics covered

Topic	Why it matters
Three-state model	Working tree, staging area, commit history as one mental model
Commits and history	A good commit log is a searchable record of every decision
Branching for ML experiments	Isolate experiments from stable code without copying files
`.gitignore` for DS	Keep data files, model weights, and secrets out of the repo
GitHub remote workflow	Push, pull, and collaborate via pull requests
Conventional Commits	Standardised messages that generate changelogs automatically

Callout markers used throughout this chapter are explained on the book cover page.

Learning Objectives

By the end of Part 16 you will be able to:

#	Skill	Covered in
0	Install git and configure it with your identity and a remote host	Sec. 0
1	Initialize a git repository and write a correct `.gitignore` for a DS project	Sec. 1
2	Understand the three-state model: working tree, staging area, commit history	Sec. 2
3	Write conventional commits that serve as a searchable project log	Sec. 3
4	Use branches to run experiments without breaking working code	Sec. 4
5	Open a pull request and understand what good PR descriptions contain	Sec. 5
6	Write a GitHub Actions workflow that runs tests on every push	Sec. 6

0. Installing and Configuring Git

Install git

Git does not ship with every operating system. Check if it is already present:

git --version
# git version 2.49.0

If it is not installed:

# macOS: via Homebrew (recommended)
brew install git

# macOS alternative: install Xcode Command Line Tools (includes git)
xcode-select --install

# Ubuntu / Debian
sudo apt update && sudo apt install git

# Windows: download the installer from the official site
# https://git-scm.com/download/win
# Git for Windows includes Git Bash, a Unix-style terminal, which is recommended

First-time identity configuration

Every commit records the author’s name and email. Set them once, globally, before making any commit:

git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config --global core.editor "code --wait"   # use VS Code as the commit editor
git config --global init.defaultBranch main

Verify the result:

git config --list --global

Key Concept: Global vs local git config

--global writes to ~/.gitconfig and applies to every repository on the machine. Omitting it writes to .git/config inside the current repository only. Use --global for your identity; use local config to override settings for a specific project (e.g., a work email for a client repo).

Connect to GitHub (or GitLab)

You need an account on a remote hosting service to push code and collaborate. GitHub and GitLab are the two most common choices for DS projects:

	GitHub	GitLab
Free private repos	Yes (unlimited)	Yes (unlimited)
CI/CD included	GitHub Actions	GitLab CI
Package registry	GitHub Packages	GitLab Package Registry
Best for	Open source, community	Self-hosted, enterprise, full DevOps

A common way to authenticate is with an SSH key. Generate one and add it to your account:

# Generate an Ed25519 key (replace with your email)
ssh-keygen -t ed25519 -C "you@example.com"

# Start the SSH agent and add the key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

# Print the public key and copy it into GitHub / GitLab
cat ~/.ssh/id_ed25519.pub

On GitHub: Settings → SSH and GPG keys → New SSH key → paste the public key.

Source: GitHub Docs: Adding a new SSH key

Alternatively, use the GitHub CLI (gh) to authenticate without manually copying keys:

# Install gh
brew install gh           # macOS
sudo apt install gh       # Ubuntu

# Authenticate (follows a browser-based flow)
gh auth login

Test the connection:

ssh -T git@github.com
# Hi username! You've successfully authenticated, but GitHub does not provide shell access.

Activity 0 - Install and Configure git

Goal: Confirm git is installed and configured. Run git --version, set your user.name and user.email, and create a GitHub or GitLab account if you do not already have one. Add an SSH key and confirm the connection with ssh -T git@github.com.

git --version
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
ssh -T git@github.com

1. What Git Tracks and What It Must Not

Git is a time machine for code. It is not a time machine for data, models, or secrets. The single most important setup decision for a DS project is getting .gitignore right before the first commit, because a file committed once stays in git history until that history is rewritten.

A DS .gitignore covers five categories:

# Python runtime
.venv/
__pycache__/
*.pyc
.pytest_cache/
dist/
*.egg-info/

# Secrets: never in git, ever
.env
*.env
.env.local

# Data: too large for git; use cloud storage or DVC
*.csv
*.parquet
*.pkl
*.h5
data/raw/
data/processed/

# Model artifacts
models/
*.pt
*.onnx
*.joblib

# Jupyter
.ipynb_checkpoints/
*-checkpoint.ipynb

# IDE
.idea/
.vscode/settings.json

Add this to grade-predictor/.gitignore before running git init.

Key Concept: Git tracks code. Data and secrets belong elsewhere.

A 2 GB CSV committed by accident is hard to remove cleanly: it lives in history even after git rm, and erasing it fully requires rewriting history. Secrets committed to a public repo are searchable and must be rotated immediately. The safest habit is to set up .gitignore and run git status before the first commit in every new project.
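
To make the pattern rules concrete, the sketch below approximates how a few of the patterns above behave. It is a deliberately simplified matcher written for illustration, not git's actual algorithm: it ignores negation with !, the ** wildcard, and nested .gitignore files. The function name is_ignored and the sample paths are assumptions, not part of grade-predictor.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the patterns from the .gitignore above.
PATTERNS = [".venv/", "__pycache__/", ".env", "*.env", "*.csv",
            "data/raw/", "models/", "*.joblib"]


def is_ignored(path: str, patterns: list[str] = PATTERNS) -> bool:
    """Approximate gitignore matching for three simple pattern shapes only."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            name = pattern.rstrip("/")
            if "/" in name:
                # e.g. "data/raw/": a directory anchored at the repository root
                if path.startswith(name + "/"):
                    return True
            elif name in parts[:-1]:
                # e.g. ".venv/": a directory with that name at any depth
                return True
        elif any(fnmatchcase(part, pattern) for part in parts):
            # e.g. "*.csv": a glob tested against each path component
            return True
    return False


# Hypothetical paths, used only to exercise the sketch.
for candidate in [".env", "data/raw/grades.csv", "notebooks/scores.csv",
                  "src/grade_predictor/core.py"]:
    print(f"{candidate}: {'ignored' if is_ignored(candidate) else 'not ignored'}")
```

In a real repository, git check-ignore -v <path> is the authoritative check: it reports which pattern, if any, matched the path.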

2. The Three-State Model

Every file in a git project lives in one of three states: the working tree (what is on disk right now), the staging area (what will go into the next commit), and the commit history (what is permanent). Understanding these three states is what separates confident git use from “type commands and hope”.

git init
git status                                         # list untracked, modified, and staged files
git add src/grade_predictor/core.py                # move specific file to staging
git status                                         # confirm it moved
git commit -m "feat(core): add compute_grade function"
git log --oneline                                  # inspect history

flowchart LR
    W["Working Tree\non disk"] -->|"git add file"| S["Staging Area\n(index)"]
    S -->|"git commit -m '...'"| H["Commit History\n(permanent)"]
    H -->|"git restore --source=HEAD file"| W
    S -->|"git restore --staged file"| W

    style H fill:#EBF5F0,stroke:#059669,color:#065F46
    style S fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E
    style W fill:#F5F3FF,stroke:#7C3AED,color:#3B0764

The flow: make changes in the working tree, select which changes belong in this commit with git add, then seal them into history with git commit. Changes left unstaged are visible in the working tree but invisible to the next commit.

git diff shows what is in the working tree but not yet staged. git diff --staged shows what is staged but not yet committed. Running both before committing is a habit worth building.
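
One way to internalize the three states is to model them as plain data. The class below is a toy sketch, not how git stores anything: file contents are strings, and ToyRepo and its method names are invented for illustration. It only aims to show which pair of states each diff command compares.

```python
class ToyRepo:
    """A toy model of git's three states; file contents are plain strings."""

    def __init__(self) -> None:
        self.working: dict[str, str] = {}        # working tree: what is on disk
        self.staging: dict[str, str] = {}        # staging area (index): the next commit
        self.history: list[dict[str, str]] = []  # commit history: saved snapshots

    def add(self, path: str) -> None:
        """Like `git add <path>`: copy the on-disk version into the staging area."""
        self.staging[path] = self.working[path]

    def commit(self) -> None:
        """Like `git commit`: snapshot the staging area into history."""
        self.history.append(dict(self.staging))

    def diff(self) -> list[str]:
        """Like `git diff`: tracked paths whose working copy differs from the staged copy."""
        return sorted(p for p, text in self.working.items()
                      if p in self.staging and self.staging[p] != text)

    def diff_staged(self) -> list[str]:
        """Like `git diff --staged`: paths whose staged copy differs from the last commit."""
        head = self.history[-1] if self.history else {}
        return sorted(p for p, text in self.staging.items() if head.get(p) != text)


repo = ToyRepo()
repo.working["core.py"] = "v1"
repo.add("core.py")
print(repo.diff_staged())   # staged but not yet committed
repo.commit()
repo.working["core.py"] = "v2"
print(repo.diff())          # edited on disk but not yet staged
print(repo.diff_staged())   # nothing staged since the last commit
```

In this model, an edit made after add is absent from the next commit until add runs again, which is the behaviour described above for unstaged changes.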

Common Mistake: git add . adds everything

git add . stages every changed file in the current directory, including .env, CSV files, and anything else that slipped past .gitignore. Always run git status before git add. Stage specific files by name: git add src/grade_predictor/core.py. Use git add -p to stage hunks interactively when a file has multiple independent changes.

Activity 1 - First Commit

Goal: Initialize a git repo in grade-predictor. Add .gitignore first. Then run git status and confirm .env and .venv/ do not appear in the untracked list. Stage and commit only the source files.

git init
git status              # .env and .venv should NOT appear here
git add pyproject.toml src/ .gitignore
git commit -m "feat: initial grade-predictor project setup"

3. Conventional Commits

A git log is only useful if the messages are readable. git log --oneline on a project with messages like “fix”, “update”, “wip”, and “changes” tells you nothing. Conventional commits solve this by imposing a consistent format: type(scope): description.

Type	When to use
`feat`	New capability: a new model, a new pipeline step, a new analysis function
`fix`	Corrects a bug: wrong normalization, off-by-one in a split, a missing fillna
`refactor`	Restructures code without changing behavior: extract a function, rename
`test`	Adds or updates tests only
`docs`	Documentation only: docstrings, README, notebook prose
`chore`	Tooling: update dependencies, fix CI, update lockfile
`data`	Data changes: new dataset version, updated schema, changed preprocessing

Real commit messages from a DS project log:

feat(model): add gradient boosting baseline, 5-fold CV accuracy=0.84
fix(preprocessing): normalize features after train/test split, not before
data(university): add 2025 cohort, student_id format unchanged
refactor(core): extract grade_to_letter from compute_grade
test(core): parametrize grade boundary tests for all five letter grades
chore: bump scikit-learn to 1.7, update uv.lock

Each message answers: what changed and why does it matter? Two months from now, git log --oneline with messages like these tells the whole project story.

commitizen enforces this format automatically. See Part 18 for setup. For now, write the messages by hand.
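
While writing messages by hand, a small checker can catch format slips. The sketch below is a minimal validator for the type(scope): description shape, using the types from the table above. The lowercase-only scope rule and the name is_conventional are assumptions of this sketch; it is not how commitizen works and not the full Conventional Commits grammar.

```python
import re

# Types taken from the table above; `data` is this chapter's DS-specific type.
TYPES = ("feat", "fix", "refactor", "test", "docs", "chore", "data")

PATTERN = re.compile(
    rf"^(?P<type>{'|'.join(TYPES)})"   # a known type
    r"(\((?P<scope>[a-z0-9_-]+)\))?"   # optional (scope); lowercase is an assumption
    r": (?P<description>\S.*)$"        # colon, space, non-empty description
)


def is_conventional(message: str) -> bool:
    """Return True if the first line of `message` matches type(scope): description."""
    first_line = message.splitlines()[0] if message else ""
    return PATTERN.match(first_line) is not None


for msg in ["feat(model): add gradient boosting baseline, 5-fold CV accuracy=0.84",
            "chore: bump scikit-learn to 1.7, update uv.lock",
            "wip"]:
    print(f"{msg!r}: {is_conventional(msg)}")
```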

Pro Tip: The scope is optional but valuable

fix: normalize features is fine. fix(preprocessing): normalize features after split is searchable. git log --oneline --grep="fix(preprocessing)" finds every bug fix in the preprocessing layer in one command. This matters when a production issue arrives at 2am and you need to trace exactly when the normalization logic changed.

Activity 2 - Write Three Commits

Goal: Make three separate conventional commits to grade-predictor: one feat for adding a function to core.py, one test for adding a test, and one docs for updating the README. Then run git log --oneline and confirm all three messages follow the format.

4. Branches for Experiments

The main branch of a project should always be in a working state. New features, experiments, and bug fixes belong on separate branches. In DS work, this applies especially to model experiments: a branch per experiment means you can run them in parallel, compare results, and abandon a dead end without any cleanup.

git checkout -b experiment/ridge-regression    # create and switch in one step
# ...make changes...
git add src/
git commit -m "feat(model): ridge regression baseline, RMSE=9.2"

git checkout main
git merge experiment/ridge-regression
git branch -d experiment/ridge-regression      # clean up after merging

gitGraph
    commit id: "feat: initial setup"
    branch experiment/ridge
    checkout experiment/ridge
    commit id: "feat: ridge baseline"
    commit id: "test: parametrize"
    checkout main
    merge experiment/ridge id: "merge: ridge (RMSE=9.2)"
    commit id: "chore: update deps"

Naming conventions that work well for DS projects:

Pattern	Use for
`feature/<name>`	New capabilities: a new pipeline step, a new API
`experiment/<name>`	Model experiments: may be abandoned without guilt
`fix/<name>`	Bug fixes: wrong calculation, broken test
`chore/<name>`	Dependency updates, tooling changes

Pro Tip: Branches track code. MLflow tracks metrics.

A branch captures what code produced a result. MLflow (or a simple results CSV) captures what the result was. Use both together: the branch name references the experiment, the MLflow run name references the same experiment. Two months later you can find both the code and the metric.
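
If MLflow is more than the project needs, the "simple results CSV" option could look like the sketch below. The column layout and the function name log_result are assumptions; the branch name and RMSE value are the ones used earlier in this section.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def log_result(csv_path: Path, branch: str, metric: str, value: float) -> None:
    """Append one experiment result, writing a header row if the file is new."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["date", "branch", "metric", "value"])
        writer.writerow([date.today().isoformat(), branch, metric, value])


# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results.csv"
    log_result(results, "experiment/ridge-regression", "RMSE", 9.2)
    print(results.read_text())
```

Under the .gitignore from Section 1, the *.csv pattern would keep such a file out of git, so it would need either an explicit exception or a home outside the repository.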

Activity 3 - Experiment Branch

Goal: Create a branch experiment/weighted-average. Change the default weights in compute_grade from (0.30, 0.45, 0.25) to (0.25, 0.50, 0.25). Commit the change with a feat(model) message that states the new weights. Merge the branch back to main and delete it.

git checkout -b experiment/weighted-average
# ...edit core.py...
git add src/
git commit -m "feat(model): change weights to 0.25/0.50/0.25, favour final exam"
git checkout main && git merge experiment/weighted-average
git branch -d experiment/weighted-average      # delete the merged branch

5. Pull Requests

A pull request is a request to merge a branch into main, plus a structured conversation about the change. On a solo project, it is still worth opening PRs: the PR description forces you to articulate what changed and why, and the CI check runs automatically.

For DS work, a useful PR description answers:

What changed (one sentence summary)
Why (what problem does this solve or what experiment does this run)
What metrics were observed, if any
What tests were added or updated
Any known limitations

git push -u origin experiment/ridge-regression
gh pr create \
  --title "feat(model): ridge regression baseline" \
  --body "Adds ridge regression (alpha=1.0) as a second baseline. RMSE=9.2 vs linear=10.4 on held-out 20%. Tests added for predict() signature and output shape."

The gh CLI creates the PR without leaving the terminal. Once a workflow such as the one in Section 6 is in place, CI runs automatically on each push and on the pull request.

6. GitHub Actions for Automated Testing

GitHub Actions is a CI/CD platform that runs jobs on every push or pull request. The job definition is a YAML file in .github/workflows/. Here is a minimal workflow for grade-predictor:

# .github/workflows/test.yml
name: Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v3
        with:
          version: "latest"

      - name: Install dependencies
        run: uv sync --extra test

      - name: Run tests
        run: uv run pytest tests/ --override-ini=addopts=
        env:
          PYTHONWARNINGS: ignore

Step by step:

actions/checkout@v4: clones the repo into the CI runner
astral-sh/setup-uv@v3: installs uv (no Python setup needed; uv handles it)
uv sync --extra test: installs core + test dependencies only, no dev or modelling
uv run pytest tests/: runs tests inside the project environment
--override-ini=addopts=: strips local addopts from pyproject.toml that may reference local paths not present in CI

When the job fails, GitHub shows the full log. Reading CI logs is a skill: look for the first FAILED or Error line, not the summary at the bottom.

Example: Reading a CI log

A failing test produces output like:
FAILED tests/test_core.py::test_compute_grade_defaults - AssertionError: assert 83.5 == 84.25

The important parts: the test file (tests/test_core.py), the test name (test_compute_grade_defaults), and the assertion that failed (83.5 != 84.25). The numbers tell you which weights were used: 83.5 is the result with the old weights, and 84.25 is the value the test expects with the new ones. Catching this kind of mismatch is why the test exists.

Activity 4 - First CI Run

Goal: Create .github/workflows/test.yml in grade-predictor. Push the branch to GitHub. Watch the Actions tab. If CI fails, read the log and fix the issue.

mkdir -p .github/workflows
# create test.yml as above
git add .github/
git commit -m "chore: add GitHub Actions test workflow"
git push -u origin main

Capstone - Version grade-predictor

Bring the full grade-predictor project under version control.

Capstone - From Local to GitHub

Add a proper DS .gitignore covering Python, secrets, data, and IDE files
Initialize a git repository: git init
Make three conventional commits: one for the project setup, one for core.py, one for the test file
Create a new repository on GitHub (without README, without .gitignore)
Push: git remote add origin <url> && git push -u origin main
Add the test workflow. Push and confirm CI runs green

Resource	Why it matters
Conventional Commits specification	The format used throughout this book
commitizen documentation	Enforces and automates conventional commits; covered in Part 18
GitHub Actions for Python	Official guide with uv and pytest examples
gitignore.io	Generator for `.gitignore` by language and IDE
DVC documentation	Data Version Control: the git equivalent for datasets and model weights