Day 3 of 5 · 4 hours

Python — pandas for analysis {: .hero-day__title }

Excel with code. Load, filter, group, join, and chart 100K rows in a notebook you can re-run tomorrow.

Duration: 4 hours Prerequisites: Days 1–2 Learning goals: by end of day you can load CSVs into pandas DataFrames, filter/group/join them, and make quick plots. Enough to do everything you did in Excel and SQL — faster, repeatable, and on much bigger data.

Important framing: today is not about learning Python the programming language. Today is about pandas — “Excel with code.” We use just enough Python syntax to drive pandas, and skip everything else.

Why pandas after SQL?

SQL is great until you want to:

Loop over many similar queries and compose results
Run statistical tests, fit a regression, build a quick model
Make a chart that’s better than the database GUI’s built-in plotter
Save your analysis as a re-runnable notebook

pandas does all of those. And under the hood, every df.groupby(...).agg(...) is a SQL GROUP BY — you already know the concepts.

SQL	pandas
`SELECT a, b FROM t`	`df[['a', 'b']]`
`WHERE a > 10`	`df[df['a'] > 10]` or `df.query('a > 10')`
`GROUP BY a`	`df.groupby('a')`
`SUM(b)`	`.agg({'b': 'sum'})` or `['b'].sum()`
`JOIN ... ON`	`df1.merge(df2, on='id', how='left')`
`ORDER BY`	`df.sort_values('col', ascending=False)`
`LIMIT 10`	`df.head(10)`

Tooling: Jupyter

Jupyter notebooks let you mix code, output, and prose in one document. Two ways to run:

Browser, zero install: jupyter.org/try-jupyter/lab → JupyterLite. Works offline-ish, no setup. Recommended for the class so everyone is on the same tooling.
Local: install Anaconda, run jupyter lab from a terminal. More powerful but more install pain.

Your instructor will tell you which to use Day 3 morning. Stick with that one.

Agenda

Time	Block	Topic	Lesson
00:00–00:25	Hour 1	Notebooks, loading data, inspecting DataFrames	Lesson 1
00:25–00:55	Hour 1	Selecting & filtering	Lesson 2
00:55–01:05	Break
01:05–01:35	Hour 2	Aggregation & groupby	Lesson 3
01:35–02:00	Hour 2	Merge & combine	Lesson 4
02:00–02:10	Break
02:10–02:40	Hour 3	Cleaning, dates, plots	Lesson 5
02:40–02:55	Self-test	12-question gate-check	Self-test
02:55–03:00	Break
03:00–04:00	Hour 4	Capstone	Capstone

What you’ll be able to do

By the end of today, given a folder of CSVs and a question, you can:

Load each CSV with proper date parsing and audit it with .info()
Filter rows with boolean masks or .query()
Group and aggregate with named aggregations
Merge several tables with row-count sanity checks at each step
Clean missing values, compute day differences between dates, and bin continuous values
Save the result to a CSV and produce a quick chart for the report

Lessons

Lesson 01 ~25 min Notebooks & loading data `read_csv` with `parse_dates`, the 30-second inspection routine: `.info()`, `.head()`, `.describe()`. Lesson 02 ~30 min Selecting & filtering Boolean masks, `.query()`, `&`/`|`/`~` with parentheses, taming `SettingWithCopyWarning`. Lesson 03 ~30 min Aggregation & groupby `.value_counts()`, named aggregations, `.size()` vs `.count()`, sort + trim chains. Lesson 04 ~25 min Merge & combine `pd.merge`, the row-count sanity-check pattern, `indicator=True`, when `concat` is the right tool. Lesson 05 ~30 min Cleaning, dates, plots `.isna()`/`.fillna()`, the `.dt`/`.str` accessors, `pd.cut`, quick plots, saving CSV + PNG. Self-test ~15 min Gate-check before the capstone 12 questions covering the five lessons. Reveal each answer only after writing yours down.

Practice is folded into each lesson as collapsible “Try it yourself” boxes — read the concept, attempt the drill, reveal the solution.

Self-test

When you’ve worked through all five lessons, take the 12-question self-test to confirm you’re capstone-ready. ~15 minutes.

Capstone task for today

../../capstone/day3_python/README.md — reproduce yesterday’s two SQL output tables in pandas, then extend with a per-seller late-vs-on-time review analysis. The PNG you save today gets embedded in Day 5’s final report.

Common pitfalls

SettingWithCopyWarning — pandas isn’t sure if you’re modifying a view or a copy. Use .loc[…] assignment or .copy() explicitly.
Forgetting parse_dates — your “dates” are strings; sorting/filtering doesn’t work as expected.
Comparing strings as numbers — '10' < '9' is True. Check df.dtypes.
Using and/or instead of &/| in boolean indexing — raises a confusing error.
Trusting len() after a merge without asserting. Duplicates on the right inflate everything downstream.