Data Analytics Academy

Day 3 of 5 · 4 hours

Python — pandas for analysis {: .hero-day__title }

Excel with code. Load, filter, group, join, and chart 100K rows in a notebook you can re-run tomorrow.

Duration: 4 hours Prerequisites: Days 1–2 Learning goals: by end of day you can load CSVs into pandas DataFrames, filter/group/join them, and make quick plots. Enough to do everything you did in Excel and SQL — faster, repeatable, and on much bigger data.

Important framing: today is not about learning Python the programming language. Today is about pandas — “Excel with code.” We use just enough Python syntax to drive pandas, and skip everything else.

Why pandas after SQL?

SQL is great until you want to:

  • Loop over many similar queries and compose results
  • Run statistical tests, fit a regression, build a quick model
  • Make a chart that’s better than the database GUI’s built-in plotter
  • Save your analysis as a re-runnable notebook

pandas does all of those. And under the hood, every df.groupby(...).agg(...) is a SQL GROUP BY — you already know the concepts.

SQLpandas
SELECT a, b FROM tdf[['a', 'b']]
WHERE a > 10df[df['a'] > 10] or df.query('a > 10')
GROUP BY adf.groupby('a')
SUM(b).agg({'b': 'sum'}) or ['b'].sum()
JOIN ... ONdf1.merge(df2, on='id', how='left')
ORDER BYdf.sort_values('col', ascending=False)
LIMIT 10df.head(10)

Tooling: Jupyter

Jupyter notebooks let you mix code, output, and prose in one document. Two ways to run:

  • Browser, zero install: jupyter.org/try-jupyter/lab → JupyterLite. Works offline-ish, no setup. Recommended for the class so everyone is on the same tooling.
  • Local: install Anaconda, run jupyter lab from a terminal. More powerful but more install pain.

Your instructor will tell you which to use Day 3 morning. Stick with that one.

Agenda

TimeBlockTopicLesson
00:00–00:25Hour 1Notebooks, loading data, inspecting DataFramesLesson 1
00:25–00:55Hour 1Selecting & filteringLesson 2
00:55–01:05Break
01:05–01:35Hour 2Aggregation & groupbyLesson 3
01:35–02:00Hour 2Merge & combineLesson 4
02:00–02:10Break
02:10–02:40Hour 3Cleaning, dates, plotsLesson 5
02:40–02:55Self-test12-question gate-checkSelf-test
02:55–03:00Break
03:00–04:00Hour 4CapstoneCapstone

What you’ll be able to do

By the end of today, given a folder of CSVs and a question, you can:

  • Load each CSV with proper date parsing and audit it with .info()
  • Filter rows with boolean masks or .query()
  • Group and aggregate with named aggregations
  • Merge several tables with row-count sanity checks at each step
  • Clean missing values, compute day differences between dates, and bin continuous values
  • Save the result to a CSV and produce a quick chart for the report

Lessons

Practice is folded into each lesson as collapsible “Try it yourself” boxes — read the concept, attempt the drill, reveal the solution.

Self-test

When you’ve worked through all five lessons, take the 12-question self-test to confirm you’re capstone-ready. ~15 minutes.

Capstone task for today

../../capstone/day3_python/README.md — reproduce yesterday’s two SQL output tables in pandas, then extend with a per-seller late-vs-on-time review analysis. The PNG you save today gets embedded in Day 5’s final report.

Common pitfalls

  • SettingWithCopyWarning — pandas isn’t sure if you’re modifying a view or a copy. Use .loc[…] assignment or .copy() explicitly.
  • Forgetting parse_dates — your “dates” are strings; sorting/filtering doesn’t work as expected.
  • Comparing strings as numbers'10' < '9' is True. Check df.dtypes.
  • Using and/or instead of &/| in boolean indexing — raises a confusing error.
  • Trusting len() after a merge without asserting. Duplicates on the right inflate everything downstream.