Lesson 3 — Aggregation & groupby
Time: ~30 min. You’ll be able to:
- Use
.value_counts()for instant categorical summaries - Group by one or more columns and compute aggregates with
.agg() - Choose between named aggregation and the older dict syntax (named is preferred)
- Tell
.size()from.count() - Sort and trim grouped results
- Use
.reset_index()to get a flat DataFrame back
You’ve already seen this thinking — twice.
!!! tip “Same idea you used twice already”
Day 1’s pivot tables and Day 2’s GROUP BY are this same operation. Drag a category to Rows + a number to Values in Excel ↔ write GROUP BY cat + SUM(num) in SQL ↔ chain .groupby('cat').agg(total=('num','sum')) in pandas. The mental model is identical; only the syntax changes.
.value_counts() — the one-liner you’ll use forever
orders['order_status'].value_counts()
# delivered 96478
# shipped 1107
# canceled 625
# ...
“For each unique value in this column, how many rows have it?” Sorted descending by count, by default. Two useful kwargs:
orders['order_status'].value_counts(normalize=True) # proportions, not counts
orders['order_status'].value_counts().head(5) # top 5 only
Reach for .value_counts() whenever you’d say “what’s the breakdown of…”. Faster than a groupby for the simple case.
.groupby — the same model as SQL
orders.groupby('order_status').size()
Reads as: “for each value of order_status, count the rows.” Exactly like SQL’s GROUP BY order_status + COUNT(*).
The result is a Series indexed by order_status. To get a DataFrame back, .reset_index():
orders.groupby('order_status').size().reset_index(name='n')
.size() vs .count()
.size()— counts rows in each group (always; ignores NaNs)..count()— counts non-NaN values per column per group.
.size() is what you usually want when you mean “how many rows?”. Use .count() when you specifically need “how many non-missing values of column X?”.
.agg() — multiple aggregates at once
orders.groupby('order_status').agg(
n=('order_id', 'count'),
earliest=('order_purchase_timestamp', 'min'),
latest=('order_purchase_timestamp', 'max'),
)
This is named aggregation — output_name=(input_column, aggregation_function). The output is a DataFrame with one row per group and one column per named aggregate.
Aggregation functions you’ll use most:
| Aggregation | What it does |
|---|---|
'count' | Non-null count |
'size' | Row count (use as a column in named agg by including any column name) |
'sum' | Total |
'mean' | Arithmetic mean |
'median' | 50th percentile |
'min' / 'max' | Extremes |
'std' / 'var' | Standard deviation / variance |
'nunique' | Distinct count (like SQL’s COUNT(DISTINCT)) |
'first' / 'last' | First/last row in the group |
You can also pass a custom function: agg(p10=('price', lambda x: x.quantile(0.10))).
The older dict syntax (you’ll see it in tutorials)
orders.groupby('order_status').agg({
'order_id': 'count',
'order_purchase_timestamp': ['min', 'max'],
})
Works but produces a multi-index in the columns, which is annoying. Prefer named aggregation (the name=(col, fn) form) — flat, readable output.
Multi-column groupby
items.groupby(['seller_id', 'product_id']).agg(
total_revenue=('price', 'sum'),
n_orders=('order_id', 'nunique'),
)
Pass a list of column names. One row per unique combination of group keys.
Sort, then trim
groupby results aren’t sorted by default. To get “top 10 sellers by revenue”:
(
items
.groupby('seller_id')
.agg(revenue=('price', 'sum'))
.sort_values('revenue', ascending=False)
.head(10)
)
Notice the method chain on parentheses-wrapped lines — pandas idiom for legible multi-step operations. Each line is one verb: group, aggregate, sort, slice. Reads top to bottom.
.reset_index() — flatten the result
By default, group keys become the index of the result, not regular columns:
result = orders.groupby('order_status').size()
result.index # the order_status values
result.columns # AttributeError — it's a Series, not a DataFrame
.reset_index() makes them columns again:
result = (
orders.groupby('order_status').size()
.reset_index(name='n')
)
result.columns # ['order_status', 'n']
This matters when you want to merge the result with another DataFrame, or export it to CSV with the group keys as proper columns.
??? note “Try it yourself — group + aggregate on Olist”
Assume orders, customers, items, reviews are loaded.
1. What's the distribution of `review_score` in `reviews`? Use `.value_counts()`.
2. Number of orders per `customer_state` — top 10 only.
3. Total revenue per seller — top 10. *(Revenue = `SUM(items.price)`.)*
4. For each `order_status`, what's the average review score? *(Hint: you'll need a `merge` first — preview of Lesson 4.)*
??? success "Reveal solution"
```python
# 1
reviews['review_score'].value_counts().sort_index()
# 1 11424
# 2 3151
# 3 8179
# 4 19142
# 5 57420
# 2 — join needed because state lives on customers
order_state = (
orders.merge(customers, on='customer_id')
.groupby('customer_state').size()
.sort_values(ascending=False).head(10)
)
# 3
revenue_per_seller = (
items.groupby('seller_id')
.agg(revenue=('price', 'sum'))
.sort_values('revenue', ascending=False).head(10)
)
# 4
status_review = (
orders.merge(reviews, on='order_id')
.groupby('order_status')
.agg(avg_review=('review_score', 'mean'))
.sort_values('avg_review', ascending=False)
)
```
For #1, ~57% of reviews are 5-star and ~12% are 1-star — the same "U-shape" you saw in Day 1's Excel pivot.
Common pitfalls
- Trying to filter group results with
.query()mid-chain. You can — but.query()evaluates against the column names, which are the group-key column + the aggregate names. If the aggregate is unnamed (multiindex), this fails confusingly. Name your aggregates. .count()when you meant.size()..count()ignores NaN, which is sometimes right and sometimes silently misleading.- Forgetting
.reset_index()before merge or export. The group keys are in the index, not columns, andmergewon’t find them. - Lambda aggregates that break vectorisation. Built-in strings (
'mean','sum') are fast. Lambdas run per-group, slower. Fine for one-offs; avoid in tight loops. - Forgetting
observed=Trueon categorical groupings. If your group key is acategorydtype with unused levels, you’ll get empty rows in the result.groupby(..., observed=True)skips them.
How this shows up in the capstone
Both capstone tables — “worst categories” and “risky sellers” — are groupby + named-agg + sort + head. You’ll write the same shape four or five times. Lesson 4 (merge) provides the input; this lesson summarises it.
What’s next
Continue to Lesson 4 — Merge & combine.