Home/Data Analyst/SQL Fundamentals

GROUP BY Purpose

GROUP BY Purpose

Technical Explanation

GROUP BY collapses multiple rows into summary rows based on shared values in one or more columns. It changes the granularity of your data from individual records to group-level summaries.

What GROUP BY Does

  1. Partitions rows into groups based on column values
  2. Applies aggregate functions (COUNT, SUM, AVG, MIN, MAX) to each group
  3. Returns one row per group with the aggregated results

The Rule

Any column in the SELECT that is not wrapped in an aggregate function must appear in the GROUP BY clause.

-- Valid: all non-aggregates are in GROUP BY
SELECT city, COUNT(*) as cnt
FROM customers
GROUP BY city;

-- Invalid: name is not aggregated and not in GROUP BY
SELECT name, city, COUNT(*)  -- ERROR: name must be in GROUP BY
FROM customers
GROUP BY city;

How It Works (Execution Order)

1. FROM: Load the table
2. WHERE: Filter rows
3. GROUP BY: Create groups
4. Aggregate: Calculate per-group values
5. SELECT: Return results

Code Examples

Using the CatCafe dataset:

-- Table: orders
-- id, customer_id, cat_id, order_date, total_amount, status

-- Basic GROUP BY: Total sales per customer
SELECT
    customer_id,
    COUNT(*) as order_count,
    SUM(total_amount) as total_spent,
    AVG(total_amount) as avg_order_value
FROM orders
WHERE status = 'completed'
GROUP BY customer_id;

-- Group by multiple columns: Sales per customer per month
SELECT
    customer_id,
    DATE_TRUNC('month', order_date) as month,
    COUNT(*) as orders,
    SUM(total_amount) as revenue
FROM orders
WHERE status = 'completed'
GROUP BY customer_id, DATE_TRUNC('month', order_date)
ORDER BY customer_id, month;

-- GROUP BY with filtering: Top customers by city
SELECT
    customers.city,
    COUNT(DISTINCT customers.id) as customer_count,
    SUM(orders.total_amount) as total_revenue,
    AVG(orders.total_amount) as avg_order
FROM customers
JOIN orders
    ON customers.id = orders.customer_id
WHERE orders.status = 'completed'
GROUP BY customers.city
HAVING SUM(orders.total_amount) > 1000
ORDER BY total_revenue DESC;

-- Practical: Monthly revenue report
SELECT
    DATE_TRUNC('month', order_date) as revenue_month,
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(*) as total_orders,
    SUM(total_amount) as gross_revenue,
    SUM(total_amount) / COUNT(DISTINCT customer_id) as revenue_per_customer
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY revenue_month;

The Cat Analogy

Imagine you're a cat cafe manager taking inventory:

Without GROUP BY — you have a long list:

Cat 1: Tuna, Cat 2: Salmon, Cat 3: Tuna, Cat 4: Chicken, Cat 5: Salmon...

You can see every individual purchase but can't answer "how many of each?"

With GROUP BY — you summarize:

Tuna: 45 cans
Salmon: 38 cans
Chicken: 22 cans

You've changed from "individual items" to "items by category." The grain changed from "one row per purchase" to "one row per food type."

GROUP BY is like categorizing your cats: you go from "individual cats" to "cats by breed" or "cats by age group." The aggregate functions (COUNT, AVG) give you insights about each group.


Common Pitfalls

Mixing Grouped and Non-Grouped Columns

-- WRONG:
SELECT
    customer_id,
    order_id,           -- Not aggregated, not in GROUP BY
    total_amount
FROM orders
GROUP BY customer_id;  -- ERROR

-- If you need order details, either:
-- 1. Don't group (remove GROUP BY)
-- 2. Aggregate the detail (COUNT(order_id))
-- 3. Use window functions instead

Accidental Grouping

-- This groups by customer_id AND order_id (essentially no grouping)
SELECT customer_id, order_id, SUM(total_amount)
FROM orders
GROUP BY customer_id, order_id;
-- Returns same rows as not grouping at all!

Using GROUP BY When You Need Window Functions

-- Get total per customer AND keep individual rows
-- GROUP BY can't do this - you need a window function:
SELECT
    order_id,
    customer_id,
    total_amount,
    SUM(total_amount) OVER (PARTITION BY customer_id) as customer_total
FROM orders;

Exercises

Exercise 1

Using the orders table, write a query that shows:

  • Number of orders per customer
  • Total revenue per customer
  • Average order value per customer Only for customers with more than 3 orders.

Exercise 2

Explain the difference between these two queries:

-- Query A
SELECT COUNT(*) FROM orders;

-- Query B
SELECT COUNT(*), customer_id FROM orders GROUP BY customer_id;

Exercise 3

Write a query to find the most popular cat (by number of orders involving that cat).

Exercise 4

Using the cats table, write a query that shows:

  • Number of cats per breed
  • Average age per breed
  • Only breeds with 3 or more cats

Exercise 5

When would you use DATE_TRUNC('month', order_date) in a GROUP BY instead of just order_date?


Key Takeaways

  • GROUP BY changes data granularity from individual rows to group summaries
  • Every non-aggregated column in SELECT must be in GROUP BY
  • Common aggregates: COUNT, SUM, AVG, MIN, MAX, COUNT(DISTINCT ...)
  • Use HAVING (not WHERE) to filter on aggregate results
  • For running totals or per-row aggregates, consider window functions instead