Home/Data Analyst/SQL Fundamentals

Duplicate Rows After JOIN

Duplicate Rows After JOIN

Technical Explanation

One of the most common SQL bugs is getting unexpected duplicate rows after a JOIN. This happens when the JOIN condition produces multiple matches, causing row multiplication.

Why Duplicates Happen

A JOIN produces one row for every matching pair across tables:

Table A (2 rows)  JOIN  Table B (3 rows)
= up to 2 × 3 = 6 rows

If you expect 1-to-1 matches but actually have 1-to-many relationships, you'll get duplicate rows in your result.

Common Causes

Cause Example
Missing relationship in ON clause Joining on partial key
One-to-Many relationship customer → multiple orders
JOINed table has duplicates Lookup table with duplicate keys
Multiple matching rows in subquery Correlated subquery returning many rows

Code Examples

Using the CatCafe dataset:

-- cats: id, name, breed, age
-- adoption_fees: id, breed, fee (one fee per breed)
-- orders: id, customer_id, cat_id, total_amount
-- order_items: id, order_id, cat_id, quantity (multiple cats per order)

-- PROBLEM: One cat has multiple order_items
SELECT
    cats.name,
    orders.id as order_id,
    order_items.quantity
FROM cats
JOIN orders ON cats.id = orders.cat_id
JOIN order_items ON orders.id = order_items.order_id;
-- If one order has 3 cats in order_items, cat appears 3 times!

-- SOLUTION 1: Use DISTINCT
SELECT DISTINCT
    cats.name,
    orders.id as order_id
FROM cats
JOIN orders ON cats.id = orders.cat_id;

-- SOLUTION 2: Aggregate before JOIN
SELECT
    cats.name,
    orders.id as order_id,
    SUM(order_items.quantity) as total_items
FROM cats
JOIN orders ON cats.id = orders.cat_id
JOIN order_items ON orders.id = order_items.order_id
GROUP BY cats.id, cats.name, orders.id;

-- PROBLEM: Accidental cross join
-- (each customer matched with each cat breed)
SELECT
    customers.name,
    cats.name
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN cats;  -- OOPS: No ON clause!
-- Every customer combined with every cat = duplicates!

-- SOLUTION: Always specify JOIN conditions
SELECT
    customers.name,
    cats.name
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN cats ON orders.cat_id = cats.id;  -- Proper relationship

-- Real-world scenario: Orders with their items
-- WRONG: Getting duplicate orders
SELECT
    orders.id,
    orders.order_date,
    order_items.cat_id
FROM orders
JOIN order_items ON orders.id = order_items.order_id;
-- Returns multiple rows per order if order has multiple items

-- RIGHT: Aggregate item details
SELECT
    orders.id,
    orders.order_date,
    COUNT(order_items.id) as item_count,
    SUM(order_items.quantity) as total_quantity
FROM orders
JOIN order_items ON orders.id = order_items.order_id
GROUP BY orders.id, orders.order_date;

The Cat Analogy

You're making a cat show roster:

Correct approach — Each cat listed once:

Whiskers (Siamese) - Show #1
Mochi (Persian) - Show #2
Luna (Tabby) - Show #3

Problem — If you accidentally match each cat to each show:

Whiskers - Show #1
Whiskers - Show #2   ← Duplicate!
Whiskers - Show #3   ← Duplicate!
Mochi - Show #1
Mochi - Show #2      ← Duplicate!
...

This happens when you say "match cats TO shows" but don't specify "which show they were assigned to." You get every possible combination instead of the actual assignments.

In SQL terms, you forgot the ON cats.show_id = shows.id and just did cats JOIN shows.


Diagnosing Duplicates

-- Step 1: Identify which table is causing duplicates
-- Run each JOIN separately:
SELECT cats.name, orders.id FROM cats JOIN orders ON cats.id = orders.cat_id;
-- If cats repeat here, the issue is in orders

-- Step 2: Check for multiple matches
SELECT cat_id, COUNT(*) as match_count
FROM orders
GROUP BY cat_id
HAVING COUNT(*) > 1;
-- Shows which cats have multiple orders

-- Step 3: Use EXPLAIN to see row estimates
EXPLAIN
SELECT ...
FROM cats
JOIN orders ON cats.id = orders.cat_id;
-- Look for unexpected high row estimates

-- Step 4: Deduplicate with DISTINCT or GROUP BY
SELECT DISTINCT cats.name, orders.id ...
-- or
SELECT cats.name, orders.id, ANY_VALUE(orders.order_date) ...
GROUP BY cats.name, orders.id

Common Pitfalls

Assuming 1-to-1 When It's Actually 1-to-Many

-- WRONG: customer has 10 orders, you get 10 rows per customer
SELECT
    customers.name,
    orders.total_amount
FROM customers
JOIN orders ON customers.id = orders.customer_id;
-- Alice appears 10 times if she has 10 orders!

-- RIGHT: Aggregate per customer
SELECT
    customers.name,
    COUNT(orders.id) as order_count,
    SUM(orders.total_amount) as total_spent
FROM customers
LEFT JOIN orders ON customers.id = orders.customer_id
GROUP BY customers.id, customers.name;

JOINing Lookup Tables with Duplicates

-- discount_codes has multiple rows per code (different dates)!
-- WRONG:
SELECT orders.id, orders.total_amount, discount_codes.discount
FROM orders
JOIN discount_codes ON orders.code = discount_codes.code;
-- Same order appears multiple times!

-- RIGHT: Deduplicate lookup first
SELECT orders.id, orders.total_amount, discount_codes.discount
FROM orders
JOIN (
    SELECT DISTINCT code, discount
    FROM discount_codes
) discount_codes ON orders.code = discount_codes.code;

Exercises

Exercise 1

This query returns 150 rows but you expect 50. What might be causing duplicates?

SELECT customers.name, orders.total_amount
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN cats ON orders.cat_id = cats.id;

Exercise 2

Write a query to find customers and their total spending, handling the case where a customer might have multiple orders with the same cat.

Exercise 3

Explain the difference between these two approaches to deduplication:

  • Using DISTINCT in SELECT
  • Using GROUP BY with aggregates

Exercise 4

Write a query that shows each order with its total items, but only one row per order (even if order has multiple items).

Exercise 5

A lookup table cat_breeds has breed names and fur_type. If joining causes duplicates, what's likely wrong with cat_breeds?


Key Takeaways

  • Duplicates after JOIN happen when the ON condition matches multiple rows
  • Check which table(s) have the unexpected multiple matches
  • Use GROUP BY with aggregates to collapse duplicates
  • Use DISTINCT when you need unique pairs without aggregation
  • Always verify cardinality (1:1, 1:many, many:many) before JOINing
  • EXPLAIN can reveal unexpected row multiplication in query plans