DISTINCT Not Always the Solution
DISTINCT Not Always the Solution
Technical Explanation
DISTINCT removes duplicate rows from results, but it's often a band-aid solution that masks data quality problems rather than solving them. Using DISTINCT without understanding why duplicates exist is like silencing an alarm instead of fixing the fire.
When DISTINCT is Correct
| Scenario | Example |
|---|---|
| Counting unique entities | COUNT(DISTINCT customer_id) |
| Getting unique pairs | SELECT DISTINCT city, breed FROM cats |
| After a legitimate operation | Union of two result sets |
When DISTINCT is a Warning Sign
| Problem | DISTINCT Hides |
|---|---|
| Missing JOIN condition | Cartesian product |
| Incorrect relationship | Wrong table relationship |
| Missing WHERE filter | Data you should have excluded |
| Application bug | Duplicate inserts |
Code Examples
Using the CatCafe dataset:
-- PROBLEM: Missing JOIN condition causes duplicates
-- WRONG:
SELECT
c.name,
o.total_amount
FROM customers c, orders o
WHERE c.city = 'Brooklyn';
-- Every customer matched with every order = duplicates!
-- RIGHT: Proper JOIN condition
SELECT
c.name,
o.total_amount
FROM customers c
JOIN orders o ON c.id = o.customer_id -- Proper relationship
WHERE c.city = 'Brooklyn';
-- PROBLEM: Forgetting a filter
-- WRONG: Getting duplicate rows because of multiple statuses
SELECT DISTINCT
customer_id,
order_date
FROM orders;
-- If same customer has completed + pending orders, they appear twice!
-- RIGHT: Filter before selecting
SELECT DISTINCT
customer_id,
order_date
FROM orders
WHERE status = 'completed';
-- Or use if you only want the latest:
SELECT DISTINCT ON (customer_id)
customer_id,
order_date
FROM orders
WHERE status = 'completed'
ORDER BY customer_id, order_date DESC;
-- PROBLEM: DISTINCT after complex JOINs
-- WRONG approach:
SELECT DISTINCT
cats.name,
cats.breed
FROM cats
JOIN orders ON cats.id = orders.cat_id
JOIN customers ON orders.customer_id = customers.id;
-- DISTINCT hiding that JOIN produced duplicates!
-- RIGHT approach: Understand WHY duplicates exist
WITH cat_customer_pairs AS (
-- First, see the raw joins
SELECT DISTINCT
cats.id as cat_id,
customers.id as customer_id
FROM cats
JOIN orders ON cats.id = orders.cat_id
JOIN customers ON orders.customer_id = customers.id
)
SELECT
cats.name,
cats.breed,
COUNT(*) as customer_count
FROM cat_customer_pairs
JOIN cats ON cat_customer_pairs.cat_id = cats.id
GROUP BY cats.id, cats.name, cats.breed;
-- Now we understand: multiple customers per cat!
-- When DISTINCT is the right tool
-- Counting unique customers who ordered:
SELECT COUNT(DISTINCT customer_id) as unique_customers
FROM orders
WHERE status = 'completed';
-- Finding unique city-breed combinations:
SELECT DISTINCT city, breed
FROM cats;
-- Intentionally asking for unique pairs
The Cat Analogy
Your cat shelter logs every interaction:
Wrong approach: See duplicates, add DISTINCT:
Log shows:
Whiskers visited 3 times this week
Mochi visited 2 times
Whiskers visited 3 times this week ← Wait, duplicate entry?
DISTINCT solution: "2 unique cats visited"
Problem: Whiskers was actually counted twice because of a logging bug!
Right approach: Fix the root cause:
Step 1: Why did Whiskers appear twice?
→ Log entry created twice by mistake
Step 2: Fix the logging system
Step 3: Now DISTINCT gives accurate count of 2
DISTINCT on bad data = wrong answer confirmed with false confidence.
Diagnostic Approach
Before Reaching for DISTINCT
-- Step 1: See what duplicates look like
SELECT
customer_id,
order_id,
COUNT(*)
FROM orders
GROUP BY customer_id, order_id
HAVING COUNT(*) > 1;
-- Are these true duplicates or valid multiple rows?
-- Step 2: Look at the raw rows
SELECT *
FROM orders
WHERE customer_id = :suspected_id;
-- What's causing the duplication?
-- Step 3: Fix the query, not with DISTINCT but with correct logic
Questions to Ask
- Are these true duplicates? (Same values in all columns)
- Should these rows be aggregated? (Different orders)
- Is the JOIN condition correct? (Missing relationship)
- Should a filter exclude these? (Wrong WHERE)
Exercises
Exercise 1
This query returns 150 rows. Explain what might be causing duplicates:
SELECT DISTINCT customers.name, orders.total_amount
FROM customers
JOIN orders ON customers.city = orders.id;
Exercise 2
When is DISTINCT the correct solution vs when is it hiding a problem?
Exercise 3
Write a query to diagnose why a query might return duplicate customer names.
Exercise 4
What does DISTINCT do under the hood? Why might it be slow on large datasets?
Exercise 5
Rewrite this query without DISTINCT, fixing the root cause:
SELECT DISTINCT c.name, c.email
FROM customers c
JOIN orders o ON c.city = o.customer_id;
Key Takeaways
- DISTINCT removes duplicates but often hides data quality problems
- Ask WHY duplicates exist before applying DISTINCT
- Common causes: missing JOIN conditions, wrong relationships, missing filters
- COUNT(DISTINCT column) is appropriate for counting unique values
- DISTINCT ON (PostgreSQL) can be powerful for getting first rows per group
- Use diagnostic queries to understand duplicates before treating them
- Fix the root cause rather than masking with DISTINCT