Home/Data Analyst/SQL Intermediate

DISTINCT Not Always the Solution

DISTINCT Not Always the Solution

Technical Explanation

DISTINCT removes duplicate rows from results, but it's often a band-aid solution that masks data quality problems rather than solving them. Using DISTINCT without understanding why duplicates exist is like silencing an alarm instead of fixing the fire.

When DISTINCT is Correct

Scenario Example
Counting unique entities COUNT(DISTINCT customer_id)
Getting unique pairs SELECT DISTINCT city, breed FROM cats
After a legitimate operation Union of two result sets

When DISTINCT is a Warning Sign

Problem DISTINCT Hides
Missing JOIN condition Cartesian product
Incorrect relationship Wrong table relationship
Missing WHERE filter Data you should have excluded
Application bug Duplicate inserts

Code Examples

Using the CatCafe dataset:

-- PROBLEM: Missing JOIN condition causes duplicates
-- WRONG:
SELECT
    c.name,
    o.total_amount
FROM customers c, orders o
WHERE c.city = 'Brooklyn';
-- Every customer matched with every order = duplicates!

-- RIGHT: Proper JOIN condition
SELECT
    c.name,
    o.total_amount
FROM customers c
JOIN orders o ON c.id = o.customer_id  -- Proper relationship
WHERE c.city = 'Brooklyn';

-- PROBLEM: Forgetting a filter
-- WRONG: Getting duplicate rows because of multiple statuses
SELECT DISTINCT
    customer_id,
    order_date
FROM orders;
-- If same customer has completed + pending orders, they appear twice!

-- RIGHT: Filter before selecting
SELECT DISTINCT
    customer_id,
    order_date
FROM orders
WHERE status = 'completed';
-- Or use if you only want the latest:
SELECT DISTINCT ON (customer_id)
    customer_id,
    order_date
FROM orders
WHERE status = 'completed'
ORDER BY customer_id, order_date DESC;

-- PROBLEM: DISTINCT after complex JOINs
-- WRONG approach:
SELECT DISTINCT
    cats.name,
    cats.breed
FROM cats
JOIN orders ON cats.id = orders.cat_id
JOIN customers ON orders.customer_id = customers.id;
-- DISTINCT hiding that JOIN produced duplicates!

-- RIGHT approach: Understand WHY duplicates exist
WITH cat_customer_pairs AS (
    -- First, see the raw joins
    SELECT DISTINCT
        cats.id as cat_id,
        customers.id as customer_id
    FROM cats
    JOIN orders ON cats.id = orders.cat_id
    JOIN customers ON orders.customer_id = customers.id
)
SELECT
    cats.name,
    cats.breed,
    COUNT(*) as customer_count
FROM cat_customer_pairs
JOIN cats ON cat_customer_pairs.cat_id = cats.id
GROUP BY cats.id, cats.name, cats.breed;
-- Now we understand: multiple customers per cat!

-- When DISTINCT is the right tool
-- Counting unique customers who ordered:
SELECT COUNT(DISTINCT customer_id) as unique_customers
FROM orders
WHERE status = 'completed';

-- Finding unique city-breed combinations:
SELECT DISTINCT city, breed
FROM cats;
-- Intentionally asking for unique pairs

The Cat Analogy

Your cat shelter logs every interaction:

Wrong approach: See duplicates, add DISTINCT:

Log shows:
Whiskers visited 3 times this week
Mochi visited 2 times
Whiskers visited 3 times this week  ← Wait, duplicate entry?

DISTINCT solution: "2 unique cats visited"
Problem: Whiskers was actually counted twice because of a logging bug!

Right approach: Fix the root cause:

Step 1: Why did Whiskers appear twice?
  → Log entry created twice by mistake

Step 2: Fix the logging system

Step 3: Now DISTINCT gives accurate count of 2

DISTINCT on bad data = wrong answer confirmed with false confidence.


Diagnostic Approach

Before Reaching for DISTINCT

-- Step 1: See what duplicates look like
SELECT
    customer_id,
    order_id,
    COUNT(*)
FROM orders
GROUP BY customer_id, order_id
HAVING COUNT(*) > 1;
-- Are these true duplicates or valid multiple rows?

-- Step 2: Look at the raw rows
SELECT *
FROM orders
WHERE customer_id = :suspected_id;
-- What's causing the duplication?

-- Step 3: Fix the query, not with DISTINCT but with correct logic

Questions to Ask

  1. Are these true duplicates? (Same values in all columns)
  2. Should these rows be aggregated? (Different orders)
  3. Is the JOIN condition correct? (Missing relationship)
  4. Should a filter exclude these? (Wrong WHERE)

Exercises

Exercise 1

This query returns 150 rows. Explain what might be causing duplicates:

SELECT DISTINCT customers.name, orders.total_amount
FROM customers
JOIN orders ON customers.city = orders.id;

Exercise 2

When is DISTINCT the correct solution vs when is it hiding a problem?

Exercise 3

Write a query to diagnose why a query might return duplicate customer names.

Exercise 4

What does DISTINCT do under the hood? Why might it be slow on large datasets?

Exercise 5

Rewrite this query without DISTINCT, fixing the root cause:

SELECT DISTINCT c.name, c.email
FROM customers c
JOIN orders o ON c.city = o.customer_id;

Key Takeaways

  • DISTINCT removes duplicates but often hides data quality problems
  • Ask WHY duplicates exist before applying DISTINCT
  • Common causes: missing JOIN conditions, wrong relationships, missing filters
  • COUNT(DISTINCT column) is appropriate for counting unique values
  • DISTINCT ON (PostgreSQL) can be powerful for getting first rows per group
  • Use diagnostic queries to understand duplicates before treating them
  • Fix the root cause rather than masking with DISTINCT