Home/Data Analyst/Data Cleaning & Validation

Duplicate Detection

Duplicate Detection

Technical Explanation

Duplicates are multiple rows representing the same entity or event. They can inflate metrics, skew analysis, and indicate data quality problems. Finding and resolving them is a fundamental data cleaning task.

Types of Duplicates

Type Description Example
Complete All columns identical Same order imported twice
Partial Key columns same Same customer with slightly different email
Logical Same entity, different representation "John Doe" vs "Johnny Doe"

Detection Strategies

  • COUNT(*) vs COUNT(DISTINCT) comparison
  • GROUP BY with HAVING COUNT(*) > 1
  • Window functions with ROW_NUMBER

Code Examples

Using the CatCafe dataset:

-- Table: customers
-- id, name, email, phone, city, registration_date

-- Strategy 1: GROUP BY for finding duplicate keys
SELECT
    email,
    COUNT(*) as occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

-- Find all duplicate rows
WITH duplicates AS (
    SELECT
        email,
        COUNT(*) as cnt
    FROM customers
    GROUP BY email
    HAVING COUNT(*) > 1
)
SELECT c.*
FROM customers c
JOIN duplicates d ON c.email = d.email
ORDER BY c.email, c.id;

-- Strategy 2: Window function for complete duplicates
WITH numbered AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY name, email, phone, city
            ORDER BY id
        ) as row_num
    FROM customers
)
SELECT *
FROM numbered
WHERE row_num > 1;  -- Keep only one, delete others

-- Strategy 3: Finding partial duplicates (similar names)
SELECT
    c1.id as id1,
    c1.name as name1,
    c2.id as id2,
    c2.name as name2,
    c1.email as email1,
    c2.email as email2
FROM customers c1
JOIN customers c2
    ON c1.id < c2.id  -- Avoid self-join duplicates
    AND c1.name = c2.name  -- Same name
    AND c1.email != c2.email;  -- Different email

-- Strategy 4: Fuzzy matching for similar emails (PostgreSQL)
SELECT
    c1.id as id1,
    c1.email as email1,
    c2.id as id2,
    c2.email as email2,
    LEVENSHTEIN(c1.email, c2.email) as edit_distance
FROM customers c1
JOIN customers c2
    ON c1.id < c2.id
    AND LEVENSHTEIN(c1.email, c2.email) <= 2  -- Within 2 edits
ORDER BY edit_distance;

-- Practical: Resolving duplicates by keeping most recent
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY registration_date DESC
        ) as priority
    FROM customers
)
SELECT *
FROM ranked
WHERE priority = 1;

-- Finding duplicate orders
SELECT
    customer_id,
    order_date,
    total_amount,
    COUNT(*) as duplicate_count
FROM orders
GROUP BY customer_id, order_date, total_amount
HAVING COUNT(*) > 1;

-- Cross-table duplicates: Same customer created multiple accounts
SELECT
    c1.id as id1,
    c1.email as email1,
    c1.name as name1,
    c2.id as id2,
    c2.email as email2,
    c2.name as name2
FROM customers c1
JOIN customers c2
    ON c1.id < c2.id
    AND (
        c1.email = c2.email
        OR (c1.phone = c2.phone AND c1.phone IS NOT NULL)
    )
ORDER BY c1.email;

The Cat Analogy

Finding duplicate cats at a shelter:

Complete Duplicate:

Row 1: "Whiskers" | Siamese | 3 years | 123 Cat Lane
Row 2: "Whiskers" | Siamese | 3 years | 123 Cat Lane

Same cat entered twice — one entry should be deleted.

Partial Duplicate (Same cat, different info):

Row 1: "Whiskers" | Siamese | 3 years | 123 Cat Lane
Row 2: "Whiskers" | Siamese | 2 years | 456 Dog Ave

Same cat, slightly different age — might be same cat or different cats.

How to resolve:

  1. Check if it's truly the same entity
  2. Decide which record to keep (usually most recent or most complete)
  3. Merge or delete

Common Pitfalls

Assuming All "Duplicates" Should Be Deleted

-- WRONG: Deleting valid separate events
-- Two orders by same customer on same day = legitimate!
SELECT *
FROM orders
WHERE (customer_id, order_date) IN (
    SELECT customer_id, order_date
    FROM orders
    GROUP BY customer_id, order_date
    HAVING COUNT(*) > 1
);
-- These might be different orders, not errors!

-- RIGHT: Only delete true duplicates
WITH true_duplicates AS (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, order_date, total_amount, status
            ORDER BY id
        ) as dup_num
    FROM orders
)
DELETE FROM orders
WHERE id IN (
    SELECT id FROM true_duplicates WHERE dup_num > 1
);

Not Checking Business Rules

-- Some "duplicates" are valid:
-- Same person adopted 2 cats = 2 adoption records = NOT duplicates
-- But same person + same cat + same date = likely duplicate

Exercises

Exercise 1

Write a query to find all duplicate customer emails, showing all columns of the duplicate rows.

Exercise 2

Write a query to find orders that might be duplicates (same customer, same date, same amount).

Exercise 3

What is the difference between a complete duplicate and a partial duplicate?

Exercise 4

Write a query to find customers with the same phone number.

Exercise 5

Write a deduplication query that keeps the most recent record based on registration_date.


Key Takeaways

  • Duplicates inflate metrics and skew analysis
  • Use GROUP BY + HAVING COUNT(*) > 1 for finding duplicate keys
  • Use ROW_NUMBER to identify which rows to keep
  • Distinguish true duplicates from legitimate multiple records
  • Check business rules before deleting "duplicates"
  • Document your deduplication logic for reproducibility