Duplicate Detection
Duplicate Detection
Technical Explanation
Duplicates are multiple rows representing the same entity or event. They can inflate metrics, skew analysis, and indicate data quality problems. Finding and resolving them is a fundamental data cleaning task.
Types of Duplicates
| Type | Description | Example |
|---|---|---|
| Complete | All columns identical | Same order imported twice |
| Partial | Key columns same | Same customer with slightly different email |
| Logical | Same entity, different representation | "John Doe" vs "Johnny Doe" |
Detection Strategies
- COUNT(*) vs COUNT(DISTINCT) comparison
- GROUP BY with HAVING COUNT(*) > 1
- Window functions with ROW_NUMBER
Code Examples
Using the CatCafe dataset:
-- Table: customers
-- id, name, email, phone, city, registration_date
-- Strategy 1: GROUP BY for finding duplicate keys
SELECT
email,
COUNT(*) as occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
-- Find all duplicate rows
WITH duplicates AS (
SELECT
email,
COUNT(*) as cnt
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
)
SELECT c.*
FROM customers c
JOIN duplicates d ON c.email = d.email
ORDER BY c.email, c.id;
-- Strategy 2: Window function for complete duplicates
WITH numbered AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY name, email, phone, city
ORDER BY id
) as row_num
FROM customers
)
SELECT *
FROM numbered
WHERE row_num > 1; -- Keep only one, delete others
-- Strategy 3: Finding partial duplicates (similar names)
SELECT
c1.id as id1,
c1.name as name1,
c2.id as id2,
c2.name as name2,
c1.email as email1,
c2.email as email2
FROM customers c1
JOIN customers c2
ON c1.id < c2.id -- Avoid self-join duplicates
AND c1.name = c2.name -- Same name
AND c1.email != c2.email; -- Different email
-- Strategy 4: Fuzzy matching for similar emails (PostgreSQL)
SELECT
c1.id as id1,
c1.email as email1,
c2.id as id2,
c2.email as email2,
LEVENSHTEIN(c1.email, c2.email) as edit_distance
FROM customers c1
JOIN customers c2
ON c1.id < c2.id
AND LEVENSHTEIN(c1.email, c2.email) <= 2 -- Within 2 edits
ORDER BY edit_distance;
-- Practical: Resolving duplicates by keeping most recent
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY registration_date DESC
) as priority
FROM customers
)
SELECT *
FROM ranked
WHERE priority = 1;
-- Finding duplicate orders
SELECT
customer_id,
order_date,
total_amount,
COUNT(*) as duplicate_count
FROM orders
GROUP BY customer_id, order_date, total_amount
HAVING COUNT(*) > 1;
-- Cross-table duplicates: Same customer created multiple accounts
SELECT
c1.id as id1,
c1.email as email1,
c1.name as name1,
c2.id as id2,
c2.email as email2,
c2.name as name2
FROM customers c1
JOIN customers c2
ON c1.id < c2.id
AND (
c1.email = c2.email
OR (c1.phone = c2.phone AND c1.phone IS NOT NULL)
)
ORDER BY c1.email;
The Cat Analogy
Finding duplicate cats at a shelter:
Complete Duplicate:
Row 1: "Whiskers" | Siamese | 3 years | 123 Cat Lane
Row 2: "Whiskers" | Siamese | 3 years | 123 Cat Lane
Same cat entered twice — one entry should be deleted.
Partial Duplicate (Same cat, different info):
Row 1: "Whiskers" | Siamese | 3 years | 123 Cat Lane
Row 2: "Whiskers" | Siamese | 2 years | 456 Dog Ave
Same cat, slightly different age — might be same cat or different cats.
How to resolve:
- Check if it's truly the same entity
- Decide which record to keep (usually most recent or most complete)
- Merge or delete
Common Pitfalls
Assuming All "Duplicates" Should Be Deleted
-- WRONG: Deleting valid separate events
-- Two orders by same customer on same day = legitimate!
SELECT *
FROM orders
WHERE (customer_id, order_date) IN (
SELECT customer_id, order_date
FROM orders
GROUP BY customer_id, order_date
HAVING COUNT(*) > 1
);
-- These might be different orders, not errors!
-- RIGHT: Only delete true duplicates
WITH true_duplicates AS (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY customer_id, order_date, total_amount, status
ORDER BY id
) as dup_num
FROM orders
)
DELETE FROM orders
WHERE id IN (
SELECT id FROM true_duplicates WHERE dup_num > 1
);
Not Checking Business Rules
-- Some "duplicates" are valid:
-- Same person adopted 2 cats = 2 adoption records = NOT duplicates
-- But same person + same cat + same date = likely duplicate
Exercises
Exercise 1
Write a query to find all duplicate customer emails, showing all columns of the duplicate rows.
Exercise 2
Write a query to find orders that might be duplicates (same customer, same date, same amount).
Exercise 3
What is the difference between a complete duplicate and a partial duplicate?
Exercise 4
Write a query to find customers with the same phone number.
Exercise 5
Write a deduplication query that keeps the most recent record based on registration_date.
Key Takeaways
- Duplicates inflate metrics and skew analysis
- Use GROUP BY + HAVING COUNT(*) > 1 for finding duplicate keys
- Use ROW_NUMBER to identify which rows to keep
- Distinguish true duplicates from legitimate multiple records
- Check business rules before deleting "duplicates"
- Document your deduplication logic for reproducibility