Inconsistent Formatting
Inconsistent Formatting
Technical Explanation
Inconsistent formatting occurs when the same data is represented in different ways across records. It can cause JOIN failures, incorrect aggregations, and misleading metrics. Standardization is the process of making data consistent.
Common Inconsistencies
| Data Type | Issues | Examples |
|---|---|---|
| Names | Case, initials, titles | "John Doe" vs "JOHN DOE" vs "Doe, John" |
| Addresses | Abbreviations vs full | "St" vs "Street", "NY" vs "New York" |
| Phone numbers | Format, country codes | "(555) 123-4567" vs "555-123-4567" |
| Dates | Format, timezone | "01/02/2024" vs "Jan 2, 2024" |
| Categories | Synonyms, misspellings | "Cat Food" vs "catfood" vs "Kat Food" |
Code Examples
Using the CatCafe dataset:
-- Table: customers
-- id, name, email, city, state, phone, registration_date
-- Step 1: Find inconsistencies
SELECT DISTINCT city FROM customers ORDER BY city;
-- See: 'brooklyn', 'Brooklyn', 'BROOKLYN', 'Brooklyn ', 'Brooklyn Heights'
-- Step 2: Standardize text fields
SELECT
id,
-- Standardize names: Title case
INITCAP(TRIM(name)) as name_clean,
-- Standardize city: Title case, trim whitespace
INITCAP(TRIM(city)) as city_clean,
-- Standardize state: Upper case
UPPER(TRIM(state)) as state_clean
FROM customers;
-- Step 3: Standardize phone numbers
SELECT
id,
name,
-- Remove all non-digits
REGEXP_REPLACE(phone, '[^0-9]', '', 'g') as phone_digits,
-- Format as (XXX) XXX-XXXX
'(' ||
SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', '', 'g') FROM 1 FOR 3) || ') ' ||
SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', '', 'g') FROM 4 FOR 3) || '-' ||
SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', '', 'g') FROM 7 FOR 4)
as phone_formatted
FROM customers;
-- Step 4: Standardize categories with mapping
SELECT
id,
product_name,
CASE
-- Fix typos and synonyms
WHEN LOWER(category) IN ('catfood', 'cat food', 'catoof', 'katfood')
THEN 'Cat Food'
WHEN LOWER(category) IN ('toys', 'toy', 'playthings')
THEN 'Toys'
WHEN LOWER(category) IN ('accessories', 'accessory', 'acc')
THEN 'Accessories'
WHEN LOWER(category) IN ('treats', 'treat', 'snacks')
THEN 'Treats'
ELSE category
END as category_standardized
FROM products;
-- Step 5: Standardize dates (assuming different formats)
SELECT
id,
registration_date,
-- Convert various formats to standard DATE
CASE
WHEN registration_date LIKE '__/__/____' THEN
TO_DATE(registration_date, 'MM/DD/YYYY')
WHEN registration_date LIKE '____-__-__' THEN
TO_DATE(registration_date, 'YYYY-MM-DD')
WHEN registration_date LIKE '__-___-____' THEN
TO_DATE(registration_date, 'DD-Mon-YYYY')
ELSE registration_date::DATE
END as registration_date_clean
FROM customers;
-- Step 6: Clean and validate in one query
WITH cleaned AS (
SELECT
id,
TRIM(name) as name,
LOWER(TRIM(email)) as email, -- Lowercase emails
INITCAP(TRIM(city)) as city,
UPPER(TRIM(state)) as state,
REGEXP_REPLACE(phone, '[^0-9]', '', 'g') as phone_digits
FROM customers
)
SELECT
id,
name,
email,
city,
state,
CASE
WHEN LENGTH(phone_digits) = 10
THEN '(' || SUBSTRING(phone_digits FROM 1 FOR 3) || ') ' ||
SUBSTRING(phone_digits FROM 4 FOR 3) || '-' ||
SUBSTRING(phone_digits FROM 7 FOR 4)
ELSE phone_digits -- Return as-is if not 10 digits
END as phone_formatted
FROM cleaned;
The Cat Analogy
Cat names have thousands of spellings:
Database shows:
- "Whiskers" (correct)
- "whiskers" (lowercase)
- "WHISKERS" (all caps)
- " Whiskers " (spaces)
- "Whiskrs" (typo)
- "Mr. Whiskers" (title added)
- "whiskers " (trailing space)
- "wHisKeRs" (random case)
All supposed to be the same cat! But queries for "Whiskers" miss "whiskers" and "WHISKERS."
Standardization = making them all "Whiskers":
INITCAP(TRIM(name))
→ "whiskers" → "Whiskers"
→ " WHISKERS " → "Whiskers"
→ "Mr. Whiskers" → "Mr. Whiskers" (might need special handling)
Common Pitfalls
Over-Standardizing
-- WRONG: Standardizing things that should stay distinct
SELECT
-- Destroying meaningful case differences!
LOWER(name) as name
FROM users;
-- "john@example.com" vs "John@Example.COM" = same user, no issue
-- But user names might need case preserved
-- RIGHT: Standardize only what's needed for matching/joining
SELECT
LOWER(email) as email, -- For matching
name -- Keep original case
FROM users;
Not Handling Edge Cases
-- WRONG: Assuming standard format
SELECT *
FROM customers
WHERE city = 'Brooklyn';
-- Misses 'Brooklyn ' (trailing space) and 'brooklyn'
-- RIGHT: Clean before comparing
SELECT *
FROM customers
WHERE LOWER(TRIM(city)) = 'brooklyn';
Exercises
Exercise 1
Write a query to find all distinct city values and identify which ones might be the same city despite different formatting.
Exercise 2
Write a query to standardize phone numbers to format (XXX) XXX-XXXX.
Exercise 3
What inconsistencies might exist in email addresses? How would you handle them?
Exercise 4
Write a query to standardize product categories using CASE statements.
Exercise 5
Why might you want to preserve the original value while adding a cleaned version?
Key Takeaways
- Inconsistent formatting causes JOIN failures and incorrect counts
- Use TRIM, UPPER, LOWER, INITCAP for standardization
- Standardize before comparing or joining
- Create a cleaned column while preserving the original
- Handle edge cases (trailing spaces, nulls, wrong formats)
- Document standardization rules for reproducibility