Home/Data Analyst/Data Cleaning & Validation

Inconsistent Formatting

Inconsistent Formatting

Technical Explanation

Inconsistent formatting occurs when the same data is represented in different ways across records. It can cause JOIN failures, incorrect aggregations, and misleading metrics. Standardization is the process of making data consistent.

Common Inconsistencies

Data Type Issues Examples
Names Case, initials, titles "John Doe" vs "JOHN DOE" vs "Doe, John"
Addresses Abbreviations vs full "St" vs "Street", "NY" vs "New York"
Phone numbers Format, country codes "(555) 123-4567" vs "555-123-4567"
Dates Format, timezone "01/02/2024" vs "Jan 2, 2024"
Categories Synonyms, misspellings "Cat Food" vs "catfood" vs "Kat Food"

Code Examples

Using the CatCafe dataset:

-- Table: customers
-- id, name, email, city, state, phone, registration_date

-- Step 1: Find inconsistencies
SELECT DISTINCT city FROM customers ORDER BY city;
-- See: 'brooklyn', 'Brooklyn', 'BROOKLYN', 'Brooklyn ', 'Brooklyn Heights'

-- Step 2: Standardize text fields
SELECT
    id,
    -- Standardize names: Title case
    INITCAP(TRIM(name)) as name_clean,
    -- Standardize city: Title case, trim whitespace
    INITCAP(TRIM(city)) as city_clean,
    -- Standardize state: Upper case
    UPPER(TRIM(state)) as state_clean
FROM customers;

-- Step 3: Standardize phone numbers
SELECT
    id,
    name,
    -- Remove all non-digits
    REGEXP_REPLACE(phone, '[^0-9]', '', 'g') as phone_digits,
    -- Format as (XXX) XXX-XXXX
    '(' ||
    SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', '', 'g') FROM 1 FOR 3) || ') ' ||
    SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', '', 'g') FROM 4 FOR 3) || '-' ||
    SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', '', 'g') FROM 7 FOR 4)
    as phone_formatted
FROM customers;

-- Step 4: Standardize categories with mapping
SELECT
    id,
    product_name,
    CASE
        -- Fix typos and synonyms
        WHEN LOWER(category) IN ('catfood', 'cat food', 'catoof', 'katfood')
            THEN 'Cat Food'
        WHEN LOWER(category) IN ('toys', 'toy', 'playthings')
            THEN 'Toys'
        WHEN LOWER(category) IN ('accessories', 'accessory', 'acc')
            THEN 'Accessories'
        WHEN LOWER(category) IN ('treats', 'treat', 'snacks')
            THEN 'Treats'
        ELSE category
    END as category_standardized
FROM products;

-- Step 5: Standardize dates (assuming different formats)
SELECT
    id,
    registration_date,
    -- Convert various formats to standard DATE
    CASE
        WHEN registration_date LIKE '__/__/____' THEN
            TO_DATE(registration_date, 'MM/DD/YYYY')
        WHEN registration_date LIKE '____-__-__' THEN
            TO_DATE(registration_date, 'YYYY-MM-DD')
        WHEN registration_date LIKE '__-___-____' THEN
            TO_DATE(registration_date, 'DD-Mon-YYYY')
        ELSE registration_date::DATE
    END as registration_date_clean
FROM customers;

-- Step 6: Clean and validate in one query
WITH cleaned AS (
    SELECT
        id,
        TRIM(name) as name,
        LOWER(TRIM(email)) as email,  -- Lowercase emails
        INITCAP(TRIM(city)) as city,
        UPPER(TRIM(state)) as state,
        REGEXP_REPLACE(phone, '[^0-9]', '', 'g') as phone_digits
    FROM customers
)
SELECT
    id,
    name,
    email,
    city,
    state,
    CASE
        WHEN LENGTH(phone_digits) = 10
            THEN '(' || SUBSTRING(phone_digits FROM 1 FOR 3) || ') ' ||
                 SUBSTRING(phone_digits FROM 4 FOR 3) || '-' ||
                 SUBSTRING(phone_digits FROM 7 FOR 4)
        ELSE phone_digits  -- Return as-is if not 10 digits
    END as phone_formatted
FROM cleaned;

The Cat Analogy

Cat names have thousands of spellings:

Database shows:
- "Whiskers" (correct)
- "whiskers" (lowercase)
- "WHISKERS" (all caps)
- " Whiskers " (spaces)
- "Whiskrs" (typo)
- "Mr. Whiskers" (title added)
- "whiskers " (trailing space)
- "wHisKeRs" (random case)

All supposed to be the same cat! But queries for "Whiskers" miss "whiskers" and "WHISKERS."

Standardization = making them all "Whiskers":

INITCAP(TRIM(name))
→ "whiskers" → "Whiskers"
→ " WHISKERS " → "Whiskers"
→ "Mr. Whiskers" → "Mr. Whiskers" (might need special handling)

Common Pitfalls

Over-Standardizing

-- WRONG: Standardizing things that should stay distinct
SELECT
    -- Destroying meaningful case differences!
    LOWER(name) as name
FROM users;
-- "john@example.com" vs "John@Example.COM" = same user, no issue
-- But user names might need case preserved

-- RIGHT: Standardize only what's needed for matching/joining
SELECT
    LOWER(email) as email,  -- For matching
    name  -- Keep original case
FROM users;

Not Handling Edge Cases

-- WRONG: Assuming standard format
SELECT *
FROM customers
WHERE city = 'Brooklyn';
-- Misses 'Brooklyn ' (trailing space) and 'brooklyn'

-- RIGHT: Clean before comparing
SELECT *
FROM customers
WHERE LOWER(TRIM(city)) = 'brooklyn';

Exercises

Exercise 1

Write a query to find all distinct city values and identify which ones might be the same city despite different formatting.

Exercise 2

Write a query to standardize phone numbers to format (XXX) XXX-XXXX.

Exercise 3

What inconsistencies might exist in email addresses? How would you handle them?

Exercise 4

Write a query to standardize product categories using CASE statements.

Exercise 5

Why might you want to preserve the original value while adding a cleaned version?


Key Takeaways

  • Inconsistent formatting causes JOIN failures and incorrect counts
  • Use TRIM, UPPER, LOWER, INITCAP for standardization
  • Standardize before comparing or joining
  • Create a cleaned column while preserving the original
  • Handle edge cases (trailing spaces, nulls, wrong formats)
  • Document standardization rules for reproducibility