Home/Data Analyst/Data Cleaning & Validation

Text Standardization

Text Standardization

Technical Explanation

Text standardization ensures consistent text formatting across your data. This includes case normalization, whitespace handling, special character removal, and canonical forms for categories.

Why Text Inconsistency Matters

"Brooklyn" ≠ "brooklyn" ≠ "BROOKLYN" ≠ "Brooklyn "
→ JOIN fails, GROUP BY creates separate buckets, counts are wrong

Standardization Techniques

Technique SQL Function Example
Trim whitespace TRIM, LTRIM, RTRIM " Brooklyn " → "Brooklyn"
Case normalization UPPER, LOWER, INITCAP "john" → "John"
Remove special chars REGEXP_REPLACE "(555) 123-4567" → "5551234567"
Canonical forms CASE mapping "NY" → "New York"

Code Examples

Using the CatCafe dataset:

-- Table: customers
-- id, name, email, city, state, address

-- 1. Trim whitespace
SELECT
    id,
    name,
    TRIM(name) as name_trimmed,
    LTRIM(RTRIM(city)) as city_trimmed,
    TRIM(address) as address_trimmed
FROM customers;

-- 2. Case normalization
SELECT
    id,
    name,
    INITCAP(TRIM(name)) as name_proper,     -- "john doe" → "John Doe"
    LOWER(TRIM(email)) as email_lower,       -- For matching
    UPPER(TRIM(state)) as state_upper        -- Standard state codes
FROM customers;

-- 3. Remove unwanted characters
SELECT
    id,
    name,
    -- Remove multiple spaces
    REGEXP_REPLACE(name, '\s+', ' ', 'g') as name_normalized,
    -- Remove special characters from phone
    REGEXP_REPLACE(phone, '[^0-9a-zA-Z]', '', 'g') as phone_clean,
    -- Remove non-printable characters
    REGEXP_REPLACE(name, '[\x00-\x1F\x7F]', '', 'g') as name_clean
FROM customers;

-- 4. Create canonical categories
SELECT
    id,
    product_name,
    CASE
        -- Normalize variations
        WHEN LOWER(category) LIKE '%food%'
            AND LOWER(category) LIKE '%cat%' THEN 'Cat Food'
        WHEN LOWER(category) LIKE '%food%'
            AND LOWER(category) LIKE '%dog%' THEN 'Dog Food'
        WHEN LOWER(category) LIKE '%toy%' THEN 'Toys'
        WHEN LOWER(category) LIKE '%treat%'
            OR LOWER(category) LIKE '%snack%' THEN 'Treats'
        WHEN LOWER(category) LIKE '%bed%'
            OR LOWER(category) LIKE '%furniture%' THEN 'Beds & Furniture'
        ELSE category
    END as category_canonical
FROM products;

-- 5. Standardize state abbreviations
SELECT
    id,
    city,
    CASE
        WHEN UPPER(TRIM(state)) IN ('NY', 'NEW YORK')
            THEN 'NY'
        WHEN UPPER(TRIM(state)) IN ('CA', 'CALIFORNIA')
            THEN 'CA'
        WHEN UPPER(TRIM(state)) IN ('TX', 'TEXAS')
            THEN 'TX'
        WHEN UPPER(TRIM(state)) IN ('FL', 'FLORIDA')
            THEN 'FL'
        ELSE UPPER(TRIM(state))
    END as state_standardized
FROM customers;

-- 6. Full standardization workflow
WITH stage1 AS (
    SELECT
        id,
        TRIM(name) as name_raw,
        LOWER(TRIM(email)) as email_raw,
        TRIM(city) as city_raw,
        UPPER(TRIM(state)) as state_raw,
        REGEXP_REPLACE(phone, '[^0-9]', '', 'g') as phone_digits
    FROM customers
),
stage2 AS (
    SELECT
        id,
        INITCAP(name_raw) as name,
        email_raw as email,
        INITCAP(city_raw) as city,
        state_raw as state,
        CASE
            WHEN LENGTH(phone_digits) = 10
                THEN '(' || SUBSTRING(phone_digits FROM 1 FOR 3) || ') '
                    || SUBSTRING(phone_digits FROM 4 FOR 3) || '-'
                    || SUBSTRING(phone_digits FROM 7 FOR 4)
            ELSE phone_digits
        END as phone
    FROM stage1
)
SELECT * FROM stage2;

The Cat Analogy

Cats have thousands of name variations:

You ask: "Who wants treats?"
Responses heard:
  - "whiskers" (whisker cat)
  - "WHISKERS" (loud cat)
  - " Whiskers" (shy cat)
  - "whiskers " (forgetful cat)
  - "whiskrs" (typo)
  - "Mr. Whiskers" (formal cat)

All are the same cat, but your system treats them as 6 different cats!

Standardization = same answer:

TRIM → Remove spaces: " Whiskers " → "Whiskers"
LOWER → Remove case: "WHISKERS" → "whiskers"
INITCAP → Proper case: "whiskers" → "Whiskers"
→ All variations become "Whiskers" = treat time!

Common Pitfalls

Over-Normalizing

-- WRONG: Destroying meaningful information
SELECT
    LOWER(name) as name
FROM users;
-- "john@example.com" vs "JOHN@example.com" might be different users

-- RIGHT: Standardize for matching, preserve for display
SELECT
    name as name_display,
    LOWER(email) as email_for_matching
FROM users;

Not Handling NULLs

-- TRIM on NULL returns NULL, not empty string
SELECT TRIM(NULL);  -- Returns NULL

-- Handle NULLs explicitly
SELECT
    COALESCE(TRIM(name), 'Unknown') as name
FROM customers;

Exercises

Exercise 1

Write a query to standardize all customer names to proper case (INITCAP) and remove extra whitespace.

Exercise 2

Write a query to clean and standardize phone numbers to format (XXX) XXX-XXXX.

Exercise 3

Why might you want to keep the original text AND add a standardized column?

Exercise 4

Write a query to standardize product categories using multiple matching criteria.

Exercise 5

What challenges arise when standardizing international addresses?


Key Takeaways

  • Text inconsistency causes JOIN failures and incorrect aggregations
  • Always TRIM before comparing text
  • Use LOWER for matching, INITCAP for display
  • Create mapping tables for canonical categories
  • Keep original and standardized versions
  • Document standardization rules for reproducibility