Text Standardization
Text Standardization
Technical Explanation
Text standardization ensures consistent text formatting across your data. This includes case normalization, whitespace handling, special character removal, and canonical forms for categories.
Why Text Inconsistency Matters
"Brooklyn" ≠ "brooklyn" ≠ "BROOKLYN" ≠ "Brooklyn "
→ JOIN fails, GROUP BY creates separate buckets, counts are wrong
Standardization Techniques
| Technique | SQL Function | Example |
|---|---|---|
| Trim whitespace | TRIM, LTRIM, RTRIM | " Brooklyn " → "Brooklyn" |
| Case normalization | UPPER, LOWER, INITCAP | "john" → "John" |
| Remove special chars | REGEXP_REPLACE | "(555) 123-4567" → "5551234567" |
| Canonical forms | CASE mapping | "NY" → "New York" |
Code Examples
Using the CatCafe dataset:
-- Table: customers
-- id, name, email, city, state, address
-- 1. Trim whitespace
SELECT
id,
name,
TRIM(name) as name_trimmed,
LTRIM(RTRIM(city)) as city_trimmed,
TRIM(address) as address_trimmed
FROM customers;
-- 2. Case normalization
SELECT
id,
name,
INITCAP(TRIM(name)) as name_proper, -- "john doe" → "John Doe"
LOWER(TRIM(email)) as email_lower, -- For matching
UPPER(TRIM(state)) as state_upper -- Standard state codes
FROM customers;
-- 3. Remove unwanted characters
SELECT
id,
name,
-- Remove multiple spaces
REGEXP_REPLACE(name, '\s+', ' ', 'g') as name_normalized,
-- Remove special characters from phone
REGEXP_REPLACE(phone, '[^0-9a-zA-Z]', '', 'g') as phone_clean,
-- Remove non-printable characters
REGEXP_REPLACE(name, '[\x00-\x1F\x7F]', '', 'g') as name_clean
FROM customers;
-- 4. Create canonical categories
SELECT
id,
product_name,
CASE
-- Normalize variations
WHEN LOWER(category) LIKE '%food%'
AND LOWER(category) LIKE '%cat%' THEN 'Cat Food'
WHEN LOWER(category) LIKE '%food%'
AND LOWER(category) LIKE '%dog%' THEN 'Dog Food'
WHEN LOWER(category) LIKE '%toy%' THEN 'Toys'
WHEN LOWER(category) LIKE '%treat%'
OR LOWER(category) LIKE '%snack%' THEN 'Treats'
WHEN LOWER(category) LIKE '%bed%'
OR LOWER(category) LIKE '%furniture%' THEN 'Beds & Furniture'
ELSE category
END as category_canonical
FROM products;
-- 5. Standardize state abbreviations
SELECT
id,
city,
CASE
WHEN UPPER(TRIM(state)) IN ('NY', 'NEW YORK')
THEN 'NY'
WHEN UPPER(TRIM(state)) IN ('CA', 'CALIFORNIA')
THEN 'CA'
WHEN UPPER(TRIM(state)) IN ('TX', 'TEXAS')
THEN 'TX'
WHEN UPPER(TRIM(state)) IN ('FL', 'FLORIDA')
THEN 'FL'
ELSE UPPER(TRIM(state))
END as state_standardized
FROM customers;
-- 6. Full standardization workflow
WITH stage1 AS (
SELECT
id,
TRIM(name) as name_raw,
LOWER(TRIM(email)) as email_raw,
TRIM(city) as city_raw,
UPPER(TRIM(state)) as state_raw,
REGEXP_REPLACE(phone, '[^0-9]', '', 'g') as phone_digits
FROM customers
),
stage2 AS (
SELECT
id,
INITCAP(name_raw) as name,
email_raw as email,
INITCAP(city_raw) as city,
state_raw as state,
CASE
WHEN LENGTH(phone_digits) = 10
THEN '(' || SUBSTRING(phone_digits FROM 1 FOR 3) || ') '
|| SUBSTRING(phone_digits FROM 4 FOR 3) || '-'
|| SUBSTRING(phone_digits FROM 7 FOR 4)
ELSE phone_digits
END as phone
FROM stage1
)
SELECT * FROM stage2;
The Cat Analogy
Cats have thousands of name variations:
You ask: "Who wants treats?"
Responses heard:
- "whiskers" (whisker cat)
- "WHISKERS" (loud cat)
- " Whiskers" (shy cat)
- "whiskers " (forgetful cat)
- "whiskrs" (typo)
- "Mr. Whiskers" (formal cat)
All are the same cat, but your system treats them as 6 different cats!
Standardization = same answer:
TRIM → Remove spaces: " Whiskers " → "Whiskers"
LOWER → Remove case: "WHISKERS" → "whiskers"
INITCAP → Proper case: "whiskers" → "Whiskers"
→ All variations become "Whiskers" = treat time!
Common Pitfalls
Over-Normalizing
-- WRONG: Destroying meaningful information
SELECT
LOWER(name) as name
FROM users;
-- "john@example.com" vs "JOHN@example.com" might be different users
-- RIGHT: Standardize for matching, preserve for display
SELECT
name as name_display,
LOWER(email) as email_for_matching
FROM users;
Not Handling NULLs
-- TRIM on NULL returns NULL, not empty string
SELECT TRIM(NULL); -- Returns NULL
-- Handle NULLs explicitly
SELECT
COALESCE(TRIM(name), 'Unknown') as name
FROM customers;
Exercises
Exercise 1
Write a query to standardize all customer names to proper case (INITCAP) and remove extra whitespace.
Exercise 2
Write a query to clean and standardize phone numbers to format (XXX) XXX-XXXX.
Exercise 3
Why might you want to keep the original text AND add a standardized column?
Exercise 4
Write a query to standardize product categories using multiple matching criteria.
Exercise 5
What challenges arise when standardizing international addresses?
Key Takeaways
- Text inconsistency causes JOIN failures and incorrect aggregations
- Always TRIM before comparing text
- Use LOWER for matching, INITCAP for display
- Create mapping tables for canonical categories
- Keep original and standardized versions
- Document standardization rules for reproducibility