Home/Data Analyst/Data Cleaning & Validation

Missing Values Handling

Missing Values Handling

Technical Explanation

Missing values (NULLs) in data can occur due to optional fields, failed data collection, API limitations, or transformation errors. Handling them correctly is critical—ignoring NULLs can lead to incorrect analysis.

Types of Missing Data

Type Description Example
MCAR Missing completely at random Sensor glitch
MAR Missing at random (by other variables) Older people skip age question
MNAR Missing not at random High earners skip income

Handling Strategies

Strategy When to Use SQL Function
Exclude Missing is meaningful or small % WHERE col IS NOT NULL
Impute with default Missing means "none" COALESCE(col, 0)
Impute with average Missing is random COALESCE(col, AVG(col))
Flag and impute Need to track missingness CASE WHEN col IS NULL THEN 1 ELSE 0 END

Code Examples

Using the CatCafe dataset:

-- Table: customers
-- id, name, email, phone, city, registration_date, total_spent

-- Step 1: Assess missing data
SELECT
    COUNT(*) as total_rows,
    COUNT(*) - COUNT(phone) as phone_missing,
    COUNT(*) - COUNT(city) as city_missing,
    COUNT(*) - COUNT(total_spent) as total_spent_missing,
    COUNT(phone) * 100.0 / COUNT(*) as phone_pct,
    COUNT(city) * 100.0 / COUNT(*) as city_pct
FROM customers;

-- Step 2: Understand why data is missing
SELECT
    CASE WHEN phone IS NULL THEN 'Missing' ELSE 'Present' END as phone_status,
    AVG(total_spent) as avg_spent,
    COUNT(*) as customers
FROM customers
GROUP BY CASE WHEN phone IS NULL THEN 'Missing' ELSE 'Present' END;
-- Are customers without phones lower spenders? (MNAR?)

-- Strategy 1: Exclude NULLs when they're not relevant
SELECT
    c.name,
    c.email
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.phone IS NOT NULL;  -- Only customers we can contact

-- Strategy 2: COALESCE for default values
SELECT
    name,
    city,
    COALESCE(phone, 'No phone on file') as phone,
    COALESCE(total_spent, 0) as total_spent
FROM customers;

-- Strategy 3: Flag + Impute (track missingness)
SELECT
    name,
    CASE WHEN phone IS NULL THEN 1 ELSE 0 END as phone_missing_flag,
    COALESCE(phone, 'Unknown') as phone,
    CASE WHEN total_spent IS NULL THEN 0 ELSE total_spent END as total_spent_imputed
FROM customers;

-- Strategy 4: Impute with average (use carefully)
WITH avg_values AS (
    SELECT
        AVG(total_spent) as avg_spent
    FROM customers
    WHERE total_spent IS NOT NULL
)
SELECT
    name,
    COALESCE(total_spent, avg_values.avg_spent) as total_spent
FROM customers, avg_values;

-- Complex: Different imputation by segment
SELECT
    name,
    segment,
    total_spent,
    CASE
        WHEN total_spent IS NULL AND segment = 'Premium'
            THEN (SELECT AVG(total_spent) FROM customers WHERE segment = 'Premium' AND total_spent IS NOT NULL)
        WHEN total_spent IS NULL AND segment = 'Standard'
            THEN (SELECT AVG(total_spent) FROM customers WHERE segment = 'Standard' AND total_spent IS NOT NULL)
        ELSE total_spent
    END as total_spent_imputed
FROM customers;

-- Window functions with NULL handling
SELECT
    name,
    total_spent,
    AVG(total_spent) OVER () as overall_avg,
    total_spent - AVG(total_spent) OVER () as vs_avg,
    -- Only calculate difference for non-NULL values
    CASE WHEN total_spent IS NOT NULL
         THEN total_spent - AVG(total_spent) OVER ()
         ELSE NULL
    END as vs_avg_imputed
FROM customers;

The Cat Analogy

Missing cat data is like unknown cat behaviors:

MCAR (Missing Completely at Random):

Cat's favorite toy wasn't recorded
→ The missingness doesn't mean anything
→ Any imputation approach works

MAR (Missing at Random):

Older cats less likely to have "energy level" recorded
→ Missingness correlates with age
→ Impute based on age groups

MNAR (Missing Not at Random):

Cats with health issues less likely to have "activity level" recorded
→ Missingness IS the signal!
→ Don't impute, investigate why

Example handling:

  • Cat weight unknown: Impute with average (random)
  • Adoption fee unknown: COALESCE to 0 (means "free adoption")
  • Cat name unknown: Flag as "unnamed" (not random)

Common Pitfalls

Treating NULL as Zero or Empty String

-- WRONG: NULL is not 0
SELECT
    SUM(total_spent) / COUNT(*)  -- COUNT counts all rows
FROM customers;
-- vs
SELECT
    SUM(total_spent) / COUNT(total_spent)  -- COUNT ignores NULLs
FROM customers;
-- Different denominators!

-- RIGHT: Be explicit
SELECT
    SUM(COALESCE(total_spent, 0)) / COUNT(*)  -- Treat NULL as 0
FROM customers;

Imputing Without Checking Missingness Pattern

-- WRONG: Impute all NULLs with mean without checking pattern
SELECT
    AVG(COALESCE(total_spent, (SELECT AVG(total_spent) FROM customers)))
FROM customers;
-- This hides that maybe 50% of values are missing!

-- RIGHT: First understand WHY values are missing
SELECT
    COUNT(*) as total,
    COUNT(total_spent) as present,
    COUNT(*) - COUNT(total_spent) as missing,
    -- Check if missingness correlates with something
    AVG(CASE WHEN total_spent IS NULL THEN 1.0 ELSE 0.0 END) as missing_rate
FROM customers
GROUP BY city;

Exercises

Exercise 1

Analyze the missing value pattern in the customers table. Which columns have missing values and at what rate?

Exercise 2

Write a query that flags customers with missing phone numbers and imputes total_spent with segment averages.

Exercise 3

Why is it dangerous to impute all missing values with the global average?

Exercise 4

When would you choose to exclude rows with missing values vs imputing?

Exercise 5

Write a query using NULLIF to handle division when values might be zero or NULL.


Key Takeaways

  • Understand WHY data is missing (MCAR, MAR, MNAR) before choosing strategy
  • COALESCE replaces NULL with a value (use for "none" or "unknown")
  • Imputing with averages works for MCAR, not MNAR
  • Flag missingness when it might be informative
  • COUNT(*) vs COUNT(column) handle NULLs differently
  • Always assess missing data percentage before imputation