Data Validation Before Analysis
Technical Explanation
Data validation ensures data is accurate, complete, and reliable before you trust it for analysis. Skipping validation leads to incorrect conclusions and bad business decisions.
Validation Framework
| Check | What It Validates | Example |
|---|---|---|
| Completeness | No missing critical values | All orders have customer_id |
| Consistency | Data follows business rules | Date not in the future |
| Accuracy | Data reflects reality | Email has @ symbol |
| Uniqueness | No unexpected duplicates | One order per id |
| Timeliness | Data is current enough | Pipeline ran today |
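Most of these checks can be expressed as a single aggregate query. Here is a minimal runnable sketch covering completeness, consistency, and uniqueness (the table and rows are made up for illustration, and SQLite via Python's built-in sqlite3 stands in for the real database):

```python
import sqlite3
from datetime import date, timedelta

# In-memory database with a made-up orders table for illustration.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER, customer_id INTEGER,"
    " order_date TEXT, total_amount REAL)"
)
future_day = (date.today() + timedelta(days=2)).isoformat()
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2024-01-05", 25.0),    # clean row
        (2, None, "2024-01-06", 12.5),  # completeness issue: missing customer_id
        (3, 11, future_day, 9.0),       # consistency issue: future date
        (3, 12, "2024-01-07", -4.0),    # uniqueness + consistency issues
    ],
)

row = con.execute("""
    SELECT
        COUNT(*)                                                  AS total_rows,
        SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)      AS missing_customer_id,
        SUM(CASE WHEN total_amount <= 0 THEN 1 ELSE 0 END)        AS invalid_amounts,
        SUM(CASE WHEN order_date > DATE('now') THEN 1 ELSE 0 END) AS future_dates,
        COUNT(*) - COUNT(DISTINCT id)                             AS duplicate_ids
    FROM orders
""").fetchone()
print(dict(zip(["total_rows", "missing_customer_id", "invalid_amounts",
                "future_dates", "duplicate_ids"], row)))
# → {'total_rows': 4, 'missing_customer_id': 1, 'invalid_amounts': 1,
#    'future_dates': 1, 'duplicate_ids': 1}
```

Timeliness works the same way: compare `MAX(order_date)` against today, as shown in the SQL examples later in this section.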
Validation Principles
- Trust but verify — Don't assume data is correct
- Validate at source — Catch issues where they enter
- Document findings — Know what you've checked
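"Document findings" can be as simple as writing each check's result, with a timestamp, into a log table. A hypothetical sketch (the `validation_log` table and its columns are an assumption for illustration, not part of the CatCafe schema):

```python
import sqlite3
from datetime import datetime, timezone

con = sqlite3.connect(":memory:")
# Hypothetical log table; name and columns are illustrative only.
con.execute("""
    CREATE TABLE validation_log (
        checked_at TEXT,
        entity     TEXT,
        check_name TEXT,
        status     TEXT,
        bad_rows   INTEGER
    )
""")

def log_check(entity, check_name, bad_rows):
    """Record one validation result with a timestamp so findings are auditable."""
    status = "PASS" if bad_rows == 0 else "FAIL"
    con.execute(
        "INSERT INTO validation_log VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), entity, check_name, status, bad_rows),
    )
    return status

# Example: suppose a completeness check found 3 orders with NULL customer_id.
print(log_check("orders", "completeness:customer_id", 3))  # → FAIL
print(log_check("orders", "uniqueness:id", 0))             # → PASS
```

With results persisted like this, you can answer "when did we last check this table, and what did we find?" with a query instead of guesswork.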
Code Examples
Using the CatCafe dataset:
```sql
-- 1. Completeness: Check for NULLs in critical fields
SELECT
    'customers' as table_name,
    COUNT(*) as total_rows,
    COUNT(*) - COUNT(id) as id_nulls,
    COUNT(*) - COUNT(email) as email_nulls,
    COUNT(*) - COUNT(registration_date) as reg_date_nulls,
    COUNT(*) - COUNT(total_spent) as total_spent_nulls
FROM customers
UNION ALL
SELECT
    'orders',
    COUNT(*),
    COUNT(*) - COUNT(id),
    COUNT(*) - COUNT(customer_id),
    COUNT(*) - COUNT(order_date),
    COUNT(*) - COUNT(total_amount)
FROM orders;
```
```sql
-- 2. Consistency: Business rules validation
-- Orders should have positive amounts
SELECT
    COUNT(*) as total_orders,
    COUNT(*) FILTER (WHERE total_amount <= 0) as invalid_amount,
    COUNT(*) FILTER (WHERE total_amount < 0) as negative_amount,
    COUNT(*) FILTER (WHERE total_amount = 0) as zero_amount
FROM orders;
```
```sql
-- Dates should not be in the future
SELECT
    COUNT(*) as total_orders,
    COUNT(*) FILTER (WHERE order_date > CURRENT_DATE) as future_dates,
    COUNT(*) FILTER (WHERE order_date < '2020-01-01') as very_old_dates
FROM orders;
```
```sql
-- Customer_id should reference valid customers
-- (NOT EXISTS is used instead of NOT IN, which silently matches nothing
-- if customers.id ever contains a NULL)
SELECT
    COUNT(*) as total_orders,
    COUNT(*) FILTER (
        WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)
    ) as invalid_customer_refs
FROM orders o;
```
```sql
-- 3. Accuracy: Data format validation
-- Emails should contain @
SELECT
    COUNT(*) as total_customers,
    COUNT(*) FILTER (
        WHERE email NOT LIKE '%@%'
    ) as invalid_emails
FROM customers;
```
```sql
-- Phone numbers should be numeric (or a proper format)
SELECT
    COUNT(*) as total_customers,
    COUNT(*) FILTER (
        WHERE phone IS NOT NULL
          AND phone !~ '^[0-9() -]+$' -- PostgreSQL regex; hyphen last so it is literal
    ) as invalid_phones
FROM customers;
```
```sql
-- 4. Uniqueness: Primary key validation
SELECT
    'customers' as table_name,
    COUNT(*) as total_rows,
    COUNT(DISTINCT id) as unique_ids,
    COUNT(*) - COUNT(DISTINCT id) as duplicate_ids
FROM customers
UNION ALL
SELECT
    'orders',
    COUNT(*),
    COUNT(DISTINCT id),
    COUNT(*) - COUNT(DISTINCT id)
FROM orders;
```
```sql
-- 5. Timeliness: Check if data is current
SELECT
    'orders' as table_name,
    MIN(order_date) as oldest_record,
    MAX(order_date) as newest_record,
    CURRENT_DATE - MAX(order_date) as days_since_last_record
FROM orders;
```
```sql
-- Comprehensive validation report
WITH validation_results AS (
    SELECT
        'orders' as entity,
        CASE
            WHEN COUNT(*) FILTER (WHERE total_amount <= 0) > 0
                THEN 'FAIL: Invalid amounts found'
            WHEN COUNT(*) FILTER (WHERE order_date > CURRENT_DATE) > 0
                THEN 'FAIL: Future dates found'
            WHEN COUNT(*) FILTER (
                WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)
            ) > 0 THEN 'FAIL: Invalid customer references'
            WHEN COUNT(DISTINCT id) != COUNT(*)
                THEN 'FAIL: Duplicate IDs found'
            ELSE 'PASS'
        END as validation_status,
        COUNT(*) as total_records
    FROM orders o
)
SELECT
    entity,
    validation_status,
    total_records
FROM validation_results;
```
The Cat Analogy
Before a cat show, you validate each cat:
1. Completeness: Does it have a name? A microchip? A vaccination record?
   - Missing microchip → Can't compete; must find the owner first
2. Consistency: Is the age consistent with the registration?
   - Age says "2 years" but records show "born 2015" → Data error!
3. Accuracy: Does the weight seem right?
   - Listed weight 5 lbs but the cat clearly weighs 15 lbs → Measurement error
4. Uniqueness: No duplicate entries?
   - Same cat registered twice → Cancel one entry
5. Timeliness: Is the vaccination current?
   - Last vaccine 3 years ago → Needs an update before the show
You don't just assume all cats are valid — you check!
Common Pitfalls
Assuming ETL Validation is Enough
```sql
-- WRONG: Blindly trusting imported data
SELECT * FROM orders;
-- Assumed ETL caught all issues
```

```sql
-- RIGHT: Validate yourself before analysis
WITH issues AS (
    SELECT
        COUNT(*) FILTER (WHERE total_amount <= 0) +
        COUNT(*) FILTER (WHERE order_date > CURRENT_DATE) +
        COUNT(*) FILTER (
            WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)
        ) as total_issues
    FROM orders o
)
SELECT
    *,
    CASE WHEN total_issues > 0
        THEN 'WARNING: Data quality issues found'
        ELSE 'Data appears clean'
    END as data_quality_status
FROM issues;
```
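One way to make this routine is a small pre-analysis gate that refuses to proceed when any check fails. A sketch using Python's sqlite3 purely for illustration (the tables, rows, and function name are made up; in practice the connection would point at your real warehouse):

```python
import sqlite3

def run_validation_gate(con):
    """Run basic checks against orders; raise if any fail so analysis halts."""
    checks = {
        "non_positive_amounts":
            "SELECT COUNT(*) FROM orders WHERE total_amount <= 0",
        "future_dates":
            "SELECT COUNT(*) FROM orders WHERE order_date > DATE('now')",
        "orphaned_customer_refs":
            """SELECT COUNT(*) FROM orders o
               WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)""",
    }
    failures = {name: n for name, sql in checks.items()
                if (n := con.execute(sql).fetchone()[0]) > 0}
    if failures:
        raise ValueError(f"Data quality issues: {failures}")
    return "PASS"

# Made-up tables and rows to exercise the gate.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER)")
con.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER,"
            " order_date TEXT, total_amount REAL)")
con.execute("INSERT INTO customers VALUES (1)")
con.execute("INSERT INTO orders VALUES (1, 1, '2024-01-05', 25.0)")
print(run_validation_gate(con))  # clean data passes

con.execute("INSERT INTO orders VALUES (2, 99, '2024-01-06', -5.0)")  # bad row
try:
    run_validation_gate(con)
except ValueError as e:
    print("blocked:", e)  # orphaned reference and negative amount reported
```

Raising on failure (rather than just printing a warning) means a scheduled pipeline stops before bad data reaches a dashboard.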
Exercises
Exercise 1
Write a validation query to check that all orders have valid customer references and positive amounts.
Exercise 2
What is the difference between completeness and accuracy validation?
Exercise 3
Write a query to create a validation report for all tables showing pass/fail status.
Exercise 4
Why should you validate data even if the ETL process claims to validate it?
Exercise 5
Write a query to check for duplicate records in the orders table using multiple columns.
Key Takeaways
- Always validate before trusting data for analysis
- Key checks: completeness, consistency, accuracy, uniqueness, timeliness
- Document validation results with timestamps
- Don't assume ETL validation is sufficient
- Create automated validation reports
- Fix issues at the source when possible