Date Validation
Date Validation
Technical Explanation
Date validation ensures date values are valid, consistent, and within expected ranges. Invalid dates can indicate data entry errors, timezone issues, or pipeline problems.
Common Date Issues
| Issue | Example | Problem |
|---|---|---|
| Future dates | order_date = 2099 | Data entry error |
| Preposterous dates | birth_date = 1850 | Typo or format error |
| Format inconsistencies | "01/02/2024" vs "Jan 2, 2024" | Ambiguous interpretation |
| Invalid dates | Feb 30, 2024 | Doesn't exist |
| Timezone issues | Same event = different dates | Reporting inconsistencies |
Validation Rules
- Date within expected range
- Date not in future (unless scheduled)
- Date format consistent
- No invalid dates (e.g., Feb 30)
Code Examples
Using the CatCafe dataset:
-- Table: orders (id, customer_id, order_date, created_at)
-- Table: customers (id, registration_date, birth_date)
-- 1. Check for future dates (likely errors)
SELECT
COUNT(*) as total_orders,
COUNT(*) FILTER (WHERE order_date > CURRENT_DATE) as future_dates,
COUNT(*) FILTER (
WHERE order_date > CURRENT_DATE + INTERVAL '1 day'
) as clearly_wrong
FROM orders;
-- Find the problematic records
SELECT
id,
order_date,
customer_id
FROM orders
WHERE order_date > CURRENT_DATE
ORDER BY order_date DESC;
-- 2. Check for preposterously old dates
SELECT
MIN(registration_date) as earliest_date,
MAX(registration_date) as latest_date
FROM customers;
-- If earliest is before 2000, investigate
-- Find suspicious birth dates
SELECT
id,
name,
birth_date,
AGE(birth_date) as age -- PostgreSQL
FROM customers
WHERE birth_date < '2000-01-01'
OR birth_date > CURRENT_DATE;
-- 3. Check for invalid date formats (string dates)
-- Assuming date_stored_as_text column
SELECT
COUNT(*) as total,
COUNT(*) FILTER (
WHERE date_text !~ '^\d{4}-\d{2}-\d{2}$'
) as invalid_format
FROM customer_events; -- If stored as text
-- 4. Validate date ranges within business context
-- Orders should be after customer registration
SELECT
COUNT(*) as total_orders,
COUNT(*) FILTER (
WHERE o.order_date < c.registration_date
) as orders_before_registration
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- Find violations
SELECT
o.id as order_id,
o.order_date,
c.name,
c.registration_date
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date < c.registration_date;
-- 5. Check for incomplete dates
SELECT
COUNT(*) as total_records,
COUNT(*) FILTER (
WHERE EXTRACT(YEAR FROM order_date) IS NULL
) as missing_year,
COUNT(*) FILTER (
WHERE EXTRACT(MONTH FROM order_date) IS NULL
) as missing_month,
COUNT(*) FILTER (
WHERE EXTRACT(DAY FROM order_date) IS NULL
) as missing_day
FROM orders;
-- 6. Timezone validation (if applicable)
-- Check if created_at and order_date are on same day (timezone edge cases)
SELECT
COUNT(*) as total,
COUNT(*) FILTER (
WHERE order_date::DATE != created_at::DATE
) as different_date
FROM orders;
-- 7. Comprehensive date validation report
WITH date_validation AS (
SELECT
'orders' as table_name,
COUNT(*) FILTER (WHERE order_date > CURRENT_DATE) as future_dates,
COUNT(*) FILTER (WHERE order_date < '2020-01-01') as very_old,
COUNT(*) FILTER (WHERE order_date IS NULL) as null_dates
FROM orders
UNION ALL
SELECT
'customers',
0, -- No future dates expected for birth dates
COUNT(*) FILTER (WHERE birth_date < '1950-01-01'),
COUNT(*) FILTER (WHERE birth_date IS NULL)
FROM customers
)
SELECT
table_name,
future_dates,
very_old,
null_dates,
CASE
WHEN future_dates > 0 OR very_old > 0 OR null_dates > 0
THEN 'FAIL'
ELSE 'PASS'
END as validation_status
FROM date_validation;
The Cat Analogy
Cat birth dates need validation:
Whiskers: Born "2025-01-15" → Future date! ERROR
Mochi: Born "1850-03-20" → 175 years old? ERROR
Luna: Born "Feb 30, 2024" → Feb doesn't have 30 days! ERROR
Shadow: Born "" → Missing! ERROR
Validations to run:
- Not in the future (cats born tomorrow don't exist)
- Not impossibly old (175-year-old cats are mythical)
- Actually exists (February 30th is not a real date)
- Not missing (birth date is required)
Common Pitfalls
Assuming Date Formats
-- WRONG: Assuming standard format
SELECT *
FROM events
WHERE event_date >= '2024-01-01'; -- May fail if format is different
-- RIGHT: Use explicit casting
SELECT *
FROM events
WHERE event_date::DATE >= '2024-01-01'::DATE;
Ignoring Timezone
-- Same order logged at different times appears on different days
SELECT
order_id,
created_at,
created_at AT TIME ZONE 'UTC' as utc_date
FROM orders;
-- Always normalize timezones for reporting
Exercises
Exercise 1
Write a query to find orders with dates in the future or before 2020.
Exercise 2
Write a query to check if any orders occurred before the customer registered.
Exercise 3
What date validation would you add for a birth_date field?
Exercise 4
Write a query to find records with missing or NULL dates.
Exercise 5
Why might the same event have different dates depending on timezone?
Key Takeaways
- Check for future dates (likely errors)
- Verify dates are within reasonable ranges
- Handle timezone inconsistencies
- Use proper date casting, don't assume formats
- Validate dates are real (Feb 30 doesn't exist)
- Create validation reports for audit trails