Home/Data Analyst/Data Cleaning & Validation

Date Validation

Date Validation

Technical Explanation

Date validation ensures date values are valid, consistent, and within expected ranges. Invalid dates can indicate data entry errors, timezone issues, or pipeline problems.

Common Date Issues

Issue Example Problem
Future dates order_date = 2099 Data entry error
Preposterous dates birth_date = 1850 Typo or format error
Format inconsistencies "01/02/2024" vs "Jan 2, 2024" Ambiguous interpretation
Invalid dates Feb 30, 2024 Doesn't exist
Timezone issues Same event = different dates Reporting inconsistencies

Validation Rules

  • Date within expected range
  • Date not in future (unless scheduled)
  • Date format consistent
  • No invalid dates (e.g., Feb 30)

Code Examples

Using the CatCafe dataset:

-- Table: orders (id, customer_id, order_date, created_at)
-- Table: customers (id, registration_date, birth_date)

-- 1. Check for future dates (likely errors)
SELECT
    COUNT(*) as total_orders,
    COUNT(*) FILTER (WHERE order_date > CURRENT_DATE) as future_dates,
    COUNT(*) FILTER (
        WHERE order_date > CURRENT_DATE + INTERVAL '1 day'
    ) as clearly_wrong
FROM orders;

-- Find the problematic records
SELECT
    id,
    order_date,
    customer_id
FROM orders
WHERE order_date > CURRENT_DATE
ORDER BY order_date DESC;

-- 2. Check for preposterously old dates
SELECT
    MIN(registration_date) as earliest_date,
    MAX(registration_date) as latest_date
FROM customers;
-- If earliest is before 2000, investigate

-- Find suspicious birth dates
SELECT
    id,
    name,
    birth_date,
    AGE(birth_date) as age  -- PostgreSQL
FROM customers
WHERE birth_date < '2000-01-01'
    OR birth_date > CURRENT_DATE;

-- 3. Check for invalid date formats (string dates)
-- Assuming date_stored_as_text column
SELECT
    COUNT(*) as total,
    COUNT(*) FILTER (
        WHERE date_text !~ '^\d{4}-\d{2}-\d{2}$'
    ) as invalid_format
FROM customer_events;  -- If stored as text

-- 4. Validate date ranges within business context
-- Orders should be after customer registration
SELECT
    COUNT(*) as total_orders,
    COUNT(*) FILTER (
        WHERE o.order_date < c.registration_date
    ) as orders_before_registration
FROM orders o
JOIN customers c ON o.customer_id = c.id;

-- Find violations
SELECT
    o.id as order_id,
    o.order_date,
    c.name,
    c.registration_date
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date < c.registration_date;

-- 5. Check for incomplete dates
SELECT
    COUNT(*) as total_records,
    COUNT(*) FILTER (
        WHERE EXTRACT(YEAR FROM order_date) IS NULL
    ) as missing_year,
    COUNT(*) FILTER (
        WHERE EXTRACT(MONTH FROM order_date) IS NULL
    ) as missing_month,
    COUNT(*) FILTER (
        WHERE EXTRACT(DAY FROM order_date) IS NULL
    ) as missing_day
FROM orders;

-- 6. Timezone validation (if applicable)
-- Check if created_at and order_date are on same day (timezone edge cases)
SELECT
    COUNT(*) as total,
    COUNT(*) FILTER (
        WHERE order_date::DATE != created_at::DATE
    ) as different_date
FROM orders;

-- 7. Comprehensive date validation report
WITH date_validation AS (
    SELECT
        'orders' as table_name,
        COUNT(*) FILTER (WHERE order_date > CURRENT_DATE) as future_dates,
        COUNT(*) FILTER (WHERE order_date < '2020-01-01') as very_old,
        COUNT(*) FILTER (WHERE order_date IS NULL) as null_dates
    FROM orders
    UNION ALL
    SELECT
        'customers',
        0,  -- No future dates expected for birth dates
        COUNT(*) FILTER (WHERE birth_date < '1950-01-01'),
        COUNT(*) FILTER (WHERE birth_date IS NULL)
    FROM customers
)
SELECT
    table_name,
    future_dates,
    very_old,
    null_dates,
    CASE
        WHEN future_dates > 0 OR very_old > 0 OR null_dates > 0
            THEN 'FAIL'
        ELSE 'PASS'
    END as validation_status
FROM date_validation;

The Cat Analogy

Cat birth dates need validation:

Whiskers: Born "2025-01-15" → Future date! ERROR
Mochi: Born "1850-03-20" → 175 years old? ERROR
Luna: Born "Feb 30, 2024" → Feb doesn't have 30 days! ERROR
Shadow: Born "" → Missing! ERROR

Validations to run:

  • Not in the future (cats born tomorrow don't exist)
  • Not impossibly old (175-year-old cats are mythical)
  • Actually exists (February 30th is not a real date)
  • Not missing (birth date is required)

Common Pitfalls

Assuming Date Formats

-- WRONG: Assuming standard format
SELECT *
FROM events
WHERE event_date >= '2024-01-01';  -- May fail if format is different

-- RIGHT: Use explicit casting
SELECT *
FROM events
WHERE event_date::DATE >= '2024-01-01'::DATE;

Ignoring Timezone

-- Same order logged at different times appears on different days
SELECT
    order_id,
    created_at,
    created_at AT TIME ZONE 'UTC' as utc_date
FROM orders;
-- Always normalize timezones for reporting

Exercises

Exercise 1

Write a query to find orders with dates in the future or before 2020.

Exercise 2

Write a query to check if any orders occurred before the customer registered.

Exercise 3

What date validation would you add for a birth_date field?

Exercise 4

Write a query to find records with missing or NULL dates.

Exercise 5

Why might the same event have different dates depending on timezone?


Key Takeaways

  • Check for future dates (likely errors)
  • Verify dates are within reasonable ranges
  • Handle timezone inconsistencies
  • Use proper date casting, don't assume formats
  • Validate dates are real (Feb 30 doesn't exist)
  • Create validation reports for audit trails