Home/Data Analyst/Data Cleaning & Validation

Source vs Target Reconciliation

Source vs Target Reconciliation

Technical Explanation

Reconciliation verifies that data loaded from source systems matches the original data after ETL (Extract, Transform, Load) processes. It's a critical data quality check to ensure no data was lost, duplicated, or corrupted.

Why Reconciliation Matters

Source: 1,000 rows
↓
ETL Process
↓
Target: 1,000 rows? Or 999? Or 1,001?

Without reconciliation: You don't know!

What to Reconcile

Metric Check
Row counts Source rows = Target rows
Column sums SUM(revenue) matches
Key counts COUNT(DISTINCT customer_id) matches
Min/Max Date ranges preserved

Code Examples

Using the CatCafe dataset:

-- Simulating source and target tables

-- Source table (raw data before ETL)
-- CREATE TABLE orders_source AS SELECT * FROM orders;

-- Target table (after ETL transformation)
-- CREATE TABLE orders_target AS SELECT * FROM orders_etl;

-- 1. Row count reconciliation
WITH source_counts AS (
    SELECT COUNT(*) as source_row_count FROM orders_source
),
target_counts AS (
    SELECT COUNT(*) as target_row_count FROM orders_target
)
SELECT
    source_counts.source_row_count,
    target_counts.target_row_count,
    source_counts.source_row_count - target_counts.target_row_count as row_diff,
    CASE
        WHEN source_counts.source_row_count = target_counts.target_row_count
            THEN 'PASS: Row count matches'
        ELSE 'FAIL: Row count mismatch'
    END as status
FROM source_counts, target_counts;

-- 2. Sum reconciliation (critical for financial data)
WITH source_sums AS (
    SELECT
        SUM(total_amount) as total_revenue,
        SUM(CASE WHEN status = 'completed' THEN total_amount ELSE 0 END) as completed_revenue
    FROM orders_source
),
target_sums AS (
    SELECT
        SUM(total_amount) as total_revenue,
        SUM(CASE WHEN status = 'completed' THEN total_amount ELSE 0 END) as completed_revenue
    FROM orders_target
)
SELECT
    source_sums.total_revenue as source_revenue,
    target_sums.total_revenue as target_revenue,
    source_sums.total_revenue - target_sums.total_revenue as revenue_diff,
    CASE
        WHEN source_sums.total_revenue = target_sums.total_revenue
            THEN 'PASS'
        ELSE 'FAIL'
    END as status
FROM source_sums, target_sums;

-- 3. Distinct count reconciliation (for keys)
WITH source_counts AS (
    SELECT
        COUNT(DISTINCT customer_id) as unique_customers,
        COUNT(DISTINCT status) as unique_statuses
    FROM orders_source
),
target_counts AS (
    SELECT
        COUNT(DISTINCT customer_id) as unique_customers,
        COUNT(DISTINCT status) as unique_statuses
    FROM orders_target
)
SELECT
    source_counts.unique_customers as source_customers,
    target_counts.unique_customers as target_customers,
    source_counts.unique_statuses as source_statuses,
    target_counts.unique_statuses as target_statuses,
    CASE
        WHEN source_counts.unique_customers = target_counts.unique_customers
            THEN 'PASS: Customer count matches'
        ELSE 'FAIL: Customer count mismatch'
    END as customer_status,
    CASE
        WHEN source_counts.unique_statuses = target_counts.unique_statuses
            THEN 'PASS: Status values match'
        ELSE 'FAIL: Status values mismatch'
    END as status_status
FROM source_counts, target_counts;

-- 4. Min/Max reconciliation (data range check)
WITH source_stats AS (
    SELECT
        MIN(order_date) as min_date,
        MAX(order_date) as max_date,
        MIN(total_amount) as min_amount,
        MAX(total_amount) as max_amount
    FROM orders_source
),
target_stats AS (
    SELECT
        MIN(order_date) as min_date,
        MAX(order_date) as max_date,
        MIN(total_amount) as min_amount,
        MAX(total_amount) as max_amount
    FROM orders_target
)
SELECT
    source_stats.min_date as source_min_date,
    target_stats.min_date as target_min_date,
    source_stats.max_date as source_max_date,
    target_stats.max_date as target_max_date,
    CASE
        WHEN source_stats.min_date = target_stats.min_date
         AND source_stats.max_date = target_stats.max_date
            THEN 'PASS: Date ranges match'
        ELSE 'FAIL: Date range mismatch'
    END as date_status
FROM source_stats, target_stats;

-- 5. Missing key reconciliation
SELECT
    COUNT(*) as missing_in_target
FROM orders_source s
WHERE NOT EXISTS (
    SELECT 1 FROM orders_target t
    WHERE t.id = s.id
);
-- Rows in source but not in target (lost in ETL)

SELECT
    COUNT(*) as extra_in_target
FROM orders_target t
WHERE NOT EXISTS (
    SELECT 1 FROM orders_source s
    WHERE s.id = t.id
);
-- Rows in target but not in source (added incorrectly)

-- 6. Comprehensive reconciliation report
WITH reconciliation AS (
    SELECT
        'Row Count' as check_type,
        (SELECT COUNT(*) FROM orders_source) as source_value,
        (SELECT COUNT(*) FROM orders_target) as target_value,
        (SELECT COUNT(*) FROM orders_source) -
        (SELECT COUNT(*) FROM orders_target) as difference,
        CASE
            WHEN (SELECT COUNT(*) FROM orders_source) =
                 (SELECT COUNT(*) FROM orders_target)
                THEN 'PASS' ELSE 'FAIL'
        END as status
    UNION ALL
    SELECT
        'Revenue Sum',
        (SELECT SUM(total_amount) FROM orders_source),
        (SELECT SUM(total_amount) FROM orders_target),
        (SELECT SUM(total_amount) FROM orders_source) -
        (SELECT SUM(total_amount) FROM orders_target),
        CASE
            WHEN (SELECT SUM(total_amount) FROM orders_source) =
                 (SELECT SUM(total_amount) FROM orders_target)
                THEN 'PASS' ELSE 'FAIL'
        END
)
SELECT * FROM reconciliation;

The Cat Analogy

Shipping cats from one shelter to another:

Source Shelter:
  - 50 cats in cages
  - All healthy
  - All microchipped
  - Total value: $5,000

         ↓ Transport Truck
         ↓ (ETL Process)
         ↓

Target Shelter:
  - 49 cats in cages (one lost!)
  - 48 healthy, 1 sick (condition changed!)
  - 50 microchipped (microchip doesn't match!)
  - Total value: $4,900 (money lost!)

Without checking: "Cats delivered, all good!"
With reconciliation: "Missing 1 cat, 1 injured, 1 chip mismatch!"

Common Pitfalls

Only Checking Row Counts

-- WRONG: "1 million rows = 1 million rows, we're good!"
SELECT
    COUNT(*) = COUNT(*)
FROM source, target;
-- Data could be corrupted, sums could differ, values changed

-- RIGHT: Multiple reconciliation checks
-- Row count, sums, min/max, distinct counts

Exercises

Exercise 1

Write a reconciliation query to verify row counts and revenue sums between source and target.

Exercise 2

Write a query to find rows that exist in source but not in target.

Exercise 3

What metrics beyond row count should you reconcile?

Exercise 4

Write a reconciliation query for customer_id uniqueness between tables.

Exercise 5

How would you handle reconciliation when the ETL intentionally filters data?


Key Takeaways

  • Reconciliation ensures data integrity through ETL processes
  • Check: row counts, column sums, distinct counts, min/max ranges
  • Use tolerance for floating-point comparisons
  • Run reconciliation after every ETL run
  • Investigate all discrepancies immediately
  • Document reconciliation results with timestamps