Source vs Target Reconciliation
Source vs Target Reconciliation
Technical Explanation
Reconciliation verifies that data loaded from source systems matches the original data after ETL (Extract, Transform, Load) processes. It's a critical data quality check to ensure no data was lost, duplicated, or corrupted.
Why Reconciliation Matters
Source: 1,000 rows
↓
ETL Process
↓
Target: 1,000 rows? Or 999? Or 1,001?
Without reconciliation: You don't know!
What to Reconcile
| Metric | Check |
|---|---|
| Row counts | Source rows = Target rows |
| Column sums | SUM(revenue) matches |
| Key counts | COUNT(DISTINCT customer_id) matches |
| Min/Max | Date ranges preserved |
Code Examples
Using the CatCafe dataset:
-- Simulating source and target tables
-- Source table (raw data before ETL)
-- CREATE TABLE orders_source AS SELECT * FROM orders;
-- Target table (after ETL transformation)
-- CREATE TABLE orders_target AS SELECT * FROM orders_etl;
-- 1. Row count reconciliation
WITH source_counts AS (
SELECT COUNT(*) as source_row_count FROM orders_source
),
target_counts AS (
SELECT COUNT(*) as target_row_count FROM orders_target
)
SELECT
source_counts.source_row_count,
target_counts.target_row_count,
source_counts.source_row_count - target_counts.target_row_count as row_diff,
CASE
WHEN source_counts.source_row_count = target_counts.target_row_count
THEN 'PASS: Row count matches'
ELSE 'FAIL: Row count mismatch'
END as status
FROM source_counts, target_counts;
-- 2. Sum reconciliation (critical for financial data)
WITH source_sums AS (
SELECT
SUM(total_amount) as total_revenue,
SUM(CASE WHEN status = 'completed' THEN total_amount ELSE 0 END) as completed_revenue
FROM orders_source
),
target_sums AS (
SELECT
SUM(total_amount) as total_revenue,
SUM(CASE WHEN status = 'completed' THEN total_amount ELSE 0 END) as completed_revenue
FROM orders_target
)
SELECT
source_sums.total_revenue as source_revenue,
target_sums.total_revenue as target_revenue,
source_sums.total_revenue - target_sums.total_revenue as revenue_diff,
CASE
WHEN source_sums.total_revenue = target_sums.total_revenue
THEN 'PASS'
ELSE 'FAIL'
END as status
FROM source_sums, target_sums;
-- 3. Distinct count reconciliation (for keys)
WITH source_counts AS (
SELECT
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(DISTINCT status) as unique_statuses
FROM orders_source
),
target_counts AS (
SELECT
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(DISTINCT status) as unique_statuses
FROM orders_target
)
SELECT
source_counts.unique_customers as source_customers,
target_counts.unique_customers as target_customers,
source_counts.unique_statuses as source_statuses,
target_counts.unique_statuses as target_statuses,
CASE
WHEN source_counts.unique_customers = target_counts.unique_customers
THEN 'PASS: Customer count matches'
ELSE 'FAIL: Customer count mismatch'
END as customer_status,
CASE
WHEN source_counts.unique_statuses = target_counts.unique_statuses
THEN 'PASS: Status values match'
ELSE 'FAIL: Status values mismatch'
END as status_status
FROM source_counts, target_counts;
-- 4. Min/Max reconciliation (data range check)
WITH source_stats AS (
SELECT
MIN(order_date) as min_date,
MAX(order_date) as max_date,
MIN(total_amount) as min_amount,
MAX(total_amount) as max_amount
FROM orders_source
),
target_stats AS (
SELECT
MIN(order_date) as min_date,
MAX(order_date) as max_date,
MIN(total_amount) as min_amount,
MAX(total_amount) as max_amount
FROM orders_target
)
SELECT
source_stats.min_date as source_min_date,
target_stats.min_date as target_min_date,
source_stats.max_date as source_max_date,
target_stats.max_date as target_max_date,
CASE
WHEN source_stats.min_date = target_stats.min_date
AND source_stats.max_date = target_stats.max_date
THEN 'PASS: Date ranges match'
ELSE 'FAIL: Date range mismatch'
END as date_status
FROM source_stats, target_stats;
-- 5. Missing key reconciliation
SELECT
COUNT(*) as missing_in_target
FROM orders_source s
WHERE NOT EXISTS (
SELECT 1 FROM orders_target t
WHERE t.id = s.id
);
-- Rows in source but not in target (lost in ETL)
SELECT
COUNT(*) as extra_in_target
FROM orders_target t
WHERE NOT EXISTS (
SELECT 1 FROM orders_source s
WHERE s.id = t.id
);
-- Rows in target but not in source (added incorrectly)
-- 6. Comprehensive reconciliation report
WITH reconciliation AS (
SELECT
'Row Count' as check_type,
(SELECT COUNT(*) FROM orders_source) as source_value,
(SELECT COUNT(*) FROM orders_target) as target_value,
(SELECT COUNT(*) FROM orders_source) -
(SELECT COUNT(*) FROM orders_target) as difference,
CASE
WHEN (SELECT COUNT(*) FROM orders_source) =
(SELECT COUNT(*) FROM orders_target)
THEN 'PASS' ELSE 'FAIL'
END as status
UNION ALL
SELECT
'Revenue Sum',
(SELECT SUM(total_amount) FROM orders_source),
(SELECT SUM(total_amount) FROM orders_target),
(SELECT SUM(total_amount) FROM orders_source) -
(SELECT SUM(total_amount) FROM orders_target),
CASE
WHEN (SELECT SUM(total_amount) FROM orders_source) =
(SELECT SUM(total_amount) FROM orders_target)
THEN 'PASS' ELSE 'FAIL'
END
)
SELECT * FROM reconciliation;
The Cat Analogy
Shipping cats from one shelter to another:
Source Shelter:
- 50 cats in cages
- All healthy
- All microchipped
- Total value: $5,000
↓ Transport Truck
↓ (ETL Process)
↓
Target Shelter:
- 49 cats in cages (one lost!)
- 48 healthy, 1 sick (condition changed!)
- 50 microchipped (microchip doesn't match!)
- Total value: $4,900 (money lost!)
Without checking: "Cats delivered, all good!"
With reconciliation: "Missing 1 cat, 1 injured, 1 chip mismatch!"
Common Pitfalls
Only Checking Row Counts
-- WRONG: "1 million rows = 1 million rows, we're good!"
SELECT
COUNT(*) = COUNT(*)
FROM source, target;
-- Data could be corrupted, sums could differ, values changed
-- RIGHT: Multiple reconciliation checks
-- Row count, sums, min/max, distinct counts
Exercises
Exercise 1
Write a reconciliation query to verify row counts and revenue sums between source and target.
Exercise 2
Write a query to find rows that exist in source but not in target.
Exercise 3
What metrics beyond row count should you reconcile?
Exercise 4
Write a reconciliation query for customer_id uniqueness between tables.
Exercise 5
How would you handle reconciliation when the ETL intentionally filters data?
Key Takeaways
- Reconciliation ensures data integrity through ETL processes
- Check: row counts, column sums, distinct counts, min/max ranges
- Use tolerance for floating-point comparisons
- Run reconciliation after every ETL run
- Investigate all discrepancies immediately
- Document reconciliation results with timestamps