Outliers Meaning
Outliers Meaning
Technical Explanation
Outliers are data points significantly different from other observations. They can indicate data entry errors, rare events worth investigating, or genuinely unusual behavior. The key is determining whether an outlier is noise or signal.
Types of Outliers
| Type | Description | Action |
|---|---|---|
| Measurement error | Data recorded incorrectly | Fix or remove |
| Sampling error | Data from wrong population | Filter |
| Natural variation | Genuinely rare events | Investigate |
| Intentional | Fraud, gaming | Flag and investigate |
Detection Methods
- IQR (Interquartile Range): values < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
- Standard deviation: values > μ ± 3σ
- Domain knowledge: values outside expected range
Code Examples
Using the CatCafe dataset:
-- Table: orders
-- id, customer_id, cat_id, order_date, total_amount, status
-- Step 1: Identify potential outliers using IQR
WITH quartiles AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) as q1,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY total_amount) as q2_median,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) as q3
FROM orders
WHERE status = 'completed'
),
stats AS (
SELECT
q1, q3,
q3 - q1 as iqr,
q1 - 1.5 * (q3 - q1) as lower_bound,
q3 + 1.5 * (q3 - q1) as upper_bound
FROM quartiles
)
SELECT
id,
total_amount,
CASE
WHEN total_amount < stats.lower_bound THEN 'Low Outlier'
WHEN total_amount > stats.upper_bound THEN 'High Outlier'
ELSE 'Normal'
END as outlier_status
FROM orders, stats
WHERE status = 'completed'
ORDER BY total_amount DESC;
-- Step 2: Using standard deviation
WITH stats AS (
SELECT
AVG(total_amount) as mean,
STDDEV(total_amount) as stddev
FROM orders
WHERE status = 'completed'
)
SELECT
id,
total_amount,
(total_amount - stats.mean) / NULLIF(stats.stddev, 0) as z_score,
CASE
WHEN ABS((total_amount - stats.mean) / NULLIF(stats.stddev, 0)) > 3
THEN 'Outlier (|z| > 3)'
WHEN ABS((total_amount - stats.mean) / NULLIF(stats.stddev, 0)) > 2
THEN 'Potential Outlier (|z| > 2)'
ELSE 'Normal'
END as status
FROM orders, stats
WHERE status = 'completed'
ORDER BY ABS((total_amount - stats.mean) / NULLIF(stats.stddev, 0)) DESC;
-- Step 3: Investigate outliers - are they errors or real?
SELECT
o.*,
c.name as customer_name,
c.segment as customer_segment
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.total_amount > 500 -- Suspiciously high
ORDER BY o.total_amount DESC;
-- Step 4: Example of valid extreme values (VIP customers)
SELECT
o.id,
o.total_amount,
c.name,
c.segment,
o.order_date,
COUNT(*) OVER (PARTITION BY c.customer_id) as customer_order_count
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'completed'
AND o.total_amount > 200
ORDER BY o.total_amount DESC;
-- Many high-value orders are from VIP customers - legitimate outliers!
-- Step 5: Outlier flagging with context
WITH stats AS (
SELECT
AVG(total_amount) as mean,
STDDEV(total_amount) as stddev,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) as q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) as q3
FROM orders
WHERE status = 'completed'
)
SELECT
o.id,
o.total_amount,
CASE
-- Genuinely unusual
WHEN total_amount > 1000 THEN 'Investigate: Extreme value'
-- Potential error (unusually high for segment)
WHEN total_amount > stats.q3 + 3 * (stats.q3 - stats.q1) THEN 'Flag: Possible error'
-- Normal high value
WHEN total_amount > stats.q3 + 1.5 * (stats.q3 - stats.q1) THEN 'High value: Review'
ELSE 'Normal'
END as analysis
FROM orders o, stats
WHERE o.status = 'completed'
ORDER BY o.total_amount DESC;
The Cat Analogy
Outliers in cat weights:
Normal cats: 8-12 lbs
Outlier: 25 lbs → Could be:
- Measurement error (scale broken)
- Actually a very large breed (Maine Coon)
- Medical condition (obesity)
- Data entry error (meant 2.5 lbs)
The 25 lb cat isn't automatically bad data — it might be a legitimate Maine Coon! But it warrants investigation.
Why not just remove outliers?
If you remove all "too large" cats:
- You miss a real health crisis (obesity epidemic)
- You miss legitimate breed differences
- You hide important signal
Common Pitfalls
Automatically Removing Outliers
-- WRONG: Automatically deleting outliers
DELETE FROM orders
WHERE total_amount > (
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_amount)
FROM orders
);
-- This destroys legitimate high-value orders!
-- RIGHT: Flag and investigate
SELECT *,
CASE WHEN total_amount > 1000 THEN 'Investigate' ELSE 'Normal' END
FROM orders;
Ignoring Natural Variation
-- Some "outliers" are just your best customers
SELECT
c.name,
SUM(o.total_amount) as lifetime_value,
-- Not an outlier to remove, a VIP to celebrate!
CASE WHEN SUM(o.total_amount) > 10000 THEN 'VIP' ELSE 'Standard' END
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name;
Exercises
Exercise 1
Write a query to identify outliers in order amounts using the IQR method.
Exercise 2
What is the difference between an outlier caused by a data error vs a genuine extreme value?
Exercise 3
Write a query to find orders that are outliers compared to the customer's average, not global averages.
Exercise 4
Why might you want to include outliers in some analyses but exclude them in others?
Exercise 5
Write a query to flag orders that are outliers for their day of week and time of day.
Key Takeaways
- Outliers can be errors, natural variation, or genuine signal
- Use IQR or standard deviation to detect outliers
- Investigate before removing — context matters
- Some outliers (VIP customers) are valuable, not problems
- Domain knowledge helps interpret outliers correctly
- Outlier detection should be documented and reproducible