Home/Data Analyst/SQL Fundamentals

COUNT vs COUNT DISTINCT

COUNT vs COUNT DISTINCT

Technical Explanation

Both COUNT(*) and COUNT(column) count rows, but they handle NULL values differently. COUNT(DISTINCT ...) adds another layer by counting only unique values.

The Three COUNT Variations

Function Counts NULLs Included
COUNT(*) Number of rows Yes (all rows)
COUNT(column) Non-NULL values in column No (ignores NULL)
COUNT(DISTINCT column) Unique non-NULL values No

Key Differences

-- COUNT(*): Counts every row
SELECT COUNT(*) FROM orders;  -- Total number of orders

-- COUNT(column): Counts only non-NULL values
SELECT COUNT(cat_id) FROM orders;  -- Orders where cat_id is not NULL

-- COUNT(DISTINCT column): Unique non-NULL values
SELECT COUNT(DISTINCT customer_id) FROM orders;  -- Unique customers who ordered

Why the Difference Matters

-- If orders has 100 rows:
-- - 10 have cat_id = NULL
-- - 5 have cat_id = NULL

SELECT
    COUNT(*),                      -- 100: all rows
    COUNT(cat_id),                 -- 90: non-NULL cat_ids
    COUNT(DISTINCT cat_id)         -- 15: unique non-NULL cat_ids
FROM orders;

Code Examples

Using the CatCafe dataset:

-- Table: orders
-- id, customer_id, cat_id, order_date, total_amount, status

-- COUNT(*): Total number of orders
SELECT COUNT(*) as total_orders
FROM orders;
-- Returns: Total rows regardless of NULLs

-- COUNT(DISTINCT customer_id): Unique customers who placed orders
SELECT COUNT(DISTINCT customer_id) as unique_customers
FROM orders;
-- Count of individual customers, not total orders

-- Finding customers who ordered multiple items
SELECT
    customer_id,
    COUNT(*) as order_count,
    COUNT(DISTINCT cat_id) as unique_cats_ordered
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > COUNT(DISTINCT cat_id);
-- Customers who ordered the same cat more than once

-- Orders per cat, showing adoption stats
SELECT
    cats.name,
    cats.breed,
    COUNT(DISTINCT orders.customer_id) as unique_buyers,
    COUNT(orders.id) as total_orders
FROM cats
LEFT JOIN orders
    ON cats.id = orders.cat_id
GROUP BY cats.id, cats.name, cats.breed
ORDER BY total_orders DESC;

-- Common pattern: Conversion rate calculation
SELECT
    COUNT(DISTINCT customer_id) as total_visitors,
    COUNT(DISTINCT CASE WHEN status = 'completed' THEN customer_id END) as converters,
    COUNT(DISTINCT CASE WHEN status = 'completed' THEN customer_id END) * 100.0
        / NULLIF(COUNT(DISTINCT customer_id), 0) as conversion_rate
FROM orders;

The Cat Analogy

You're counting cats at a cafe:

COUNT(*) = Total cats in the building

  • Cats in the play area, cats napping, cats hiding under furniture
  • "We have 20 cat encounters today"

COUNT(breed) = Cats with known breeds

  • Excludes cats where breed is unknown/not recorded
  • "We can identify the breed of 17 cats"

COUNT(DISTINCT breed) = Unique breeds present

  • Each breed counted only once
  • "We have 5 different breeds: Siamese, Persian, Maine Coon, Tabby, Sphynx"

Example scenario:

  • You saw 20 cat entries in your log
  • 3 entries had no cat (just people visiting)
  • 2 cats had no recorded breed
  • The 17 identifiable cats were: Siamese, Persian, Siamese, Maine Coon, Tabby, Siamese, Tabby, Tabby...
COUNT(*)              = 20 (total log entries)
COUNT(breed)          = 17 (entries with a cat)
COUNT(DISTINCT breed) = 4  (Siamese, Persian, Maine Coon, Tabby)

Common Mistakes

Confusing COUNT(*) and COUNT(column)

-- If total_spent has NULLs for some customers:
SELECT
    COUNT(*),           -- All customers
    COUNT(total_spent) -- Only customers with non-NULL spending
FROM customers;
-- These will differ if there are NULL values

Forgetting DISTINCT When Needed

-- "How many cats has each customer interacted with?"
-- WRONG: Counts total interactions, not unique cats
SELECT
    customer_id,
    COUNT(cat_id) as cat_count
FROM orders
GROUP BY customer_id;

-- RIGHT: Count unique cats
SELECT
    customer_id,
    COUNT(DISTINCT cat_id) as unique_cat_count
FROM orders
GROUP BY customer_id;

NULLIF to Avoid Division by Zero

-- Calculate percentage of orders with cats
SELECT
    COUNT(*) as total_orders,
    COUNT(cat_id) as orders_with_cats,
    COUNT(cat_id) * 100.0 / NULLIF(COUNT(*), 0) as pct_with_cats
FROM orders;
-- NULLIF(COUNT(*), 0) returns NULL if COUNT(*) is 0, avoiding division by zero

Exercises

Exercise 1

Given these results, explain the difference:

SELECT
    COUNT(*) as a,
    COUNT(DISTINCT status) as b,
    COUNT(status) as c
FROM orders;
-- Assume: 100 rows, 3 distinct statuses, 5 rows with NULL status

Exercise 2

Write a query to find the number of unique customers who:

  • Placed more than 5 orders
  • Ordered more than 3 different cats

Exercise 3

Write a query showing "cat popularity" using COUNT(DISTINCT customer_id) per cat, not COUNT(*).

Exercise 4

Explain why these give different results:

SELECT COUNT(DISTINCT city) FROM customers;
SELECT COUNT(DISTINCT customer_id) FROM customers WHERE city IS NOT NULL;

Exercise 5

Write a query to find what percentage of customers have made more than one purchase.


Key Takeaways

  • COUNT(*) counts all rows including NULLs
  • COUNT(column) counts only non-NULL values in that column
  • COUNT(DISTINCT column) counts unique non-NULL values
  • Use COUNT(DISTINCT ...) when you need unique counts (customers, products, etc.)
  • Use NULLIF to avoid division by zero in percentage calculations
  • Be intentional: COUNT(*) vs COUNT(column) can give very different results