July 27, 2025 𝄪 Engineering 𝄪 6 mins to read

On Database Indexing

Last Thursday at 2:47 AM, I watched my perfectly reasonable query scan through 2.3 million rows to find exactly 12 records.

Twelve.

The query was simple: SELECT * FROM orders WHERE status = 'pending' AND created_at >= '2024-01-01'. Nothing fancy. But there I was, staring at PostgreSQL doing the database equivalent of reading every book in a library to find the one with "penguin" in the title.

That's when it hit me. I'd been thinking about indexes all wrong.

Most of us learn indexing the way we learn driving: primary key here, index the frequently queried columns there, done. But real indexing? The kind that actually works? That's different. It's about understanding the dance between query planners and B-trees.

The Brutal Math of Selectivity

Here's what nobody tells you: your database is constantly doing math, calculating whether to use your index or just ignore it.

Rude, right?

Take a users table with a boolean is_pro column. If 95% of your million users are freeloaders (been there), then WHERE is_pro = false returns 950,000 rows. Selectivity of 0.95.

Terrible.

But WHERE is_pro = true? Now you're looking at 50,000 rows. Selectivity of 0.05. Same index, same column. But now your query planner can eliminate 95% of the data before even thinking about heap lookups.

The same index can be your best friend or completely useless.

Actually Measuring This Thing

How do we figure out if our index is any good? Math time.

-- Check cardinality (distinct values)
SELECT COUNT(DISTINCT birthday) FROM users;

-- Calculate selectivity
SELECT
  COUNT(DISTINCT birthday)::decimal / COUNT(*) as selectivity
FROM users;

High selectivity here (a ratio close to 1.0) means the column is nearly unique, and an index on it cuts through data like a hot knife through butter. Your primary key? Perfect selectivity. Every lookup returns exactly one row. Note the flip from the earlier is_pro math: for a single predicate we measured the fraction of rows it returns, where smaller is better; for a whole column we measure distinct values over total rows, where closer to 1.0 is better.

But here's where it gets weird: sometimes terrible global selectivity hides amazing local selectivity. Let me show you what I mean.
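
Here's one way to put a number on that, per value rather than globally. A quick sketch against the same hypothetical users table and is_pro column:

-- Per-value selectivity: what fraction of the table does each value match?
SELECT
  is_pro,
  COUNT(*)::decimal / (SELECT COUNT(*) FROM users) AS fraction_of_table
FROM users
GROUP BY is_pro;

A value that matches a small fraction of the table is a great index target; a value that matches most of it is not.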

When Low Cardinality Works (Plot Twist!)

Database indexing gets counterintuitive here. Boolean columns—those terrible two-value wasteland columns—can actually be amazing indexes. But only if your data is dramatically skewed.

-- This might use an index despite being boolean
SELECT * FROM users WHERE is_pro = true;  -- Returns 5% of rows

-- This probably will not
SELECT * FROM users WHERE is_pro = false; -- Returns 95% of rows

Your query planner runs a cost-benefit analysis: "Do I traverse this index and hop to the heap? Or just read the whole table since I'm touching most of it anyway?"

When the math says "you're returning 95% of the table," sequential scan wins. It's not lazy—it's smart.
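
You don't have to take the planner's word for it; run EXPLAIN on both variants and see which plan it actually picks. This assumes an index on is_pro exists:

-- Assumes something like: CREATE INDEX ON users (is_pro);
EXPLAIN SELECT * FROM users WHERE is_pro = true;   -- likely an index or bitmap scan
EXPLAIN SELECT * FROM users WHERE is_pro = false;  -- likely a sequential scan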

So far we've been talking about single-column indexes. But what happens when your queries need to filter on multiple columns at once? That's where things get really interesting.

The Art of Composite Indexes

Single-column indexes are like using a screwdriver for everything. Composite indexes give you the full toolbox.

Think of them as multi-layered filters. First you filter by column A, then within those results you filter by column B, then by column C. But the order matters. A lot.

The leftmost prefix rule is simple and ruthlessly unforgiving:

An index on (first_name, last_name, birthday) supports:

  • WHERE first_name = ?
  • WHERE first_name = ? AND last_name = ?
  • WHERE first_name = ? AND last_name = ? AND birthday = ?

But not WHERE last_name = ? alone. ❌

Why? B-trees are ordered structures. Like a phone book sorted by last name, then first name. You can't efficiently look up someone by their first name if the book is sorted by last name first.
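
Here's a quick sketch of what does and doesn't line up with that prefix (the index name is just illustrative):

CREATE INDEX idx_users_name_birthday
  ON users (first_name, last_name, birthday);

-- Uses the index: the leftmost prefix is present
SELECT * FROM users WHERE first_name = 'John' AND last_name = 'Doe';

-- Can't use it efficiently: the leading column is missing
SELECT * FROM users WHERE last_name = 'Doe';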

The Sacred Order

Most people mess up here. They think alphabetically, or by importance, or by whatever cosmic force guides decisions at 3 PM on a Friday.

Wrong.

Column order is about query patterns. Here's what actually matters:

  1. Equality predicates first - Your WHERE column = ? conditions
  2. Range predicates next - Your WHERE column > ? and BETWEEN friends
  3. ORDER BY columns last - Because avoiding sorts is beautiful

But here's a crucial detail: within equality predicates, put your lowest cardinality columns first. If you always filter by tenant_id (low cardinality) and user_id (high cardinality), put tenant_id first. This lets the index eliminate huge chunks of data early, even though user_id might be more selective per query.

Say you have this query haunting your slow query log:

SELECT * FROM events
WHERE tenant_id = ? AND status = 'active'
ORDER BY created_at DESC
LIMIT 50;

The optimal index? (tenant_id, status, created_at DESC)

We pinpoint exact tenant, exact status, and get pre-sorted results. No additional sorting. No extra CPU cycles. Just efficient data retrieval.
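
In DDL that's a single statement; the index name here is just illustrative:

CREATE INDEX idx_events_tenant_status_created
  ON events (tenant_id, status, created_at DESC);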

The Great Debate: One Super Index vs Multiple Clever Ones

PostgreSQL has this neat trick where it combines multiple single-column indexes:

-- With separate indexes on first_name and last_name
SELECT * FROM users
WHERE first_name = 'John' AND last_name = 'Doe';

The planner scans the first_name index, gets row addresses. Scans the last_name index, gets more addresses. Does a bitmap intersection—finds the overlap—then fetches those rows.

Clever? Absolutely. Optimal? Usually not.

A composite index (first_name, last_name) typically outperforms this approach. But sometimes you want multiple indexes for OR conditions or unpredictable query patterns.
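
To see both strategies side by side, here's a rough sketch of the two setups:

-- Two single-column indexes: the planner may bitmap-AND them
CREATE INDEX ON users (first_name);
CREATE INDEX ON users (last_name);

-- One composite index: usually the better plan for this AND query
CREATE INDEX ON users (first_name, last_name);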

Covering Indexes: The Holy Grail

Sometimes the database gods smile and you achieve an "index-only scan."

-- Index: (first_name, last_name)
SELECT first_name, last_name FROM users WHERE first_name = 'John';

Your query doesn't need that expensive round trip to the heap. Everything it needs is in the index. Like ordering takeout and having it appear without the delivery person leaving the restaurant.

But what if you need one more column that's not part of your filtering logic? PostgreSQL's INCLUDE clause lets you add hitchhiker columns—data that comes along but doesn't participate in indexing:

CREATE INDEX ON users (first_name, last_name) INCLUDE (email);

Now you can run SELECT first_name, last_name, email FROM users WHERE first_name = 'John' and get an index-only scan. Fast lookups on names, plus email available without any heap lookups.

But what if most of your data is noise? What if you only care about a small, specific subset of your table? There's an elegant solution for that too.

Partial Indexes: The Minimalist's Dream

What if you could create an index that ignores rows you don't care about? Like a VIP list for people who matter.

Partial indexes do exactly that—index only rows matching a condition. Smaller storage, faster maintenance, better selectivity. Only keep what you actually need.

Perfect for common patterns:

-- Only index active users (forget the deleted ones)
CREATE INDEX ON users (email) WHERE deleted_at IS NULL;

-- Only index recent orders (who cares about 2019?)
CREATE INDEX ON orders (status, created_at DESC)
WHERE created_at >= '2024-01-01';

The crucial part: your queries must include predicates that match the index condition. The planner isn't psychic—if your query doesn't filter for active users, it can't use your partial index.

Like having a VIP room key but forgetting to show it at the door.
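
Concretely, with the email index above, this is the difference (a sketch):

-- Can use the partial index: the filter matches the index condition
SELECT * FROM users WHERE email = 'a@example.com' AND deleted_at IS NULL;

-- Can't use it: nothing tells the planner the row is in the indexed subset
SELECT * FROM users WHERE email = 'a@example.com';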

Unique Constraints on Subsets

Partial indexes enable unique constraints over row subsets:

-- Allow multiple deleted accounts with same email
-- but only one active account per email
CREATE UNIQUE INDEX ON users (email) WHERE deleted_at IS NULL;

This supports soft-delete patterns while maintaining uniqueness for active records.
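
Roughly how that behaves, with hypothetical rows and assuming the unique partial index above:

INSERT INTO users (email) VALUES ('a@example.com');                     -- ok: first active row
INSERT INTO users (email, deleted_at) VALUES ('a@example.com', now());  -- ok: soft-deleted, outside the index
INSERT INTO users (email) VALUES ('a@example.com');                     -- fails: an active row already exists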

Now, there's one more piece of the puzzle. Even with all this selectivity and partial indexing magic, you still need to think about how your data comes back. Order matters—literally.

Index Ordering and Sorting

Index key ordering affects performance, especially for ORDER BY queries.

Single columns are flexible—PostgreSQL can scan B-trees forward or backward:

-- Both use the same index
SELECT * FROM posts ORDER BY created_at ASC;
SELECT * FROM posts ORDER BY created_at DESC;

Multi-column ordering requires more care:

-- Index: (tenant_id, created_at DESC)
SELECT * FROM events
WHERE tenant_id = ?
ORDER BY created_at DESC; -- Fast!

SELECT * FROM events
WHERE tenant_id = ?
ORDER BY created_at ASC;  -- Still fast (backward scan)

-- But mixed directions need explicit handling
SELECT * FROM events
ORDER BY tenant_id ASC, created_at DESC; -- Index scan

SELECT * FROM events
ORDER BY tenant_id ASC, created_at ASC;   -- Needs a different index or a sort
-- (Fully reversing both, tenant_id DESC + created_at ASC, still works as a backward scan.)

NULL Handling

Control NULL placement to match query patterns:

CREATE INDEX ON users (created_at DESC NULLS LAST);

-- Must match in queries for index-order scan
SELECT * FROM users ORDER BY created_at DESC NULLS LAST LIMIT 10;

Let's put all this theory into practice. Here are some real-world patterns you'll encounter and how to handle them.

Practical Patterns

Feed Pagination

-- Query
SELECT * FROM posts
WHERE user_id = ?
ORDER BY created_at DESC
LIMIT 20;

-- Index
CREATE INDEX ON posts (user_id, created_at DESC);

Status-Filtered Recent Items

-- Query
SELECT * FROM orders
WHERE status IN ('pending', 'processing')
  AND created_at >= NOW() - INTERVAL '7 days'
ORDER BY created_at DESC;

-- Index (partial for efficiency)
CREATE INDEX ON orders (status, created_at DESC)
WHERE status IN ('pending', 'processing', 'shipped');

Case-Insensitive Lookups

-- Expression index
CREATE INDEX ON users ((lower(email)));

-- Query (must match expression)
SELECT * FROM users WHERE lower(email) = lower(?);

All this theory is useless if you can't tell whether your indexes are actually working. Here's how to validate your assumptions.

Diagnostics and Validation

Always validate with EXPLAIN (ANALYZE, BUFFERS):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE tenant_id = 123 AND created_at >= '2024-01-01';

Look for:

  • Estimated vs actual rows - Large differences suggest stale statistics
  • Index usage - Seq Scan when you expected Index Scan
  • Buffer hits - High shared_hit indicates good cache utilization
  • Sort operations - Often eliminable with proper index ordering
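
If estimated and actual row counts are wildly different (the first point above), the statistics are usually stale; refreshing them is cheap:

-- Refresh planner statistics for one table
ANALYZE events;

-- Check when statistics were last gathered
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'events';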

But remember—indexes aren't free. They come with costs that you need to understand.

Trade-offs and Maintenance

Every index costs write performance. Each INSERT, UPDATE, and DELETE maintains all relevant indexes.

Monitor index usage:

SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan;

Unused indexes (low idx_scan) are pure overhead.
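
Once you're confident an index really is dead weight (check replicas too, since usage statistics are tracked per instance), dropping it without blocking writes looks like this; the index name is just illustrative:

-- Can't run inside a transaction block
DROP INDEX CONCURRENTLY IF EXISTS users_email_idx;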

So how do you put this all together? Here's a simple framework for making indexing decisions.

Decision Framework

For each frequent query:

  1. Identify predicates - equality vs range conditions
  2. Order composite columns - equality first, then ranges, then ORDER BY
  3. Consider partial indexes - for hot subsets with stable predicates
  4. Match sort requirements - especially mixed ASC/DESC directions
  5. Validate with EXPLAIN - confirm the planner uses your index effectively
  6. Monitor usage - remove unused indexes

The goal isn't more indexes—it's the right indexes for your workload. A few well-designed composite and partial indexes typically outperform many single-column ones.

Database performance is about making the planner's job easy. Give it selective indexes that match your query patterns, and it will reward you with consistently fast queries regardless of data size.

Back to that 2:47 AM query: it now finds those 12 records in 0.03 milliseconds instead of scanning 2.3 million rows. The partial index, the composite ordering, the selectivity—it all clicked.
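
For the curious, the fix was exactly the shape covered above: a partial index over the hot subset, ordered to match the query. A sketch, not the exact DDL from that night:

CREATE INDEX ON orders (created_at DESC)
WHERE status = 'pending';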

Sometimes the best part isn't writing new code. It's watching your database work efficiently because you finally understood the dance between data structures and query planners.
