Link: https://www.youtube.com/watch?v=HubezKbFL7E
1. What is a Database Index?
- Primary Purpose: An index is a performance tool designed to make reading data faster. Since slow queries are the leading cause of poor application performance, indexing is the most effective way to optimize.
- The “Phone Book” Analogy: An index is like a phone book ordered by last name. Because the data is sorted, you can find a specific entry quickly without reading every page.
- The Power of Order: Searching ordered data is exponentially more efficient than searching unordered data. It allows the database to use algorithms similar to a binary search to locate rows.
2. The B-Tree Data Structure
Most database indexes use a B-Tree (Balanced Tree) structure to maintain order and ensure predictable performance.
Structural Components
- Root Node: The entry point for every search.
- Sub-trees: Nodes are organized logically; values less than the parent go to the left, and values greater or equal go to the right.
- Leaf Nodes: The bottom layer where the actual indexed values are stored.
- Doubly Linked List: Leaf nodes are connected to each other. This allows the database to perform “range scans” (e.g., finding all IDs between 10 and 20) by finding the first value and then simply following the links to the next ones.
What is actually inside an index?
An index does not copy the whole table. It contains:
- The Indexed Column(s): Only the specific data you chose to index.
- The Row ID (Pointer): A reference that tells the database exactly where the rest of the row’s data is located on the disk.
Key Benefits
- Logarithmic Scalability: As your data grows from 1,000 to 1,000,000 rows, the depth of the tree only increases slightly, keeping performance stable.
3. Trade-offs: The Cost of Indexing
- Read vs. Write: While indexes speed up reads, they slow down writes (
INSERT,UPDATE,DELETE). - Maintenance: Every time data changes, the database must also update the B-Tree and potentially “rebalance” it to keep the leaf nodes at the same level.
- Rule of Thumb: Use as many indexes as necessary to support your queries, but as few as possible to avoid bloating write times and storage.
4. Analyzing the Execution Plan (EXPLAIN)
To see if your index is working, prepend EXPLAIN to your SQL query. The type (Access Type) column tells you how the database is accessing the data:
| Access Type | Description | Performance |
|---|---|---|
| const / eq_ref | Finds a single value using a unique index/primary key. | Excellent (Stop optimizing here) |
| ref / range | Finds a starting point and scans a portion of the leaf nodes. | Good (Limits rows inspected) |
| index | Full Index Scan. Scans every leaf node in the index. | Poor (Better than table scan, but slow) |
| ALL | Full Table Scan. Reads every row and column from disk. | Worst (Avoid at all costs) |
5. Common Pitfalls and Solutions
Pitfall 1: Functions in WHERE Clauses
- Problem:
WHERE YEAR(created_at) = 2013prevents the index from being used because the database doesn’t know the result of the function until it runs it on every row. - Solution: Use raw ranges:
WHERE created_at BETWEEN '2013-01-01' AND '2013-12-31'.
Pitfall 2: Missing Data (The “Hidden” Table Hit)
- Problem: If you
SELECT totalbut your index only coverscreated_at, the database must jump back to the main table for every single row to find the “total” value. - Solution: Index-Only Scan. Add the requested column to the index. If the index contains all the data the query needs, the database never has to touch the main table.
Pitfall 3: Incorrect Column Order (Left-to-Right Rule)
- Problem: In a multi-column index
(A, B, C), you can search forAorA and B, but you cannot efficiently search forB and Calone. The index is sorted by the first column primarily. - Solution: Place the most selective columns (the ones that filter out the most data) first.
Pitfall 4: The Inequality “Wall”
- Problem: Once you use a range operator (
>,<,BETWEEN) on a column in a multi-column index, the database cannot use the subsequent columns in that index for filtering. - Solution: Always place columns used with equality (
=) first in your index, and the range column last.
6. The Developer’s Mandate
- Indexes are Query-Specific: You don’t “index a table”; you “index a query.” An index’s design depends entirely on the
WHERE,SELECT, andORDER BYclauses of a specific query. - Ownership: Developers—not just DBAs—must own indexing because only the developer understands the application’s data access patterns and which performance trade-offs are acceptable.