📚 Tổng quan
Chúng ta đã khám phá sâu về database indexing - từ cách B-Tree được xây dựng, các loại index khác nhau, đến strategies để tối ưu query performance. Đây là kiến thức nền tảng giúp thiết kế database hiệu quả và troubleshoot performance issues.
📋 Mục lục
1. 🔍 Index Fundamentals
Index là gì?
Index = Mục lục của database table
- Giống như mục lục trong sách, giúp tìm kiếm nhanh chóng
- Trade-off: Nhanh hơn khi đọc, chậm hơn khi ghi
- Lưu trữ: Cùng file với data (.ibd) nhưng khác pages
Mục đích chính:
-- Tăng tốc độ queries
SELECT * FROM students WHERE age = 20; -- 0.001s với index vs 0.5s không có -- Tăng tốc ORDER BY
SELECT * FROM students ORDER BY age; -- Instant với index -- Tăng tốc JOIN operations
SELECT s.name, c.course_name FROM students s JOIN courses c ON s.id = c.student_id; -- Enforce UNIQUE constraints
CREATE TABLE users (email VARCHAR(255) UNIQUE); -- Tự tạo index
Performance Impact:
📊 Typical improvements:
Point queries: 100-1000x faster
Range queries: 10-100x faster Sorting: 10-50x faster
Joins: 5-20x faster Cost: 20-30% storage overhead, slower writes
2. 🌳 B-Tree Structure & Construction
Tại sao B-Tree?
❌ Array: O(n) linear search - quá chậm
❌ Hash: O(1) point lookup nhưng không support range queries ✅ B-Tree: O(log n) + perfect range query support
Cấu trúc B-Tree:
🌳 B-Tree Example (age index): 📊 ROOT PAGE [22] ┌─────┴─────┐ / \ 📊 BRANCH PAGE 📊 BRANCH PAGE [19, 21] [25, 27] ┌─────┼─────┐ ┌─────┼─────┐ / | \ / | \
📋LEAF 📋LEAF 📋LEAF 📋LEAF 📋LEAF 📋LEAF
[17,18] [19,20] [21,22] [23,24] [25,26] [27,28] ↓ ↓ ↓ ↓ ↓ ↓
Point to actual data pages (Page_ID, Slot_Number)
Quá trình xây dựng Index:
Step 1: Data Collection
CREATE INDEX idx_age ON students(age); Behind the scenes:
1. 🔍 Scan toàn bộ table
2. 📊 Extract (age, pointer) pairs: [(20,ptr1), (18,ptr2), (22,ptr3)...]
3. 🔄 Sort by age: [(17,ptr6), (18,ptr2), (19,ptr4), (20,ptr1)...]
Step 2: Build Tree (Bottom-Up)
📋 Leaf Level: Tạo sorted leaf pages với capacity ~1,300 entries/page
📊 Branch Level: Tạo internal nodes để navigate
📊 Root Level: Single root node pointing to branches Final result: Balanced tree với height = ⌈log₁₃₃₃(records)⌉
Row Locators (Pointers):
Pointer Structure: (Page_ID, Slot_Number, Record_Length)
Example: ptr1 = (Page_51, Slot_3, 128_bytes) 📄 Data Page 51:
┌─ Slot Directory ─────────────────┐
│ Slot 1 → Offset: 128 │
│ Slot 2 → Offset: 200 │ │ Slot 3 → Offset: 272 ← ptr1 │
│ Slot 4 → Offset: 344 │
├─ Records Data ──────────────────┤
│ @Offset 272: Alice's record │ ← 128 bytes
└──────────────────────────────────┘
3. 🔑 Primary Key vs Secondary Index
Clustered Index (PRIMARY KEY):
Đặc điểm:
- Data IS the index - leaf pages chứa actual records
- Physical ordering - data pages sorted theo PRIMARY KEY
- One per table - chỉ có 1 clustered index
- Foundation - tất cả secondary indexes point về PRIMARY KEY
Cấu trúc:
🌳 PRIMARY KEY B-Tree: [ROOT: id ranges] / \ [BRANCH: id≤1000] [BRANCH: id>1000] / | \ / | \ 📄DATA 📄DATA 📄DATA 📄DATA 📄DATA 📄DATA Page51 Page52 Page53 Page54 Page55 Page56 📄 Data Page 51 (actual records, sorted by id):
[id:1,name:"Alice",age:20][id:2,name:"Bob",age:18][id:3,name:"Charlie",age:22] = No separate pointer needed!
Secondary Index:
Đặc điểm:
- Pointer system - leaf pages chứa references đến PRIMARY KEY
- Multiple allowed - có thể có nhiều secondary indexes
- Double lookup - secondary → primary → data
Cấu trúc:
🌿 Secondary Index (age):
📋 Leaf Pages:
[age:17→id:5][age:18→id:2][age:19→id:4][age:20→id:1][age:21→id:3] ↓ ↓ ↓ ↓ ↓
Points to PRIMARY KEY values, not direct page locations Query execution:
1. age=20 → finds id=1 from secondary index
2. id=1 → lookup in clustered index for actual data
Có vs Không có PRIMARY KEY:
✅ Có PRIMARY KEY:
CREATE TABLE students (id INT PRIMARY KEY, age INT);
CREATE INDEX idx_age ON students(age); Secondary index: age(4B) + id(4B) = 8B per entry ❌ Không có PRIMARY KEY: CREATE TABLE students_no_pk (student_id INT, age INT);
CREATE INDEX idx_age ON students_no_pk(age); MySQL tự tạo hidden _row_id (6 bytes):
Secondary index: age(4B) + _row_id(6B) = 10B per entry
= 25% storage overhead! Recommendation: LUÔN có PRIMARY KEY!
4. 🔗 Composite Indexes & Leftmost Prefix
Composite Index Structure:
CREATE INDEX idx_age_gpa ON students(age, gpa); 🌳 Composite B-Tree:
Leaf level sorted hierarchically:
[(20,3.2)→id:2][(20,3.5)→id:1][(20,3.5)→id:4][(21,3.2)→id:5][(21,3.5)→id:3] ↑ ↑ ↑ ↑ ↑ age first, then gpa within same age
Leftmost Prefix Rule:
-- Index: (age, gpa, major)
CREATE INDEX idx_triple ON students(age, gpa, major); ✅ Can use index efficiently:
WHERE age = 20 -- Uses (age) prefix
WHERE age = 20 AND gpa = 3.5 -- Uses (age, gpa) prefix
WHERE age = 20 AND gpa = 3.5 AND major = 'CS' -- Uses full index
WHERE age BETWEEN 20 AND 22 -- Range on first column ❌ Cannot use index efficiently:
WHERE gpa = 3.5 -- Skips first column!
WHERE major = 'CS' -- Skips first columns!
WHERE gpa = 3.5 AND major = 'CS' -- Skips first column!
WHERE age = 20 AND major = 'CS' -- Gap in middle (partial use)
Column Order Strategy:
-- Query pattern analysis:
-- 30% queries: WHERE age = ?
-- 40% queries: WHERE age = ? AND gpa >= ?
-- 10% queries: WHERE age = ? AND gpa >= ? AND major = ? -- Optimal design:
CREATE INDEX idx_age_gpa_major ON students(age, gpa, major);
-- Covers 80% of query patterns efficiently! -- Wrong design:
CREATE INDEX idx_major_gpa_age ON students(major, gpa, age); -- Only covers 10% efficiently!
5. 🌟 Clustered vs Covering Indexes
Clustered Index:
Concept:
Clustered = Data pages physically ordered by index key
InnoDB: PRIMARY KEY is always clustered
Benefits:
-- Range queries super fast (sequential I/O):
SELECT * FROM orders WHERE id BETWEEN 1000 AND 2000; -- ORDER BY primary key is free:
SELECT * FROM orders WHERE customer_id = 123 ORDER BY id; -- Count queries optimized:
SELECT COUNT(*) FROM orders WHERE id < 1000;
Best Practices:
✅ Good PRIMARY KEY:
id BIGINT AUTO_INCREMENT PRIMARY KEY -- Sequential, narrow ❌ Poor PRIMARY KEY: uuid CHAR(36) PRIMARY KEY -- Random, wide
(customer_id, order_date) PRIMARY KEY -- Too wide
Covering Index:
Concept:
Covering Index = Index contains ALL columns needed for query
→ No data page access required!
Example:
-- Frequent query:
SELECT id, name, age FROM students WHERE major = 'CS' ORDER BY gpa DESC; ❌ Regular index:
CREATE INDEX idx_major ON students(major);
Execution: Index lookup → Data page access for name, age, gpa ✅ Covering index:
CREATE INDEX idx_major_gpa_covering ON students(major, gpa DESC, name, age);
Execution: Index lookup only → All data available in index! Performance: 2-5x faster!
Design Pattern:
-- Column order for covering index:
CREATE INDEX idx_covering ON table( where_column1, -- WHERE clause columns first where_column2, order_column DESC, -- ORDER BY columns next select_column1, -- SELECT columns last select_column2
); -- Primary key (id) always available automatically
6. 🎯 Index Design Strategies
Query Pattern Analysis:
Step 1: Identify Common Patterns
-- Use performance_schema to find frequent queries
SELECT digest_text, count_star, avg_timer_wait/1000000000 AS avg_seconds
FROM performance_schema.events_statements_summary_by_digest WHERE digest_text LIKE '%your_table%'
ORDER BY count_star DESC;
Step 2: Design Index Strategy
-- Pattern examples:
-- 40% queries: WHERE status = ? ORDER BY created_date DESC
-- 30% queries: WHERE user_id = ? AND status = ?
-- 20% queries: WHERE created_date >= ?
-- 10% queries: Complex multi-column filters -- Optimal indexes:
CREATE INDEX idx_status_date ON table(status, created_date DESC); -- 40%
CREATE INDEX idx_user_status ON table(user_id, status); -- 30%
CREATE INDEX idx_created_date ON table(created_date); -- 20%
-- 10% can use combinations of above indexes
Advanced Patterns:
Partial Indexes:
-- Index only active records
CREATE INDEX idx_active_users ON users(last_login) WHERE status = 'active'; -- Index only recent data
CREATE INDEX idx_recent_orders ON orders(created_date) WHERE created_date >= '2024-01-01';
Expression Indexes:
-- Index on computed values
CREATE INDEX idx_full_name ON users(CONCAT(first_name, ' ', last_name)); -- Case-insensitive search
CREATE INDEX idx_email_lower ON users(LOWER(email));
7. 📊 Performance Analysis
B-Tree Height Impact:
📈 Tree Height vs Performance:
Records | Tree Height | Max I/O Ops
1,000 | 1 | 1-2
100,000 | 2 | 2-3 10,000,000 | 3 | 3-4
1,000,000,000| 4 | 4-5 Formula: height ≈ ⌈log₁₃₃₃(records)⌉
Query Performance Comparison:
-- Benchmark: 1M records table Query: SELECT * FROM products WHERE category_id = 5 AND price > 100; ❌ No index (Full table scan):
- I/O operations: 15,000+ (scan all pages) - Time: 2-5 seconds
- CPU: High (filter every record) ✅ Single index on category_id:
- I/O operations: ~500 (scan matching categories, filter price)
- Time: 100-200ms
- CPU: Medium (filter price for category matches) ✅ Composite index (category_id, price): - I/O operations: ~50 (direct range scan)
- Time: 10-20ms - CPU: Low (pre-filtered results) ✅ Covering index (category_id, price, name, description):
- I/O operations: ~25 (index-only scan)
- Time: 5-10ms
- CPU: Minimal Performance scaling: 500x improvement từ no-index đến covering index!
Write Performance Impact:
-- Table với multiple indexes:
CREATE TABLE products ( id INT PRIMARY KEY, name VARCHAR(100), category_id INT, price DECIMAL(10,2), created_date DATE, INDEX idx_category (category_id), -- Index #2 INDEX idx_price (price), -- Index #3 INDEX idx_date (created_date), -- Index #4 INDEX idx_composite (category_id, price) -- Index #5
); INSERT performance impact:
- No indexes: 1,000 inserts/second
- 5 indexes: ~200 inserts/second (5x slower)
- Storage overhead: ~40% additional space Trade-off: Faster reads vs slower writes
8. 🛠️ Best Practices
Index Design Rules:
✅ DO:
-- Analyze query patterns first
-- Create indexes for WHERE, ORDER BY, JOIN columns
-- Use composite indexes for multi-column filters
-- Consider covering indexes for hot queries
-- Put most frequently filtered columns first in composite indexes
-- Keep PRIMARY KEY narrow and sequential
-- Monitor index usage and remove unused indexes ❌ DON'T:
-- Create indexes on every column "just in case"
-- Ignore the leftmost prefix rule
-- Use random UUIDs as PRIMARY KEY
-- Create redundant indexes
-- Forget about write performance impact
-- Create covering indexes for rarely used queries
Performance Optimization Checklist:
🔍 For each slow query:
1. Check if WHERE columns have indexes
2. Verify optimal composite index column order 3. Consider covering index opportunity
4. Analyze EXPLAIN output for full scans
5. Monitor index hit ratios
6. Look for unused or duplicate indexes 🎯 For each new feature:
1. Identify query patterns early
2. Design indexes before going to production
3. Test with realistic data volumes
4. Monitor performance metrics
5. Plan for data growth (index maintenance)
Common Anti-Patterns:
❌ Over-indexing:
-- Too many indexes per table (>5-7 is usually excessive)
CREATE INDEX idx1 ON products(name);
CREATE INDEX idx2 ON products(category_id); CREATE INDEX idx3 ON products(price);
CREATE INDEX idx4 ON products(created_date);
CREATE INDEX idx5 ON products(updated_date);
CREATE INDEX idx6 ON products(status);
-- Write performance will suffer! ❌ Wrong column order:
-- Query: WHERE age = 20 AND name LIKE 'John%'
CREATE INDEX idx_name_age ON users(name, age); -- Wrong!
-- Should be: (age, name) because age is more selective ❌ Redundant indexes:
CREATE INDEX idx_age ON students(age);
CREATE INDEX idx_age_gpa ON students(age, gpa); -- idx_age is redundant! idx_age_gpa covers age queries too.
🎉 Key Takeaways
Core Concepts:
-
B-Tree = Balanced Search Structure
- O(log n) search performance
- Supports both point and range queries
- Self-balancing for consistent performance
-
Index Types Matter
- Clustered (PRIMARY KEY): Data IS the index
- Secondary: Pointers to clustered index
- Composite: Multi-column indexes với leftmost prefix rule
- Covering: All needed columns in index
-
Design Principles
- Query patterns drive index design
- Column order matters in composite indexes
- Trade-off between read speed vs write speed
- Monitor usage và remove unused indexes
Performance Guidelines:
🚀 Typical Performance Gains:
- Point queries: 100-1000x faster với proper index
- Range queries: 10-100x faster - Covering indexes: 2-10x faster than regular indexes
- Clustered range scans: 10-100x faster than random I/O 💰 Costs:
- Storage: 20-40% overhead for indexes
- Write performance: 2-5x slower với many indexes
- Memory: Index pages compete với data pages in buffer pool
When to Use What:
🌳 Clustered Index: Always design good PRIMARY KEY
📋 Single Index: Simple WHERE conditions
🔗 Composite Index: Multi-column WHERE, complex queries 📊 Covering Index: Frequent queries với predictable columns
🎯 Partial Index: Filtered datasets, conditional logic
💡 Final Insights
Database indexing là nghệ thuật của việc balance giữa read performance và write cost.
Từ việc hiểu cách B-Tree hoạt động ở mức byte, đến thiết kế index strategies cho application - mọi quyết định đều ảnh hưởng đến user experience và system scalability.
Remember:
- ✅ Index design follows query patterns, không phải table structure
- ✅ Composite indexes powerful nhưng cần hiểu leftmost prefix rule
- ✅ Covering indexes eliminate I/O nhưng increase storage
- ✅ Clustered index foundation cho tất cả performance optimizations
- ✅ Monitor và maintain indexes là ongoing process