Quick Reference
Technique | Impact | Effort | Priority | When to Use |
---|---|---|---|---|
Predicate Pushdown | 10x+ | Medium | High | Selective queries |
Row Group Pruning | 5-10x | Low | High | Columnar storage |
Result Caching | 100-1000x | Low | High | Repeated queries |
Async I/O | 2-3x | Medium | Medium | Multi-partition reads |
Database Indexes | 10-100x | Low | High | Point queries |
SIMD | Variable | High | Low | Hot CPU paths |
Database optimization follows a hierarchy. I/O reduction beats CPU optimization. Caching beats computation. Filtering early beats filtering late.
This guide covers six optimization techniques with performance impact, code examples, and when to apply each. Part 2 covers a real case study replacing DuckDB.
Predicate Pushdown
Predicate pushdown filters data at the storage layer instead of in memory. For columnar databases, this reduces I/O by 70-90% for selective queries.
The technique uses two levels: row group pruning (skip entire chunks based on statistics) and record batch filtering (filter before deserializing). Both levels work together to minimize data movement.
Filter Selectivity | Data Kept | I/O Reduction | Expected Speedup |
---|---|---|---|
High (1% match) | 1% | 90-95% | 10-20x |
Medium (30% match) | 30% | 60-70% | 3-5x |
Low (70% match) | 70% | 20-30% | 1.5-2x |
Implementation
use anyhow::{anyhow, Result}; // assuming anyhow's Result alias, matching Result<BooleanArray>
use arrow::array::{BooleanArray, Float64Array};
use arrow::compute::kernels::cmp::gt_eq; // kernel path varies across arrow versions
use arrow::record_batch::RecordBatch;

pub fn filter_to_arrow_predicate(
    batch: &RecordBatch,
    filter: &QueryFilter,
) -> Result<BooleanArray> {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            // Prices arrive as fixed-point i64 with 4 decimal places
            let threshold = *threshold_i64 as f64 / 10000.0;
            // Column 4 holds the close price
            let close_array = batch
                .column(4)
                .as_any()
                .downcast_ref::<Float64Array>()
                .ok_or_else(|| anyhow!("close column is not Float64"))?;
            let threshold_array = Float64Array::from(
                vec![threshold; close_array.len()]
            );
            // Arrow compute kernel - SIMD accelerated
            Ok(gt_eq(close_array, &threshold_array)?)
        }
    }
}
The code translates query filters to Arrow compute predicates. Arrow's gt_eq kernel uses SIMD instructions automatically. No manual vectorization required.
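Applying the predicate is one more kernel call. A minimal sketch, assuming the filter_to_arrow_predicate function above and an anyhow-style Result:
use arrow::compute::filter_record_batch;
use arrow::record_batch::RecordBatch;

// Keep only rows where the predicate is true, before any deserialization into
// application structs. QueryFilter and filter_to_arrow_predicate are from the
// snippet above.
fn apply_filter(batch: &RecordBatch, filter: &QueryFilter) -> anyhow::Result<RecordBatch> {
    let predicate = filter_to_arrow_predicate(batch, filter)?;
    Ok(filter_record_batch(batch, &predicate)?)
}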
When to Use
Use predicate pushdown when:
- Storage format is columnar (Parquet, Arrow, ORC)
- Queries include WHERE clauses
- Filters are selective (high cardinality columns)
- I/O is the bottleneck
Don't use when:
- Queries scan full tables
- Storage format lacks statistics
- Filters have low selectivity (>80% of rows match)
Trade-offs
Predicate pushdown requires storage statistics. Parquet and ORC write min/max stats by default. Legacy formats may not support this.
False positives are possible. A row group with stats [min 440, max 460] must be read for the filter "close >= 450" even if only a handful of its rows actually reach 450; most of that read is wasted I/O, but the results are still correct.
Implementation complexity is medium. Requires understanding of storage format internals and Arrow compute kernels.
Row Group Pruning
Row group pruning skips entire data chunks based on metadata statistics. For a query with filter "close >= 450", any row group with max_close < 450 can be skipped entirely.
This optimization is specific to formats that partition data into groups (Parquet row groups, ORC stripes). Metadata is small (~10KB) compared to data (~1MB per group).
How It Works
Parquet files contain metadata with statistics per row group:
Parquet File
├── Metadata (10KB)
│   ├── Row Group 0: {min_close: 440.0, max_close: 445.0}
│   ├── Row Group 1: {min_close: 445.0, max_close: 450.0}
│   └── Row Group 2: {min_close: 450.0, max_close: 455.0}
├── Row Group 0 (100KB compressed)
├── Row Group 1 (100KB compressed)
└── Row Group 2 (100KB compressed)
For filter "close >= 450", the pruning logic is:
Row Group | min_close | max_close | Action | Reason |
---|---|---|---|---|
0 | 440.0 | 445.0 | SKIP | max < 450 (all rows fail) |
1 | 445.0 | 450.0 | READ | overlaps range |
2 | 450.0 | 455.0 | READ | all rows match |
Row Group 0 is never read from disk. I/O saved: 33%.
Implementation
use parquet::file::metadata::RowGroupMetaData;

fn filter_matches_row_group(
    row_group: &RowGroupMetaData,
    filter: &QueryFilter,
) -> bool {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            let threshold = *threshold_i64 as f64 / 10000.0;
            // Column 4 holds the close price
            if let Some(stats) = row_group.column(4).statistics() {
                if let Some(max_bytes) = stats.max_bytes_opt() {
                    if let Ok(max_close) = parse_f64_from_bytes(max_bytes) {
                        // Read the group only if its max can satisfy the filter;
                        // skip when every row is guaranteed to fail (max < threshold)
                        return max_close >= threshold;
                    }
                }
            }
            // No stats = assume match (safe but not optimal)
            true
        }
    }
}
Statistics parsing uses little-endian byte order. If statistics are missing, the function returns true to avoid skipping data incorrectly.
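The parse_f64_from_bytes helper is not shown above; a minimal sketch, assuming the close column is a Parquet DOUBLE whose min/max statistics are stored as 8 little-endian bytes:
use std::array::TryFromSliceError;

// Parquet encodes DOUBLE statistics as 8 bytes, little-endian.
// Returns an error if the statistics blob has an unexpected length.
fn parse_f64_from_bytes(bytes: &[u8]) -> Result<f64, TryFromSliceError> {
    let raw: [u8; 8] = bytes.try_into()?;
    Ok(f64::from_le_bytes(raw))
}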
When to Use
Use row group pruning when:
- Storage format has per-group statistics (Parquet, ORC)
- Data is sorted or clustered by filter columns
- Row groups are large enough (>10K rows) for overhead to be worthwhile
Effectiveness depends on data distribution. If prices are randomly distributed across row groups, pruning helps less. If data is time-ordered and prices trend, pruning helps more.
Result Caching
Result caching stores query results in memory. For repeated queries, this eliminates all I/O and computation. A cache hit is 50-100x faster than a disk read.
The cache key must include everything that affects results: symbol, time range, filters, aggregation. Hash the full query object for the key.
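A minimal sketch of such a key, with illustrative field names (not from any specific crate); deriving Hash and Eq lets the whole query object act as the cache key:
// Illustrative query type: any field that changes the result set is part of the key.
// Deriving Hash + Eq means two queries that differ in any field hit different entries.
#[derive(Clone, Hash, PartialEq, Eq)]
struct Query {
    symbol: String,
    start_ts: i64,               // time range start (epoch micros, for example)
    end_ts: i64,                 // time range end
    filters: Vec<(String, i64)>, // (column, fixed-point threshold) pairs
    aggregation: Option<String>, // e.g. "vwap"; None for raw bars
}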
Metric | Without Cache | With Cache (80% hit) | Speedup |
---|---|---|---|
P50 latency | 1.5ms | 0.3ms | 5x |
P99 latency | 5ms | 0.4ms | 12.5x |
Throughput | 667 qps | 3,333 qps | 5x |
Cache hit rate determines overall speedup: expected latency is hit_rate × hit latency + (1 − hit_rate) × miss latency, which is how an 80% hit rate turns a 1.5ms P50 into roughly 0.3ms. An 80% hit rate is typical for dashboard workloads; interactive analytics often exceeds 90%.
Implementation
// Check cache first
if let Some(cached_bars) = self.cache.borrow_mut().get(query) {
let execution_time_us = start.elapsed().as_micros() as u64;
return Ok(QueryResult {
bars: cached_bars.clone(),
execution_time_us,
from_cache: true,
});
}
// Execute query
let result = storage.read_with_filters(...)?;
// Cache result
self.cache.borrow_mut().put(query.clone(), result.clone());
The implementation uses LRU eviction. Least recently used entries are evicted when cache is full. TTL (time-to-live) can be added for data that changes frequently.
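One way to get that behavior is a bounded LRU map; a minimal sketch using the lru crate (an assumed crate choice, not named in this article), with the illustrative Query key above and a Bar result type:
use std::num::NonZeroUsize;
use lru::LruCache; // assumed crate; any bounded LRU map works

// Capacity 1,024 entries: inserting the 1,025th distinct query evicts the
// least recently used one. Query and Bar are the illustrative types above.
let mut cache: LruCache<Query, Vec<Bar>> = LruCache::new(NonZeroUsize::new(1024).unwrap());

if let Some(bars) = cache.get(&query) {
    return Ok(bars.clone()); // hit: no disk read, no filtering
}
let bars = storage.read_with_filters(&query)?;
cache.put(query.clone(), bars.clone());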
When to Use
Use caching when:
- Queries repeat (dashboards, APIs)
- Data is read-heavy (10:1 read:write ratio or higher)
- Results fit in memory
- Stale data is acceptable for short periods
Don't use when:
- Every query is unique
- Data changes frequently (high write rate)
- Memory is limited
- Consistency is critical
Cache Invalidation
Cache invalidation is the hard part. Three strategies:
Strategy | Accuracy | Complexity | Use Case |
---|---|---|---|
TTL (time-based) | Low | Low | Infrequent writes |
Write-through | High | Medium | Moderate write rate |
Version stamps | High | High | Complex dependencies |
TTL invalidation expires entries after N seconds. Simple but may serve stale data.
Write-through invalidation clears affected entries on write. Requires tracking which queries touch which data.
Version stamps attach a version number to the data. Each cache entry records the version it was built from, and every write increments the version, so stale entries are detected and re-filled on the next lookup.
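A minimal sketch of version stamps, assuming a single global version counter (real systems may version per table or per partition) and the illustrative Query and Bar types from above:
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

struct VersionedCache {
    version: AtomicU64,                       // bumped on every write
    entries: HashMap<Query, (u64, Vec<Bar>)>, // (version at fill time, result)
}

impl VersionedCache {
    fn get(&self, query: &Query) -> Option<&Vec<Bar>> {
        let current = self.version.load(Ordering::Acquire);
        match self.entries.get(query) {
            // Serve only entries built against the current data version
            Some((v, bars)) if *v == current => Some(bars),
            _ => None, // stale or missing: caller re-executes the query
        }
    }

    fn invalidate_on_write(&self) {
        // A write anywhere bumps the version; stale entries expire lazily
        self.version.fetch_add(1, Ordering::Release);
    }
}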
Async I/O for Multi-Partition Reads
Async I/O parallelizes reads from multiple partitions. For time-series databases partitioned by day, a query spanning 5 days can read all 5 partitions concurrently.
The speedup depends on storage parallelism. SSDs support high queue depth. HDDs benefit less from parallelism.
Speedup Analysis
Partitions | Sequential | Parallel | Speedup |
---|---|---|---|
1 | 1.0ms | 1.0ms | 1.0x |
2 | 2.0ms | 1.1ms | 1.8x |
5 | 5.0ms | 1.8ms | 2.8x |
10 | 10.0ms | 3.2ms | 3.1x |
Speedup saturates around 3x. Synchronization overhead and I/O contention limit scaling. Network-attached storage may scale better than local disks.
Implementation
// Launch concurrent reads
let futures: Vec<_> = partitions
.iter()
.map(|partition| {
let storage = storage.clone();
let symbol = symbol.to_string();
let partition = partition.to_string();
async move {
storage.read_async(&symbol, &partition).await
}
})
.collect();
// Wait for all reads
let results = futures::future::join_all(futures).await;
// Merge results
let mut all_bars = Vec::new();
for result in results {
all_bars.extend(result?);
}
Each partition read runs on a thread pool: read_async wraps the blocking file read in tokio::task::spawn_blocking so the async runtime's worker threads are never blocked.
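The snippet above only shows the fan-out; the blocking read lives inside read_async. A minimal sketch of that wrapper, with illustrative Storage, Bar, and read_partition_blocking names rather than any specific library's API:
use std::sync::Arc;
use tokio::task;

// Hypothetical wrapper: the synchronous Parquet read runs on tokio's blocking
// thread pool, so async worker threads stay free to drive the other partition
// reads launched by join_all.
async fn read_async(
    storage: Arc<Storage>,
    symbol: String,
    partition: String,
) -> anyhow::Result<Vec<Bar>> {
    task::spawn_blocking(move || storage.read_partition_blocking(&symbol, &partition)).await?
}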
When to Use
Use async I/O when:
- Queries span multiple partitions
- Storage supports parallelism (SSD, distributed storage)
- Partitions are independent (no cross-partition dependencies)
- Latency matters more than CPU usage
Don't use when:
- Single partition queries dominate
- Storage is slow (HDD) or saturated
- Thread overhead exceeds I/O time
Async I/O adds complexity. For single-partition queries, synchronous reads are simpler and faster.
Database Indexes
Indexes are lookup structures that accelerate queries. B-tree indexes support range queries. Hash indexes support equality. Bitmap indexes support low-cardinality columns.
The trade-off is write speed vs read speed. Every insert must update all indexes. For write-heavy workloads, indexes may slow overall throughput.
Index Type Comparison
Index Type | Best For | Write Overhead | Read Speedup |
---|---|---|---|
B-tree | Range queries | 20-30% | 10-100x |
Hash | Equality | 10-20% | 50-200x |
Bitmap | Low cardinality | 5-10% | 20-50x |
Covering | SELECT subset | 30-50% | 100-1000x |
B-tree indexes are general-purpose. They support <, >, =, BETWEEN. Write overhead is moderate.
Hash indexes are faster for exact match. They don't support range queries. Write overhead is lower than B-tree.
Bitmap indexes compress well for columns with few distinct values (country, status, category). They support AND/OR operations efficiently.
Covering indexes include every column in the SELECT clause, so queries never touch the base table. Write overhead is high because the index duplicates data.
When to Index
Index when:
- Query is slow due to full table scan
- Filter selectivity is high (<10% of rows match)
- Table is large (>100K rows)
- Read:write ratio is high (>10:1)
Don't index when:
- Table is small (<10K rows)
- Filter selectivity is low (>30% of rows match)
- Write rate is high (index maintenance overhead exceeds query savings)
Trade-offs
Indexes consume disk space. A B-tree index is ~50% of indexed column size. Multiple indexes multiply storage cost.
Indexes must be maintained. INSERT, UPDATE, DELETE operations update all indexes. For bulk loads, drop indexes, load data, rebuild indexes.
Index statistics must be current. Stale statistics lead to poor query plans. Run ANALYZE after significant data changes.
When SIMD Fails
SIMD (Single Instruction, Multiple Data) processes 2-8 values per CPU instruction. Modern x86 CPUs support AVX2 (256-bit) and AVX-512 (512-bit) instructions.
SIMD requires specific data layout. Struct-of-Arrays (SoA) layout works. Array-of-Structs (AoS) layout doesnβt.
Data Layout Compatibility
Data Layout | SIMD Compatible | Why |
---|---|---|
Struct-of-Arrays | ✅ Yes | Contiguous fields |
Array-of-Structs | ❌ No | Scattered fields |
Columnar (Arrow) | ✅ Yes | Already columnar |
For time-series data stored as Vec<OHLCV>, each OHLCV is a struct:
struct OHLCV {
timestamp: i64,
open: f64,
high: f64,
low: f64,
close: f64,
volume: u64,
}
This is AoS layout. To SIMD-process the close field, the code must first gather the scattered close values, one per struct, into a contiguous temporary array (skipping timestamp, open, high, low, and volume). That extraction overhead exceeds the SIMD benefit, as the sketch below shows.
Why It Failed
Attempted SIMD aggregation on AoS data regressed every operation, running at 0.54-0.60x of scalar speed:
Operation | Scalar (ns) | SIMD (ns) | Result |
---|---|---|---|
VWAP | 1,340 | 2,486 | 0.54x (86% regression) |
Average | 412 | 687 | 0.60x (67% regression) |
Sum | 339 | 566 | 0.60x (67% regression) |
Field extraction dominated runtime. Scalar code accessed struct fields directly. SIMD code copied fields to temporary arrays.
The Fix
Use columnar storage (Parquet/Arrow). Data is already in SoA layout. Arrow compute kernels use SIMD automatically.
// Arrow compute - SIMD automatic
let close_array = batch.column(4)
    .as_any()
    .downcast_ref::<Float64Array>()
    .ok_or_else(|| anyhow!("close column is not Float64"))?;
// gt_eq uses SIMD internally
let predicate = arrow::compute::gt_eq(close_array, &threshold_array)?;
No manual SIMD code required. Arrow handles it at the library level.
Lesson Learned
Data layout matters more than instruction optimization. Columnar storage enables SIMD without manual work. Attempting SIMD on AoS layout wastes time and regresses performance.
Optimization Decision Tree
Start by profiling. Find the bottleneck before optimizing.
Is query slow?
├── Yes, I/O-bound?
│   ├── Selective filters? → Predicate Pushdown (10x)
│   ├── Multiple partitions? → Async I/O (3x)
│   └── Repeated queries? → Caching (100x+)
└── Yes, CPU-bound?
    ├── Hot aggregation path? → SIMD (if columnar)
    └── Complex computation? → Algorithm optimization
For I/O-bound queries, optimize I/O first. CPU optimization won't help.
For CPU-bound queries, verify the data layout supports optimization. Don't assume SIMD helps without measuring.
Priority Order
- Profile to find bottleneck (measure, don't guess)
- Cache if queries repeat (100-1000x gain, low effort)
- Predicate pushdown if filters are selective (10x gain, medium effort)
- Async I/O if multi-partition (3x gain, medium effort)
- Indexes if point queries (10-100x gain, low effort)
- Only then consider CPU optimizations
The highest-leverage optimizations eliminate work entirely. Caching eliminates all work. Predicate pushdown eliminates 90% of I/O. CPU optimization makes remaining work 2x faster.
Optimize to eliminate work before optimizing to do work faster.
Summary
Database optimization follows a hierarchy: eliminate work, reduce I/O, then optimize CPU.
Predicate pushdown reduces I/O by filtering at storage layer. Row group pruning skips data chunks based on statistics. Result caching eliminates repeated computation. Async I/O parallelizes multi-partition reads. Indexes accelerate lookups. SIMD requires columnar data layout.
The highest gains come from I/O reduction, not CPU optimization. A 10x I/O reduction beats a 2x CPU speedup.
Part 2 covers a real case study: replacing DuckDB with Rust and predicate pushdown for 10.4x speedup.