Quick Reference
Technique | Impact | Effort | Priority | When to Use |
---|---|---|---|---|
Predicate Pushdown | 10x+ | Medium | High | Selective queries |
Row Group Pruning | 5-10x | Low | High | Columnar storage |
Result Caching | 100-1000x | Low | High | Repeated queries |
Async I/O | 2-3x | Medium | Medium | Multi-partition reads |
Database Indexes | 10-100x | Low | High | Point queries |
SIMD | Variable | High | Low | Hot CPU paths |
Database optimization follows a hierarchy. I/O reduction beats CPU optimization. Caching beats computation. Filtering early beats filtering late.
This guide covers six optimization techniques with performance impact, code examples, and when to apply each. Part 2 covers a real case study replacing DuckDB.
Predicate Pushdown
Predicate pushdown filters data at the storage layer instead of in memory. For columnar databases, this reduces I/O by 70-90% for selective queries.
The technique uses two levels: row group pruning (skip entire chunks based on statistics) and record batch filtering (filter before deserializing). Both levels work together to minimize data movement.
Filter Selectivity | Data Kept | I/O Reduction | Expected Speedup |
---|---|---|---|
High (1% match) | 1% | 90-95% | 10-20x |
Medium (30% match) | 30% | 60-70% | 3-5x |
Low (70% match) | 70% | 20-30% | 1.5-2x |
Implementation
use anyhow::{anyhow, Result}; // assuming anyhow's Result alias, matching Result<BooleanArray>
use arrow::array::{BooleanArray, Float64Array};
use arrow::compute::kernels::cmp::gt_eq; // kernel path varies across arrow versions
use arrow::record_batch::RecordBatch;

pub fn filter_to_arrow_predicate(
    batch: &RecordBatch,
    filter: &QueryFilter,
) -> Result<BooleanArray> {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            // Prices arrive as fixed-point i64 with 4 decimal places
            let threshold = *threshold_i64 as f64 / 10000.0;
            // Column 4 holds the close price
            let close_array = batch
                .column(4)
                .as_any()
                .downcast_ref::<Float64Array>()
                .ok_or_else(|| anyhow!("close column is not Float64"))?;
            let threshold_array = Float64Array::from(
                vec![threshold; close_array.len()]
            );
            // Arrow compute kernel - SIMD accelerated
            Ok(gt_eq(close_array, &threshold_array)?)
        }
    }
}
The code translates query filters to Arrow compute predicates. Arrow's gt_eq kernel uses SIMD instructions automatically. No manual vectorization required.
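Applying the predicate is one more kernel call. A minimal sketch, assuming the filter_to_arrow_predicate function above and an anyhow-style Result:
use arrow::compute::filter_record_batch;
use arrow::record_batch::RecordBatch;

// Keep only rows where the predicate is true, before any deserialization into
// application structs. QueryFilter and filter_to_arrow_predicate are from the
// snippet above.
fn apply_filter(batch: &RecordBatch, filter: &QueryFilter) -> anyhow::Result<RecordBatch> {
    let predicate = filter_to_arrow_predicate(batch, filter)?;
    Ok(filter_record_batch(batch, &predicate)?)
}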
When to Use
Use predicate pushdown when:
- Storage format is columnar (Parquet, Arrow, ORC)
- Queries include WHERE clauses
- Filters are selective (high cardinality columns)
- I/O is the bottleneck
Don't use when:
- Queries scan full tables
- Storage format lacks statistics
- Filters have low selectivity (>80% of rows match)
Trade-offs
Predicate pushdown requires storage statistics. Parquet and ORC write min/max stats by default. Legacy formats may not support this.
False positives are possible. A row group with stats [min 440, max 460] must be read for the filter "close >= 450" even if only a handful of its rows actually reach 450; most of that read is wasted I/O, but the results are still correct.
Implementation complexity is medium. Requires understanding of storage format internals and Arrow compute kernels.
Row Group Pruning
Row group pruning skips entire data chunks based on metadata statistics. For a query with filter "close >= 450", any row group with max_close < 450 can be skipped entirely.
This optimization is specific to formats that partition data into groups (Parquet row groups, ORC stripes). Metadata is small (~10KB) compared to data (~1MB per group).
How It Works
Parquet files contain metadata with statistics per row group:
Parquet File
├── Metadata (10KB)
│   ├── Row Group 0: {min_close: 440.0, max_close: 445.0}
│   ├── Row Group 1: {min_close: 445.0, max_close: 450.0}
│   └── Row Group 2: {min_close: 450.0, max_close: 455.0}
├── Row Group 0 (100KB compressed)
├── Row Group 1 (100KB compressed)
└── Row Group 2 (100KB compressed)
For filter "close >= 450", the pruning logic is:
Row Group | min_close | max_close | Action | Reason |
---|---|---|---|---|
0 | 440.0 | 445.0 | SKIP | max < 450 (all rows fail) |
1 | 445.0 | 450.0 | READ | overlaps range |
2 | 450.0 | 455.0 | READ | all rows match |
Row Group 0 is never read from disk. I/O saved: 33%.
Implementation
use parquet::file::metadata::RowGroupMetaData;

fn filter_matches_row_group(
    row_group: &RowGroupMetaData,
    filter: &QueryFilter,
) -> bool {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            let threshold = *threshold_i64 as f64 / 10000.0;
            // Column 4 holds the close price
            if let Some(stats) = row_group.column(4).statistics() {
                if let Some(max_bytes) = stats.max_bytes_opt() {
                    if let Ok(max_close) = parse_f64_from_bytes(max_bytes) {
                        // Read the group only if its max can satisfy the filter;
                        // skip when every row is guaranteed to fail (max < threshold)
                        return max_close >= threshold;
                    }
                }
            }
            // No stats = assume match (safe but not optimal)
            true
        }
    }
}
Statistics parsing uses little-endian byte order. If statistics are missing, the function returns true to avoid skipping data incorrectly.
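The parse_f64_from_bytes helper is not shown above; a minimal sketch, assuming the close column is a Parquet DOUBLE whose min/max statistics are stored as 8 little-endian bytes:
use std::array::TryFromSliceError;

// Parquet encodes DOUBLE statistics as 8 bytes, little-endian.
// Returns an error if the statistics blob has an unexpected length.
fn parse_f64_from_bytes(bytes: &[u8]) -> Result<f64, TryFromSliceError> {
    let raw: [u8; 8] = bytes.try_into()?;
    Ok(f64::from_le_bytes(raw))
}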
When to Use
Use row group pruning when:
- Storage format has per-group statistics (Parquet, ORC)
- Data is sorted or clustered by filter columns
- Row groups are large enough (>10K rows) for overhead to be worthwhile
Effectiveness depends on data distribution. If prices are randomly distributed across row groups, pruning helps less. If data is time-ordered and prices trend, pruning helps more.
Result Caching
Result caching stores query results in memory. For repeated queries, this eliminates all I/O and computation. A cache hit is 50-100x faster than a disk read.
The cache key must include everything that affects results: symbol, time range, filters, aggregation. Hash the full query object for the key.
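A minimal sketch of such a key, with illustrative field names (not from any specific crate); deriving Hash and Eq lets the whole query object act as the cache key:
// Illustrative query type: any field that changes the result set is part of the key.
// Deriving Hash + Eq means two queries that differ in any field hit different entries.
#[derive(Clone, Hash, PartialEq, Eq)]
struct Query {
    symbol: String,
    start_ts: i64,               // time range start (epoch micros, for example)
    end_ts: i64,                 // time range end
    filters: Vec<(String, i64)>, // (column, fixed-point threshold) pairs
    aggregation: Option<String>, // e.g. "vwap"; None for raw bars
}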
Metric | Without Cache | With Cache (80% hit) | Speedup |
---|---|---|---|
P50 latency | 1.5ms | 0.3ms | 5x |
P99 latency | 5ms | 0.4ms | 12.5x |
Throughput | 667 qps | 3,333 qps | 5x |
Cache hit rate determines overall speedup: expected latency is hit_rate × hit latency + (1 − hit_rate) × miss latency, which is how an 80% hit rate turns a 1.5ms P50 into roughly 0.3ms. An 80% hit rate is typical for dashboard workloads; interactive analytics often exceeds 90%.
Implementation
// Check cache first
if let Some(cached_bars) = self.cache.borrow_mut().get(query) {
let execution_time_us = start.elapsed().as_micros() as u64;
return Ok(QueryResult {
bars: cached_bars.clone(),
execution_time_us,
from_cache: true,
});
}
// Execute query
let result = storage.read_with_filters(...)?;
// Cache result
self.cache.borrow_mut().put(query.clone(), result.clone());
The implementation uses LRU eviction. Least recently used entries are evicted when cache is full. TTL (time-to-live) can be added for data that changes frequently.
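One way to get that behavior is a bounded LRU map; a minimal sketch using the lru crate (an assumed crate choice, not named in this article), with the illustrative Query key above and a Bar result type:
use std::num::NonZeroUsize;
use lru::LruCache; // assumed crate; any bounded LRU map works

// Capacity 1,024 entries: inserting the 1,025th distinct query evicts the
// least recently used one. Query and Bar are the illustrative types above.
let mut cache: LruCache<Query, Vec<Bar>> = LruCache::new(NonZeroUsize::new(1024).unwrap());

if let Some(bars) = cache.get(&query) {
    return Ok(bars.clone()); // hit: no disk read, no filtering
}
let bars = storage.read_with_filters(&query)?;
cache.put(query.clone(), bars.clone());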
When to Use
Use caching when:
- Queries repeat (dashboards, APIs)
- Data is read-heavy (10:1 read:write ratio or higher)
- Results fit in memory
- Stale data is acceptable for short periods
Don't use when:
- Every query is unique
- Data changes frequently (high write rate)
- Memory is limited
- Consistency is critical
Cache Invalidation
Cache invalidation is the hard part. Three strategies:
Strategy | Accuracy | Complexity | Use Case |
---|---|---|---|
TTL (time-based) | Low | Low | Infrequent writes |
Write-through | High | Medium | Moderate write rate |
Version stamps | High | High | Complex dependencies |
TTL invalidation expires entries after N seconds. Simple but may serve stale data.
Write-through invalidation clears affected entries on write. Requires tracking which queries touch which data.
Version stamps attach a version number to the data. Each cache entry records the version it was built from, and every write increments the version, so stale entries are detected and re-filled on the next lookup.
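A minimal sketch of version stamps, assuming a single global version counter (real systems may version per table or per partition) and the illustrative Query and Bar types from above:
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

struct VersionedCache {
    version: AtomicU64,                       // bumped on every write
    entries: HashMap<Query, (u64, Vec<Bar>)>, // (version at fill time, result)
}

impl VersionedCache {
    fn get(&self, query: &Query) -> Option<&Vec<Bar>> {
        let current = self.version.load(Ordering::Acquire);
        match self.entries.get(query) {
            // Serve only entries built against the current data version
            Some((v, bars)) if *v == current => Some(bars),
            _ => None, // stale or missing: caller re-executes the query
        }
    }

    fn invalidate_on_write(&self) {
        // A write anywhere bumps the version; stale entries expire lazily
        self.version.fetch_add(1, Ordering::Release);
    }
}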
Async I/O for Multi-Partition Reads
Async I/O parallelizes reads from multiple partitions. For time-series databases partitioned by day, a query spanning 5 days can read all 5 partitions concurrently.
The speedup depends on storage parallelism. SSDs support high queue depth. HDDs benefit less from parallelism.
Speedup Analysis
Partitions | Sequential | Parallel | Speedup |
---|---|---|---|
1 | 1.0ms | 1.0ms | 1.0x |
2 | 2.0ms | 1.1ms | 1.8x |
5 | 5.0ms | 1.8ms | 2.8x |
10 | 10.0ms | 3.2ms | 3.1x |
Speedup saturates around 3x. Synchronization overhead and I/O contention limit scaling. Network-attached storage may scale better than local disks.
Implementation
// Launch concurrent reads
let futures: Vec<_> = partitions
.iter()
.map(|partition| {
let storage = storage.clone();
let symbol = symbol.to_string();
let partition = partition.to_string();
async move {
storage.read_async(&symbol, &partition).await
}
})
.collect();
// Wait for all reads
let results = futures::future::join_all(futures).await;
// Merge results
let mut all_bars = Vec::new();
for result in results {
all_bars.extend(result?);
}
Each partition read runs on a thread pool: read_async wraps the blocking file read in tokio::task::spawn_blocking so the async runtime's worker threads are never blocked.
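The snippet above only shows the fan-out; the blocking read lives inside read_async. A minimal sketch of that wrapper, with illustrative Storage, Bar, and read_partition_blocking names rather than any specific library's API:
use std::sync::Arc;
use tokio::task;

// Hypothetical wrapper: the synchronous Parquet read runs on tokio's blocking
// thread pool, so async worker threads stay free to drive the other partition
// reads launched by join_all.
async fn read_async(
    storage: Arc<Storage>,
    symbol: String,
    partition: String,
) -> anyhow::Result<Vec<Bar>> {
    task::spawn_blocking(move || storage.read_partition_blocking(&symbol, &partition)).await?
}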
When to Use
Use async I/O when:
- Queries span multiple partitions
- Storage supports parallelism (SSD, distributed storage)
- Partitions are independent (no cross-partition dependencies)
- Latency matters more than CPU usage
Don't use when:
- Single partition queries dominate
- Storage is slow (HDD) or saturated
- Thread overhead exceeds I/O time
Async I/O adds complexity. For single-partition queries, synchronous reads are simpler and faster.
Database Indexes
Indexes are lookup structures that accelerate queries. B-tree indexes support range queries. Hash indexes support equality. Bitmap indexes support low-cardinality columns.
The trade-off is write speed vs read speed. Every insert must update all indexes. For write-heavy workloads, indexes may slow overall throughput.
Index Type Comparison
Index Type | Best For | Write Overhead | Read Speedup |
---|---|---|---|
B-tree | Range queries | 20-30% | 10-100x |
Hash | Equality | 10-20% | 50-200x |
Bitmap | Low cardinality | 5-10% | 20-50x |
Covering | SELECT subset | 30-50% | 100-1000x |
B-tree indexes are general-purpose. They support <, >, =, BETWEEN. Write overhead is moderate.
Hash indexes are faster for exact match. They don't support range queries. Write overhead is lower than B-tree.
Bitmap indexes compress well for columns with few distinct values (country, status, category). They support AND/OR operations efficiently.
Covering indexes include every column in the SELECT clause, so queries never touch the base table. Write overhead is high because the index duplicates data.
When to Index
Index when:
- Query is slow due to full table scan
- Filter selectivity is high (<10% of rows match)
- Table is large (>100K rows)
- Read:write ratio is high (>10:1)
Don't index when:
- Table is small (<10K rows)
- Filter selectivity is low (>30% of rows match)
- Write rate is high (index maintenance overhead exceeds query savings)
Trade-offs
Indexes consume disk space. A B-tree index is ~50% of indexed column size. Multiple indexes multiply storage cost.
Indexes must be maintained. INSERT, UPDATE, DELETE operations update all indexes. For bulk loads, drop indexes, load data, rebuild indexes.
Index statistics must be current. Stale statistics lead to poor query plans. Run ANALYZE after significant data changes.
When SIMD Fails
SIMD (Single Instruction, Multiple Data) processes 2-8 values per CPU instruction. Modern x86 CPUs support AVX2 (256-bit) and AVX-512 (512-bit) instructions.
SIMD requires specific data layout. Struct-of-Arrays (SoA) layout works. Array-of-Structs (AoS) layout doesnβt.
Data Layout Compatibility
Data Layout | SIMD Compatible | Why |
---|---|---|
Struct-of-Arrays | ✅ Yes | Contiguous fields |
Array-of-Structs | ❌ No | Scattered fields |
Columnar (Arrow) | ✅ Yes | Already columnar |
For time-series data stored as Vec<OHLCV>, each OHLCV is a struct:
struct OHLCV {
timestamp: i64,
open: f64,
high: f64,
low: f64,
close: f64,
volume: u64,
}
This is AoS layout. To SIMD-process the close field, the code must first gather the scattered close values, one per struct, into a contiguous temporary array (skipping timestamp, open, high, low, and volume). That extraction overhead exceeds the SIMD benefit, as the sketch below shows.
Why It Failed
Attempted SIMD aggregation on AoS data regressed every operation, running at 0.54-0.60x of scalar speed:
Operation | Scalar (ns) | SIMD (ns) | Result |
---|---|---|---|
VWAP | 1,340 | 2,486 | 0.54x (86% regression) |
Average | 412 | 687 | 0.60x (67% regression) |
Sum | 339 | 566 | 0.60x (67% regression) |
Field extraction dominated runtime. Scalar code accessed struct fields directly. SIMD code copied fields to temporary arrays.
The Fix
Use columnar storage (Parquet/Arrow). Data is already in SoA layout. Arrow compute kernels use SIMD automatically.
// Arrow compute - SIMD automatic
let close_array = batch.column(4)
    .as_any()
    .downcast_ref::<Float64Array>()
    .ok_or_else(|| anyhow!("close column is not Float64"))?;
// gt_eq uses SIMD internally
let predicate = arrow::compute::gt_eq(close_array, &threshold_array)?;
No manual SIMD code required. Arrow handles it at the library level.
Lesson Learned
Data layout matters more than instruction optimization. Columnar storage enables SIMD without manual work. Attempting SIMD on AoS layout wastes time and regresses performance.
Optimization Decision Tree
Start by profiling. Find the bottleneck before optimizing.
Is query slow?
├── Yes, I/O-bound?
│   ├── Selective filters? → Predicate Pushdown (10x)
│   ├── Multiple partitions? → Async I/O (3x)
│   └── Repeated queries? → Caching (100x+)
└── Yes, CPU-bound?
    ├── Hot aggregation path? → SIMD (if columnar)
    └── Complex computation? → Algorithm optimization
For I/O-bound queries, optimize I/O first. CPU optimization won't help.
For CPU-bound queries, verify the data layout supports optimization. Don't assume SIMD helps without measuring.
Priority Order
- Profile to find bottleneck (measure, don't guess)
- Cache if queries repeat (100-1000x gain, low effort)
- Predicate pushdown if filters are selective (10x gain, medium effort)
- Async I/O if multi-partition (3x gain, medium effort)
- Indexes if point queries (10-100x gain, low effort)
- Only then consider CPU optimizations
The highest-leverage optimizations eliminate work entirely. Caching eliminates all work. Predicate pushdown eliminates 90% of I/O. CPU optimization makes remaining work 2x faster.
Optimize to eliminate work before optimizing to do work faster.
Summary
Database optimization follows a hierarchy: eliminate work, reduce I/O, then optimize CPU.
Predicate pushdown reduces I/O by filtering at storage layer. Row group pruning skips data chunks based on statistics. Result caching eliminates repeated computation. Async I/O parallelizes multi-partition reads. Indexes accelerate lookups. SIMD requires columnar data layout.
The highest gains come from I/O reduction, not CPU optimization. A 10x I/O reduction beats a 2x CPU speedup.
Part 2 covers a real case study: replacing DuckDB with Rust and predicate pushdown for 10.4x speedup.