Quick Reference

| Technique | Impact | Effort | Priority | When to Use |
|---|---|---|---|---|
| Predicate Pushdown | 10x+ | Medium | High | Selective queries |
| Row Group Pruning | 5-10x | Low | High | Columnar storage |
| Result Caching | 100-1000x | Low | High | Repeated queries |
| Async I/O | 2-3x | Medium | Medium | Multi-partition reads |
| Database Indexes | 10-100x | Low | High | Point queries |
| SIMD | Variable | High | Low | Hot CPU paths |

Database optimization follows a hierarchy. I/O reduction beats CPU optimization. Caching beats computation. Filtering early beats filtering late.

This guide covers six optimization techniques with performance impact, code examples, and when to apply each. Part 2 covers a real case study replacing DuckDB.

Predicate Pushdown

Predicate pushdown filters data at the storage layer instead of in memory. For columnar databases, this reduces I/O by 70-90% for selective queries.

The technique uses two levels: row group pruning (skip entire chunks based on statistics) and record batch filtering (filter before deserializing). Both levels work together to minimize data movement.

Performance Impact

| Filter Selectivity | Data Kept | I/O Reduction | Expected Speedup |
|---|---|---|---|
| High (1% match) | 1% | 90-95% | 10-20x |
| Medium (30% match) | 30% | 60-70% | 3-5x |
| Low (70% match) | 70% | 20-30% | 1.5-2x |

Implementation

pub fn filter_to_arrow_predicate(
    batch: &RecordBatch,
    filter: &QueryFilter,
) -> Result<BooleanArray> {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            // Threshold arrives as fixed-point i64 (scaled by 10,000)
            let threshold = *threshold_i64 as f64 / 10000.0;

            // Column 4 is `close`. downcast_ref returns an Option, so a
            // failed downcast must be mapped to an error rather than `?`-ed.
            let close_array = batch.column(4)
                .as_any()
                .downcast_ref::<Float64Array>()
                .ok_or_else(|| ArrowError::CastError(
                    "close column is not Float64".to_string(),
                ))?;

            let threshold_array = Float64Array::from(
                vec![threshold; close_array.len()]
            );

            // Arrow compute kernel - SIMD accelerated
            gt_eq(close_array, &threshold_array)
        }
    }
}

The code translates query filters to Arrow compute predicates. Arrow's gt_eq kernel uses SIMD instructions automatically. No manual vectorization required.
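
The resulting BooleanArray can then be handed to Arrow's filter kernel to materialize only the matching rows. A minimal sketch (apply_filter is a hypothetical wrapper; it assumes the filter_to_arrow_predicate function above and arrow's filter_record_batch kernel):

use arrow::compute::filter_record_batch;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn apply_filter(
    batch: &RecordBatch,
    filter: &QueryFilter,
) -> Result<RecordBatch, ArrowError> {
    // Build the predicate, then keep only rows where it is true;
    // all columns stay aligned because the kernel filters them together
    let predicate = filter_to_arrow_predicate(batch, filter)?;
    filter_record_batch(batch, &predicate)
}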

When to Use

Use predicate pushdown when:

  • Storage format is columnar (Parquet, Arrow, ORC)
  • Queries include WHERE clauses
  • Filters are selective (high cardinality columns)
  • I/O is the bottleneck

Don't use when:

  • Queries scan full tables
  • Storage format lacks statistics
  • Filters have low selectivity (>80% of rows match)

Trade-offs

Predicate pushdown requires storage statistics. Parquet and ORC write min/max stats by default. Legacy formats may not support this.

False positives are possible. A row group whose statistics report the range [440, 460] must still be read for the filter "close >= 450" even if only a handful of its rows actually satisfy it. The extra I/O is wasted, but the results remain correct.

Implementation complexity is medium. Requires understanding of storage format internals and Arrow compute kernels.

Row Group Pruning

Row group pruning skips entire data chunks based on metadata statistics. For a query with filter "close >= 450", any row group with max_close < 450 can be skipped entirely.

This optimization is specific to formats that partition data into groups (Parquet row groups, ORC stripes). Metadata is small (~10KB) compared to data (~1MB per group).

How It Works

Parquet files contain metadata with statistics per row group:

Parquet File
├── Metadata (10KB)
│   ├── Row Group 0: {min_close: 440.0, max_close: 445.0}
│   ├── Row Group 1: {min_close: 445.0, max_close: 450.0}
│   └── Row Group 2: {min_close: 450.0, max_close: 455.0}
├── Row Group 0 (100KB compressed)
├── Row Group 1 (100KB compressed)
└── Row Group 2 (100KB compressed)

For filter "close >= 450", the pruning logic is:

| Row Group | min_close | max_close | Action | Reason |
|---|---|---|---|---|
| 0 | 440.0 | 445.0 | SKIP | max < 450 (all rows fail) |
| 1 | 445.0 | 450.0 | READ | overlaps range |
| 2 | 450.0 | 455.0 | READ | all rows match |

Row Group 0 is never read from disk. I/O saved: 33%.

Implementation

fn filter_matches_row_group(
    row_group: &RowGroupMetaData,
    filter: &QueryFilter,
) -> bool {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            let threshold = *threshold_i64 as f64 / 10000.0;

            if let Some(stats) = row_group.column(4).statistics() {
                if let Some(max_bytes) = stats.max_bytes_opt() {
                    if let Ok(max_close) = parse_f64_from_bytes(max_bytes) {
                        // Skip if max < threshold
                        return max_close >= threshold;
                    }
                }
            }

            // No stats = assume match (safe but not optimal)
            true
        }
    }
}

Statistics parsing uses little-endian byte order. If statistics are missing, the function returns true to avoid skipping data incorrectly.
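
The parse_f64_from_bytes helper is not shown above. A minimal sketch of what it might look like, assuming the Parquet statistics bytes hold an IEEE-754 double in little-endian order:

fn parse_f64_from_bytes(bytes: &[u8]) -> Result<f64, String> {
    // A Parquet DOUBLE statistic is 8 little-endian bytes
    let raw: [u8; 8] = bytes
        .get(..8)
        .ok_or_else(|| format!("expected 8 bytes, got {}", bytes.len()))?
        .try_into()
        .map_err(|e| format!("invalid f64 bytes: {e}"))?;
    Ok(f64::from_le_bytes(raw))
}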

When to Use

Use row group pruning when:

  • Storage format has per-group statistics (Parquet, ORC)
  • Data is sorted or clustered by filter columns
  • Row groups are large enough (>10K rows) for overhead to be worthwhile

Effectiveness depends on data distribution. If prices are randomly distributed across row groups, pruning helps less. If data is time-ordered and prices trend, pruning helps more.
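
To put the check to work, a minimal sketch of a reader that skips pruned groups, reusing filter_matches_row_group from above (read_pruned is a hypothetical wrapper; it assumes the parquet crate's Arrow reader and anyhow for error handling):

use std::fs::File;

use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn read_pruned(path: &str, filter: &QueryFilter) -> anyhow::Result<Vec<RecordBatch>> {
    let file = File::open(path)?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Keep only the row groups whose statistics can possibly satisfy the filter
    let keep: Vec<usize> = builder
        .metadata()
        .row_groups()
        .iter()
        .enumerate()
        .filter(|(_, rg)| filter_matches_row_group(rg, filter))
        .map(|(i, _)| i)
        .collect();

    // Pruned groups are never read from disk
    let reader = builder.with_row_groups(keep).build()?;
    Ok(reader.collect::<Result<Vec<_>, _>>()?)
}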

Result Caching

Result caching stores query results in memory. For repeated queries, this eliminates all I/O and computation. A cache hit is 50-100x faster than a disk read.

The cache key must include everything that affects results: symbol, time range, filters, aggregation. Hash the full query object for the key.
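
A minimal sketch of such a key, assuming the lru crate and hypothetical Query and Bar types (stand-ins for the query object and row type used in the snippets below); the important part is that every field that affects results participates in Hash, PartialEq, and Eq:

use std::num::NonZeroUsize;

use lru::LruCache;

// Hypothetical query object: every field that changes the result is part of the key
#[derive(Clone, Hash, PartialEq, Eq)]
struct Query {
    symbol: String,
    start_ts: i64,
    end_ts: i64,
    min_price: Option<i64>, // fixed-point filter, as in the pushdown examples
    aggregation: Option<String>,
}

// Hypothetical cached row type
#[derive(Clone)]
struct Bar { /* timestamp, open, high, low, close, volume */ }

fn new_query_cache() -> LruCache<Query, Vec<Bar>> {
    // Bound the cache at 1,024 distinct queries; tune to available memory
    LruCache::new(NonZeroUsize::new(1024).unwrap())
}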

Performance Impact

| Metric | Without Cache | With Cache (80% hit) | Speedup |
|---|---|---|---|
| P50 latency | 1.5ms | 0.3ms | 5x |
| P99 latency | 5ms | 0.4ms | 12.5x |
| Throughput | 667 qps | 3,333 qps | 5x |

Cache hit rate determines overall speedup. 80% hit rate is typical for dashboard workloads. Interactive analytics often exceeds 90%.

Implementation

// Check cache first
if let Some(cached_bars) = self.cache.borrow_mut().get(query) {
    let execution_time_us = start.elapsed().as_micros() as u64;
    return Ok(QueryResult {
        bars: cached_bars.clone(),
        execution_time_us,
        from_cache: true,
    });
}

// Execute query
let result = storage.read_with_filters(...)?;

// Cache result
self.cache.borrow_mut().put(query.clone(), result.clone());

The implementation uses LRU eviction: when the cache is full, the least recently used entries are evicted. A TTL (time-to-live) can be added for data that changes frequently, as sketched below.
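
A minimal sketch of a TTL on top of the LRU, reusing the hypothetical Query and Bar types from earlier and storing the insertion time next to the value; expired entries are simply treated as misses:

use std::time::{Duration, Instant};

use lru::LruCache;

// Assumption: 5 seconds of staleness is acceptable for this workload
const CACHE_TTL: Duration = Duration::from_secs(5);

fn get_fresh(
    cache: &mut LruCache<Query, (Instant, Vec<Bar>)>,
    query: &Query,
) -> Option<Vec<Bar>> {
    match cache.get(query) {
        // Entry exists and is still within its TTL: serve it
        Some((inserted_at, bars)) if inserted_at.elapsed() < CACHE_TTL => Some(bars.clone()),
        // Missing or expired: treat as a miss (the stale entry ages out via LRU)
        _ => None,
    }
}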

When to Use

Use caching when:

  • Queries repeat (dashboards, APIs)
  • Data is read-heavy (10:1 read:write ratio or higher)
  • Results fit in memory
  • Stale data is acceptable for short periods

Don't use when:

  • Every query is unique
  • Data changes frequently (high write rate)
  • Memory is limited
  • Consistency is critical

Cache Invalidation

Cache invalidation is the hard part. Three strategies:

| Strategy | Accuracy | Complexity | Use Case |
|---|---|---|---|
| TTL (time-based) | Low | Low | Infrequent writes |
| Write-through | High | Medium | Moderate write rate |
| Version stamps | High | High | Complex dependencies |

TTL invalidation expires entries after N seconds. Simple but may serve stale data.

Write-through invalidation clears affected entries on write. Requires tracking which queries touch which data.

Version stamps attach a version number to the data. Cache entries record the version under which they were created; the version is incremented on write, and lookups treat entries with an older version as misses.
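
A minimal sketch of version stamps, assuming a single global data version held in an AtomicU64 (a real system might keep one per symbol or per partition) and the hypothetical Query type from earlier; the version becomes part of the cache key, so a write instantly invalidates every previously cached query:

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical global version counter for the dataset
static DATA_VERSION: AtomicU64 = AtomicU64::new(0);

// Bump the version on every write; old cache entries can never match again
fn on_write() {
    DATA_VERSION.fetch_add(1, Ordering::Release);
}

// The cache key pairs the query with the version observed at lookup time
#[derive(Clone, Hash, PartialEq, Eq)]
struct VersionedKey {
    query: Query,
    version: u64,
}

fn make_key(query: Query) -> VersionedKey {
    VersionedKey {
        query,
        version: DATA_VERSION.load(Ordering::Acquire),
    }
}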

Async I/O for Multi-Partition Reads

Async I/O parallelizes reads from multiple partitions. For time-series databases partitioned by day, a query spanning 5 days can read all 5 partitions concurrently.

The speedup depends on storage parallelism. SSDs support high queue depth. HDDs benefit less from parallelism.

Speedup Analysis

| Partitions | Sequential | Parallel | Speedup |
|---|---|---|---|
| 1 | 1.0ms | 1.0ms | 1.0x |
| 2 | 2.0ms | 1.1ms | 1.8x |
| 5 | 5.0ms | 1.8ms | 2.8x |
| 10 | 10.0ms | 3.2ms | 3.1x |

Speedup saturates around 3x. Synchronization overhead and I/O contention limit scaling. Network-attached storage may scale better than local disks.

Implementation

// Launch concurrent reads
let futures: Vec<_> = partitions
    .iter()
    .map(|partition| {
        let storage = storage.clone();
        let symbol = symbol.to_string();
        let partition = partition.to_string();

        async move {
            storage.read_async(&symbol, &partition).await
        }
    })
    .collect();

// Wait for all reads
let results = futures::future::join_all(futures).await;

// Merge results
let mut all_bars = Vec::new();
for result in results {
    all_bars.extend(result?);
}

Inside read_async, each blocking partition read is handed to tokio::task::spawn_blocking, so it runs on a dedicated thread pool instead of stalling the async runtime.
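
read_async itself is not shown above; a minimal sketch of one way to write it, assuming a blocking Parquet read behind a hypothetical Storage::read_partition_blocking method, the Bar type from earlier, and anyhow for errors:

use std::sync::Arc;

async fn read_async(
    storage: Arc<Storage>,
    symbol: String,
    partition: String,
) -> anyhow::Result<Vec<Bar>> {
    // Move the blocking file read onto tokio's blocking thread pool so the
    // async runtime's worker threads stay free for other tasks
    let bars = tokio::task::spawn_blocking(move || {
        storage.read_partition_blocking(&symbol, &partition)
    })
    .await??; // outer ?: the task panicked or was cancelled; inner ?: the read failed
    Ok(bars)
}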

When to Use

Use async I/O when:

  • Queries span multiple partitions
  • Storage supports parallelism (SSD, distributed storage)
  • Partitions are independent (no cross-partition dependencies)
  • Latency matters more than CPU usage

Don't use when:

  • Single partition queries dominate
  • Storage is slow (HDD) or saturated
  • Thread overhead exceeds I/O time

Async I/O adds complexity. For single-partition queries, synchronous reads are simpler and faster.

Database Indexes

Indexes are lookup structures that accelerate queries. B-tree indexes support range queries. Hash indexes support equality. Bitmap indexes support low-cardinality columns.

The trade-off is write speed vs read speed. Every insert must update all indexes. For write-heavy workloads, indexes may slow overall throughput.

Index Type Comparison

| Index Type | Best For | Write Overhead | Read Speedup |
|---|---|---|---|
| B-tree | Range queries | 20-30% | 10-100x |
| Hash | Equality | 10-20% | 50-200x |
| Bitmap | Low cardinality | 5-10% | 20-50x |
| Covering | SELECT subset | 30-50% | 100-1000x |

B-tree indexes are general-purpose. They support <, >, =, BETWEEN. Write overhead is moderate.

Hash indexes are faster for exact match. They don't support range queries. Write overhead is lower than B-tree.

Bitmap indexes compress well for columns with few distinct values (country, status, category). They support AND/OR operations efficiently.

Covering indexes include all columns in the SELECT clause. Queries never touch the base table. Write overhead is high because the index duplicates data.
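
As a self-contained illustration in Rust, a minimal sketch using SQLite through the rusqlite crate (whose indexes are all B-trees; names and schema are hypothetical): it creates a composite index and uses EXPLAIN QUERY PLAN to confirm the planner actually picks it up.

use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE bars (symbol TEXT, ts INTEGER, close REAL);
         CREATE INDEX idx_bars_symbol_ts ON bars (symbol, ts);",
    )?;

    // EXPLAIN QUERY PLAN reports whether SQLite chose the index;
    // column 3 is the human-readable detail string
    let detail: String = conn.query_row(
        "EXPLAIN QUERY PLAN \
         SELECT close FROM bars WHERE symbol = ?1 AND ts >= ?2",
        ("AAPL", 0_i64),
        |row| row.get(3),
    )?;
    println!("{detail}"); // e.g. "SEARCH bars USING INDEX idx_bars_symbol_ts ..."
    Ok(())
}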

When to Index

Index when:

  • Query is slow due to full table scan
  • Filter selectivity is high (<10% of rows match)
  • Table is large (>100K rows)
  • Read:write ratio is high (>10:1)

Don't index when:

  • Table is small (<10K rows)
  • Filter selectivity is low (>30% of rows match)
  • Write rate is high (index maintenance overhead exceeds query savings)

Trade-offs

Indexes consume disk space. A B-tree index typically takes roughly 50% of the size of the indexed column. Multiple indexes multiply the storage cost.

Indexes must be maintained. INSERT, UPDATE, DELETE operations update all indexes. For bulk loads, drop indexes, load data, rebuild indexes.

Index statistics must be current. Stale statistics lead to poor query plans. Run ANALYZE after significant data changes.

When SIMD Fails

SIMD (Single Instruction, Multiple Data) processes 2-8 values per CPU instruction. Modern CPUs support AVX2 (256-bit) and AVX-512 (512-bit) instructions.

SIMD requires specific data layout. Struct-of-Arrays (SoA) layout works. Array-of-Structs (AoS) layout doesn't.

Data Layout Compatibility

| Data Layout | SIMD Compatible | Why |
|---|---|---|
| Struct-of-Arrays | ✅ Yes | Contiguous fields |
| Array-of-Structs | ❌ No | Scattered fields |
| Columnar (Arrow) | ✅ Yes | Already columnar |

For time-series data stored as Vec<OHLCV>, each OHLCV is a struct:

struct OHLCV {
    timestamp: i64,
    open: f64,
    high: f64,
    low: f64,
    close: f64,
    volume: u64,
}

This is AoS layout. To SIMD-process the close field, the code must gather one field out of every six from each struct (skipping timestamp, open, high, low, and volume), a strided access with no contiguous runs. The extraction overhead exceeds the SIMD benefit, as the sketch below shows.
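
A sketch of the two paths (function names hypothetical): the scalar loop reads the field in place, while the SIMD attempt first gathers every close into a temporary buffer, and that gather alone is a full pass over the data.

// Scalar: read `close` in place, striding over whole OHLCV structs
fn sum_close_scalar(bars: &[OHLCV]) -> f64 {
    bars.iter().map(|b| b.close).sum()
}

// SIMD attempt: gather into a contiguous buffer first, then vectorize.
// The gather copies the data and dominates the runtime, which is why the
// benchmarks below show a regression instead of a speedup.
fn sum_close_simd_attempt(bars: &[OHLCV]) -> f64 {
    let closes: Vec<f64> = bars.iter().map(|b| b.close).collect();
    closes.iter().sum() // stand-in for the vectorized horizontal sum
}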

Why It Failed

Attempted SIMD aggregation on AoS data regressed every operation measured, down to 0.54-0.60x of scalar throughput:

| Operation | Scalar (ns) | SIMD (ns) | Result |
|---|---|---|---|
| VWAP | 1,340 | 2,486 | 0.54x (86% regression) |
| Average | 412 | 687 | 0.60x (67% regression) |
| Sum | 339 | 566 | 0.60x (67% regression) |

Field extraction dominated runtime. Scalar code accessed struct fields directly. SIMD code copied fields to temporary arrays.

The Fix

Use columnar storage (Parquet/Arrow). Data is already in SoA layout. Arrow compute kernels use SIMD automatically.

// Arrow compute - SIMD automatic. downcast_ref returns an Option,
// so a failed downcast is mapped to an error instead of `?`-ed directly.
let close_array = batch.column(4)
    .as_any()
    .downcast_ref::<Float64Array>()
    .ok_or_else(|| ArrowError::CastError("close column is not Float64".to_string()))?;

// gt_eq uses SIMD internally
let predicate = arrow::compute::gt_eq(close_array, &threshold_array)?;

No manual SIMD code required. Arrow handles it at the library level.

Lesson Learned

Data layout matters more than instruction optimization. Columnar storage enables SIMD without manual work. Attempting SIMD on AoS layout wastes time and regresses performance.

Optimization Decision Tree

Start by profiling. Find the bottleneck before optimizing.

Is query slow?
├─ Yes, I/O-bound?
│  ├─ Selective filters? → Predicate Pushdown (10x)
│  ├─ Multiple partitions? → Async I/O (3x)
│  └─ Repeated queries? → Caching (100x+)
└─ Yes, CPU-bound?
   ├─ Hot aggregation path? → SIMD (if columnar)
   └─ Complex computation? → Algorithm optimization

For I/O-bound queries, optimize I/O first. CPU optimization won't help.

For CPU-bound queries, verify the data layout supports optimization. Don't assume SIMD helps without measuring.

Priority Order

  1. Profile to find the bottleneck (measure, don't guess)
  2. Cache if queries repeat (100-1000x gain, low effort)
  3. Predicate pushdown if filters are selective (10x gain, medium effort)
  4. Async I/O if multi-partition (3x gain, medium effort)
  5. Indexes if point queries (10-100x gain, low effort)
  6. Only then consider CPU optimizations

The highest-leverage optimizations eliminate work entirely. Caching eliminates all work. Predicate pushdown eliminates 90% of I/O. CPU optimization makes remaining work 2x faster.

Optimize to eliminate work before optimizing to do work faster.

Summary

Database optimization follows a hierarchy: eliminate work, reduce I/O, then optimize CPU.

Predicate pushdown reduces I/O by filtering at storage layer. Row group pruning skips data chunks based on statistics. Result caching eliminates repeated computation. Async I/O parallelizes multi-partition reads. Indexes accelerate lookups. SIMD requires columnar data layout.

The highest gains come from I/O reduction, not CPU optimization. A 10x I/O reduction beats a 2x CPU speedup.

Part 2 covers a real case study: replacing DuckDB with Rust and predicate pushdown for 10.4x speedup.