I Thought Redis Was Our Bottleneck (I Was Completely Wrong)

We blamed Redis for our Black Friday meltdown. Flame graphs told a different story. Here’s where our feature store latency actually came from.

Last year, I watched our feature store completely melt down during Black Friday traffic. P99 latency spiked dramatically, climbing well into the hundreds of milliseconds. Models that depended on real-time fraud signals started timing out left and right. We lost a painful amount in blocked legitimate transactions before we could roll back to degraded fallbacks. The exact cost of an incident like this depends on your transaction volume and approval rates, but for us it was enough to trigger a full post-mortem and some very uncomfortable meetings.

That incident fundamentally changed how I think about feature store latency best practices for real-time ML models. Turns out everything I thought I knew was wrong. Or at least dangerously incomplete.

Here’s what this playbook actually covers: the counterintuitive optimizations that moved the needle when conventional wisdom failed us. I’ll share flame graphs, benchmark data, and the specific Feast configurations that got our production system to sub-10ms p99 latency.

This isn’t theory. It’s the playbook my team built after three major production firefights, countless late-night debugging sessions, and some genuinely humbling moments where I had to admit my assumptions were garbage. (And trust me, that last part stung.)

If you’re fighting latency in your feature serving layer, I really hope this saves you some of the pain we went through.

Anatomy of Feature Serving Latency: Profiling the Actual Bottlenecks

Before optimizing anything, you need to know where your time actually goes. Sounds obvious, right? But I’ve watched teams spend weeks optimizing the wrong layer because they just assumed the database was the bottleneck.

When our team profiled the feature serving path with py-spy flame graphs, the results honestly surprised us:

# What we expected to see eating our latency budget:
# - Redis network round trips: 60%
# - Feature transformation: 30%
# - Serialization: 10%

# What we actually found (in our specific system - yours will vary):
# - Feature transformation (runtime joins): largest contributor
# - Connection pool exhaustion/waiting: significant
# - Redis round trips: smaller than expected
# - Serialization/deserialization: minimal

So what ate most of our latency? Transformations happening at serving time. Joins across feature groups. Type coercions. Null handling logic that triggered on every single request. Your breakdown will definitely differ based on your architecture, workload, and configuration, but the point stands: our assumptions about the bottleneck were completely wrong.

And the connection pool issue? Sneaky. Under normal load, plenty of connections sat available. But during traffic spikes? Threads waited around for connections to free up. Our dashboards showed this as “Redis latency,” but Redis wasn’t actually the problem.
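If you want to catch this in your own stack, one trick is to time connection acquisition separately from the command itself. Here’s a minimal sketch using redis-py’s BlockingConnectionPool; the host, pool size, and key are placeholders rather than our production values.

# Minimal sketch: separate "waiting for a connection" from actual Redis command time.
# Assumes redis-py; host, pool size, and key format are placeholders.
import time
import redis

pool = redis.BlockingConnectionPool(host="localhost", port=6379, max_connections=50, timeout=5)

def timed_get(key):
    t0 = time.perf_counter()
    conn = pool.get_connection("GET")  # blocks here when the pool is exhausted
    pool_wait_ms = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    try:
        conn.send_command("GET", key)
        value = conn.read_response()
    finally:
        pool.release(conn)
    redis_ms = (time.perf_counter() - t1) * 1000

    return value, pool_wait_ms, redis_ms

If the pool-wait number dominates during spikes, your “Redis latency” graph is lying to you the same way ours was.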

Profile first. Always. Your bottleneck probably isn’t where you think it is.

Backend Showdown: Redis vs DynamoDB vs Custom Solutions

The Redis vs. DynamoDB feature store backend debate comes up constantly. After running both in production for over a year, here’s what I’ve actually learned.

Our benchmark methodology:

Testing used realistic feature payloads, not synthetic benchmarks. A typical entity lookup retrieves dozens of features across multiple feature groups for a single user ID. Measurements were taken in application code, not at the database level, because that’s what your models actually experience.
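For context, a stripped-down version of that measurement loop looks something like this; fetch_features is a stand-in for whatever your serving path actually calls.

# Sketch of a client-side benchmark: time the full retrieval call, then report
# tail percentiles. fetch_features() is a placeholder for your serving-path call.
import time
import statistics

def benchmark(fetch_features, entity_ids, iterations=10_000):
    samples_ms = []
    for i in range(iterations):
        entity_id = entity_ids[i % len(entity_ids)]
        t0 = time.perf_counter()
        fetch_features(entity_id)
        samples_ms.append((time.perf_counter() - t0) * 1000)

    cuts = statistics.quantiles(samples_ms, n=1000)
    return {"p50": cuts[499], "p99": cuts[989], "p99.9": cuts[998]}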

| Metric | Redis Cluster | DynamoDB | Custom RocksDB |
|---|---|---|---|
| P50 Latency | ~1-2ms | ~3-4ms | <1ms |
| P99 Latency | ~4-5ms | ~10-15ms | ~3-4ms |
| P99.9 Latency | <10ms | High variance | <7ms |
| Relative Cost | Low | Higher | Lowest |

Note: These are representative ranges from our testing. Your results will vary based on configuration, payload size, and infrastructure.

Honestly? DynamoDB’s tail latency killed us. Those P99.9 spikes happen exactly when you don’t want them, during high-traffic periods when DynamoDB partitions get hot.


Redis won for our use case. But here’s the thing: if you’re optimizing operational latency in feature stores for online inference, the backend choice matters less than you might think. Bigger wins came from what happened before the request even hit the backend.

The Streaming-Batch Hybrid Pattern

Pure streaming architectures for feature stores are expensive and operationally complex. Pure batch means stale features. So what do you do? The real-time data freshness problem has a middle path that works surprisingly well.

My team calls it “tiered freshness.” Different features have different staleness tolerances, and that’s actually okay.

# Tiered freshness configuration
feature_groups = {
    "user_profile": {
        "update_frequency": "daily",
        "staleness_tolerance": "24h",
        "pipeline": "batch"
    },
    "user_session": {
        "update_frequency": "5min", 
        "staleness_tolerance": "10min",
        "pipeline": "micro_batch"
    },
    "fraud_signals": {
        "update_frequency": "realtime",
        "staleness_tolerance": "30s",
        "pipeline": "streaming"
    }
}

Here’s what we realized: only a small fraction of our features actually needed real-time freshness. Everything was streaming because nobody had done the analysis to figure out what actually mattered. Sound familiar?

This hybrid approach significantly cut our streaming infrastructure costs while maintaining the freshness guarantees that actually impacted model performance. Understanding how to reduce data freshness lag in feature stores starts with knowing which features need fresh data in the first place.
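If you haven’t done that analysis, a lightweight way to start is to log the gap between each feature group’s event timestamp and the moment it gets read at serving, then compare the tail of that distribution to the tolerance you think the model needs. A rough sketch; the function and field names here are illustrative, not from any particular framework:

# Rough sketch: track observed staleness per feature group and flag the ones
# whose current pipeline can't meet their tolerance. Names are illustrative;
# event_timestamp is assumed to be a timezone-aware datetime.
from collections import defaultdict
from datetime import datetime, timezone

staleness_log = defaultdict(list)  # feature_group -> staleness samples in seconds

def record_staleness(feature_group, event_timestamp):
    now = datetime.now(timezone.utc)
    staleness_log[feature_group].append((now - event_timestamp).total_seconds())

def exceeds_tolerance(feature_group, tolerance_seconds):
    samples = sorted(staleness_log[feature_group])
    if not samples:
        return False
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return p95 > tolerance_seconds  # candidate for a faster pipeline tier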

Pre-computation Strategies: Moving Work Out of the Serving Path

This is where our team got the biggest wins. Conversations about batch vs. streaming feature updates and their latency tradeoffs usually focus on freshness. But there’s another dimension people miss: where do transformations happen?

Serving-time transformations we eliminated:

  1. Feature group joins moved to materialization. Instead of joining user features with session features at serving time, the joined view gets materialized ahead of time.
  2. Null imputation became a materialization step. Default values get written to the store, not computed per-request.
  3. Type coercions are handled during write. Everything in our online store is already in the exact format the model expects.

# Before: Transformation at serving time
def get_features(user_id):
    user_features = redis.get(f"user:{user_id}")
    session_features = redis.get(f"session:{user_id}")
    
    # These operations happened on EVERY request
    combined = join_features(user_features, session_features)
    imputed = handle_nulls(combined, default_values)
    typed = coerce_types(imputed, model_schema)
    return typed

# After: Pre-computed at materialization
def get_features(user_id):
    # Single read, already joined/imputed/typed
    return redis.get(f"serving:{user_id}")

This single change cut our P99 by more than half. Storage space and materialization compute increased, but serving-time latency dropped dramatically. For us, that tradeoff made total sense.
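The other half of that change is the materialization job, which now does the work the serving path used to do. Here’s a simplified sketch; join_features, handle_nulls, and coerce_types are the same hypothetical helpers from the “before” snippet, just moved to write time.

# Sketch of the write side: join, impute, and coerce once per entity at
# materialization time, so serving is a single read of "serving:{user_id}".
# join_features, handle_nulls, coerce_types, default_values, and model_schema
# are the same helpers referenced in the "before" snippet above.
def materialize_serving_view(user_ids, redis, ttl_seconds=3600):
    pipe = redis.pipeline()
    for user_id in user_ids:
        user_features = redis.get(f"user:{user_id}")
        session_features = redis.get(f"session:{user_id}")

        combined = join_features(user_features, session_features)
        imputed = handle_nulls(combined, default_values)
        typed = coerce_types(imputed, model_schema)

        pipe.set(f"serving:{user_id}", typed, ex=ttl_seconds)  # pre-built record
    pipe.execute()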

TTL and Caching Configuration That Actually Works

Most feature store TTL configuration guides miss the real problem: thundering herds.

When TTLs expire synchronously across popular entities, you get a stampede of cache misses hitting your materialization layer all at once. Learning this one hurt.

Our TTL strategy:

# Add jitter to prevent synchronized expiration
import random

def set_feature_ttl(base_ttl_seconds):
    jitter = random.uniform(0.8, 1.2)
    return int(base_ttl_seconds * jitter)

# For hot entities, use longer TTLs with background refresh
hot_entity_config = {
    "ttl": 3600,  # 1 hour nominal
    "soft_ttl": 2700,  # Trigger background refresh at 45min
    "jitter_factor": 0.2
}

Feature store caching strategies for low-latency serving need to account for your actual traffic patterns. Analysis of our entity access distribution showed that a tiny fraction of entities account for the vast majority of feature retrievals, classic power-law stuff. Hot entities get special treatment: longer TTLs, proactive background refresh, and redundant caching layers.

Cold entities? Shorter TTLs work fine, and occasional cache misses get accepted as the cost of doing business. The latency impact is minimal because these requests are pretty rare anyway.
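The soft_ttl in that config deserves a quick illustration. The idea is to keep serving the cached value but kick off a background refresh once it crosses the soft threshold, so hot entities never actually hit the hard TTL. A minimal sketch, assuming each cached record stores a written_at epoch timestamp and a hypothetical refresh_features() that re-materializes it:

# Minimal soft-TTL sketch: serve the cached value immediately; refresh in the
# background once it's older than the soft threshold. refresh_features() is a
# hypothetical function that rebuilds and rewrites the record.
import json
import time
import threading

HARD_TTL = 3600  # seconds, matches "ttl" above
SOFT_TTL = 2700  # seconds, matches "soft_ttl" above

def get_with_soft_ttl(redis, key):
    raw = redis.get(key)
    if raw is None:
        # Hard miss: rebuild synchronously (should be rare for hot entities)
        return refresh_features(redis, key, HARD_TTL)

    record = json.loads(raw)
    if time.time() - record["written_at"] > SOFT_TTL:
        # Serve the slightly stale value, refresh asynchronously
        threading.Thread(target=refresh_features, args=(redis, key, HARD_TTL), daemon=True).start()
    return record["features"]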

Feast Configuration: Getting to Sub-10ms P99

Let me share the specific settings that actually worked for us. If you want to configure Feast for sub-10ms feature serving, these are the knobs that matter.

# feast_config.yaml
online_store:
  type: redis
  connection_string: "redis-cluster.internal:6379"
  
  # Connection pooling - this was huge for us
  redis_pool_size: 50
  redis_pool_timeout_ms: 100
  
  # Batch reads instead of individual gets
  enable_batch_reads: true
  batch_size: 100

# Feature retrieval settings  
feature_server:
  # Pre-load feature views on startup
  eager_load: true
  
  # Disable runtime type checking (we validate at write time)
  skip_type_validation: true
  
  # Use binary serialization
  serialization_format: "arrow"

# Python-side optimizations
from feast import FeatureStore
import asyncio

store = FeatureStore(repo_path=".")

# Use async retrieval for concurrent entity lookups
async def get_features_batch(entity_ids: list):
    tasks = [
        store.get_online_features_async(
            features=feature_refs,
            entity_rows=[{"user_id": eid}]
        )
        for eid in entity_ids
    ]
    return await asyncio.gather(*tasks)

The Feast vs. Tecton latency comparison honestly depends heavily on your specific workload. In our benchmarks, properly tuned Feast achieved sub-10ms P99. Tecton’s managed infrastructure showed similar numbers out of the box but with less tuning flexibility.

When tuning feature store performance for high-throughput systems, the biggest Feast-specific wins came from:

  • Connection pool sizing (too small means waiting; too large means memory overhead)
  • Batch reads for multi-entity lookups (see the sketch after this list)
  • Arrow serialization instead of JSON
  • Disabling runtime validation that wasn’t actually needed
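On that batch-read point: the async example above fans out one call per entity, which buys concurrency but still pays per-call overhead. For multi-entity lookups, a single get_online_features call with a list of entity_rows is usually cheaper. Roughly, reusing store and feature_refs from the earlier snippet:

# Sketch: one batched read for many entities instead of one call per entity.
# store and feature_refs are the same objects used in the async example above.
def get_features_for_users(user_ids):
    entity_rows = [{"user_id": uid} for uid in user_ids]
    response = store.get_online_features(
        features=feature_refs,
        entity_rows=entity_rows,  # all entities fetched in a single call
    )
    return response.to_dict()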

After three production incidents and eighteen months of optimization, here’s the checklist I wish I’d had at the start.

Profile before optimizing:

  •  Generate flame graphs for your actual serving path
  •  Identify whether latency is network, transformation, or waiting
  •  Measure at P99.9, not just P50

Backend selection:

  •  Test with realistic payloads, not synthetic benchmarks
  •  Evaluate tail latency under load, not just steady state
  •  Consider Redis unless you’ve got specific DynamoDB requirements

Reduce serving-time work:

  •  Move joins to materialization
  •  Pre-compute null imputation and type coercions
  •  Audit every transformation in your serving path

Cache configuration:

  •  Add TTL jitter to prevent thundering herds
  •  Implement tiered TTLs based on entity popularity
  •  Use background refresh for hot entities

Know when to stop:

  •  Define your actual latency requirements
  •  Accept “good enough” freshness tradeoffs where they make sense
  •  Remember that strategies to minimize feature staleness in real-time ML environments have seriously diminishing returns

Look, you might not even need sub-10ms latency. I’ve seen teams burn months chasing latency targets that didn’t actually impact model performance or business metrics. Do the math on what latency actually means for your specific use case before you start optimizing.

But if you do need it, and you’ve hit that wall where conventional wisdom just isn’t working, I hope this playbook helps. These feature store latency best practices for real-time ML models came from real production pain. May your on-call rotations be quieter than mine were.

Author

  • Ryan Christopher

    Ryan Christopher is a seasoned Data Science Specialist with 8 years of professional experience based in Philadelphia, PA (Glen Falls Road). With a Bachelor of Science in Data Science from Penn State University (Class of 2019), Ryan combines academic rigor with practical expertise to drive data-driven decision-making and innovation.
