We Were Spending 18% of Our Infra Budget on Logs (Just Logs)
We cut ML costs by 70% in four months. Turns out 18% of our infra budget was going to logs, and most inference calls didn’t even need our fancy model.
I’m going to tell you about the worst email I’ve ever received as an advisor. It wasn’t a failed product launch. It wasn’t a security breach. It was a single line from our cloud provider: “Your estimated charges for this billing period: $47,284.”
That number? One month of inference costs for a fraud detection system at a fintech startup where I’d been helping optimize their AI stack. They had 14 months of runway left. At that burn rate, their ML infrastructure alone would chew through nearly half of it before they hit profitability targets. Something had to change. Fast.
This isn’t theoretical advice from someone who skimmed a white paper. It’s a detailed breakdown of how my team systematically slashed machine learning costs by 70% over four months. I’m talking exact techniques that worked, approaches that failed spectacularly, and the compliance landmines we nearly stepped on.
By the time you finish reading this, you’ll have a reproducible playbook for startup machine learning budget optimization. I’ll share specific numbers, the tools we actually used, and a decision framework you can steal for your own infrastructure. Fair warning, though: some of what we learned was painful.
Auditing the Damage: Finding the Real Cost Drivers
Before you can fix a problem, you need to actually see it. Sounds obvious, right? But you’d be surprised how many teams treat their ML costs as a single opaque line item, just one mysterious number on a spreadsheet that keeps getting bigger.
The first two weeks went into building what I call a “Cost Attribution Map.” Here’s how the breakdown worked:
The Tool Stack:
- AWS Cost Explorer with custom tags (retroactively tagged everything by model, feature, and team, which was tedious but worth it)
- Weights & Biases for tracking inference volume per model
- A homegrown Grafana dashboard that correlated request patterns with compute spikes
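Pulling the per-tag breakdown out of Cost Explorer is the easy part once the tags exist. Here's a minimal sketch of that query with boto3 (the `model` tag key and the date range are placeholders, not our exact setup; the tag also has to be activated as a cost-allocation tag in the billing console before it shows up in results):

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "model"}],  # per-model attribution
)

# Print spend per model tag, largest first.
for period in response["ResultsByTime"]:
    groups = sorted(
        period["Groups"],
        key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
        reverse=True,
    )
    for group in groups:
        tag = group["Keys"][0]  # e.g. "model$fraud-classifier"
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{tag}: ${amount:,.2f}")
```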
What Surfaced:
The fraud detection model accounted for 61% of total inference costs. But here’s the twist: only 23% of those inference calls were for high-value transactions where sophisticated fraud detection actually mattered. The remaining 77%? Small transactions where a much simpler rule-based system would’ve caught 94% of fraud attempts anyway.
And the second-biggest cost driver wasn’t even a model. It was logging. Full input/output payloads were shipping to our observability platform for every single inference call. That alone ate up 18% of infrastructure spend. Eighteen percent! For logs!
Reducing ML infrastructure costs starts with measurement, not optimization. You can’t compress your way out of architectural inefficiency.
Quantization in Production: INT8 Conversion Results
Once the team understood where the money was actually going, the easiest wins came first: quantization.
Neural network quantization results in the real world often exceed what papers promise, but with caveats nobody mentions in conference talks. Let me walk you through our experience.
The Setup:
- Original model: BERT-based fraud classifier, FP32 precision, ~340 MB
- Target: INT8 quantization using TensorFlow’s post-training quantization
- Hardware: AWS Inferentia instances (more on why this matters later)
The Process:
Nobody just flipped a switch. TensorFlow model compression for production environments looked like this:
- Created a representative dataset of 50,000 inference calls from production logs
- Ran post-training dynamic range quantization first (quick, dirty, helped identify problem layers)
- Moved to full INT8 quantization with calibration
- Built a shadow deployment that ran both models in parallel for two weeks
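The calibration step is where most of the work lives. For a single-input SavedModel, the TensorFlow Lite full-integer flow looks roughly like this (paths and the calibration loader are placeholders, and a real BERT SavedModel usually has several named inputs, so treat this as a sketch of the shape of the thing; serving on Inferentia additionally goes through AWS's Neuron compiler):

```python
import numpy as np
import tensorflow as tf

# Representative inputs sampled from production traffic (placeholder file).
# The converter runs these through the model to calibrate INT8 ranges.
calibration_inputs = np.load("calibration_inputs.npy")  # shape: (N, seq_len)

def representative_dataset():
    for sample in calibration_inputs[:50_000]:
        yield [sample[np.newaxis, :].astype(np.int32)]  # add batch dimension

converter = tf.lite.TFLiteConverter.from_saved_model("fraud_classifier_savedmodel/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full INT8 kernels so unsupported ops fail loudly at conversion time
# instead of silently falling back to float.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_int8_model = converter.convert()
with open("fraud_classifier_int8.tflite", "wb") as f:
    f.write(tflite_int8_model)
```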
The Results:
| Metric | FP32 Original | INT8 Quantized | Change |
|---|---|---|---|
| Model Size | 340 MB | 87 MB | -74% |
| P95 Latency | 142 ms | 89 ms | -37% |
| Throughput | 312 req/s | 891 req/s | +186% |
| Accuracy | 96.4% | 95.9% | -0.5% |
| Monthly Cost | $18,400 | $6,200 | -66% |
In a fintech context, that 0.5% accuracy drop translates to roughly $2,100/month in additional manual review costs. Still a massive net win. But nobody tells you this part: confidence thresholds need to be retuned post-quantization. The model’s probability distributions shifted slightly, which meant the “flag for review” threshold of 0.73 needed to become 0.71. Small change. Big deal if you miss it.
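Retuning that threshold doesn't require anything clever, just the paired scores and outcomes from the shadow deployment. A sketch of the approach (variable names and the recall target are illustrative):

```python
import numpy as np

def retune_review_threshold(int8_scores, labels, target_recall=0.98):
    """Find the highest flag-for-review threshold on the quantized model's
    scores that still meets the recall the FP32 threshold delivered."""
    best = 0.5
    for t in np.linspace(0.5, 0.95, 91):
        flagged = int8_scores >= t
        recall = (flagged & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        if recall >= target_recall:
            best = t  # keep raising the bar while recall still holds
    return best

# With shadow-deployment data (placeholder files):
# scores = np.load("int8_shadow_scores.npy")
# labels = np.load("shadow_labels.npy")
# print(retune_review_threshold(scores, labels))
```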


Knowledge Distillation for Fraud Detection
Quantization got us far, but model optimization for production AI systems in fintech requires more aggressive techniques for the heavy hitters.
The fraud detection ensemble was a monster: three transformer-based models with a gradient boosting meta-learner. Accurate? Absolutely. Expensive? Catastrophically.
Knowledge distillation lets us train a smaller “student” model to mimic the larger “teacher” ensemble’s behavior.
Distillation Setup:
- Teacher: 3-model ensemble (1.2B parameters total)
- Student: Single DistilBERT variant (66M parameters)
- Training data: 2M historical transactions with soft labels from the teacher
- Training time: 3 days on 4x A100 GPUs
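The objective itself is the standard Hinton-style recipe: cross-entropy on the hard labels mixed with a KL term against the teacher's temperature-softened logits. A sketch in Keras (the temperature and mixing weight here are illustrative, not the values we shipped):

```python
import tensorflow as tf

def distillation_loss(hard_labels, teacher_logits, student_logits,
                      temperature=4.0, alpha=0.5):
    """Blend cross-entropy on real labels with KL divergence against the
    teacher's softened distribution."""
    # Hard-label term: ordinary cross-entropy on the student's predictions.
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        hard_labels, tf.nn.softmax(student_logits))

    # Soft-label term: match the teacher's temperature-softened distribution.
    teacher_soft = tf.nn.softmax(teacher_logits / temperature)
    student_soft = tf.nn.softmax(student_logits / temperature)
    kl = tf.keras.losses.kl_divergence(teacher_soft, student_soft)

    # Scale the KL term by T^2, the usual correction for softened gradients.
    return alpha * ce + (1.0 - alpha) * kl * temperature ** 2
```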
The Trick That Actually Worked:
Rather than distilling the entire model’s knowledge uniformly, a tiered system emerged:
- Tier 1 (Simple transactions under $100): Student model only
- Tier 2 (Transactions $100–$2,000): Student model, escalate to teacher if confidence < 0.85
- Tier 3 (High-value transactions): Teacher ensemble always
This tiered approach meant the student handled 71% of all traffic entirely on its own. Only 8% of requests were Tier 3 high-value transactions that went straight to the teacher ensemble; the remaining 21% were Tier 2 calls that escalated to the teacher.
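The routing itself is almost boring once it's written down, which is part of why it worked. A sketch (the `student` and `teacher` callables stand in for the deployed models; the cutoffs mirror the tiers above):

```python
def score_transaction(txn, student, teacher,
                      tier2_cutoff=100.0, tier3_cutoff=2_000.0,
                      escalation_confidence=0.85):
    """Route a transaction to the cheap student model or the full teacher
    ensemble based on amount and student confidence."""
    amount = txn["amount_usd"]

    # Tier 3: high-value transactions always get the full ensemble.
    if amount >= tier3_cutoff:
        return teacher(txn)

    fraud_prob, confidence = student(txn)

    # Tier 2: mid-size transactions escalate when the student is unsure.
    if amount >= tier2_cutoff and confidence < escalation_confidence:
        return teacher(txn)

    # Tier 1 (and confident Tier 2): the student's verdict stands.
    return fraud_prob
```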
Cost reduction from this change alone: $14,200/month. Not bad for three days of training.
Infrastructure Wins Beyond the Model

AI deployment cost savings strategies often focus on model compression, but infrastructure changes frequently deliver faster ROI. Honestly, some of these felt almost too easy.
Batch Inference:
Originally, the system processed every transaction synchronously. Real-time fraud detection, right? Except that, and this was embarrassing to discover, 40% of inference calls came from batch reconciliation jobs that ran overnight. Those didn’t need sub-200 ms latency. Not even close.
Batch inference moved onto a separate pipeline using SageMaker Batch Transform. Same models, but now Spot Instances and chunks of 1,000 transactions became possible. Cost per inference dropped from $0.0003 to $0.00004.
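For reference, the nightly job definition with the SageMaker Python SDK looks roughly like this (model name, S3 paths, and instance sizing are placeholders; `MultiRecord` plus line splitting is what packs many transactions into each request):

```python
from sagemaker.transformer import Transformer

# Offline scoring for the overnight reconciliation feed, instead of
# hammering the real-time endpoint one transaction at a time.
transformer = Transformer(
    model_name="fraud-classifier-int8",
    instance_count=2,
    instance_type="ml.g4dn.xlarge",
    strategy="MultiRecord",        # pack many records into each request
    max_payload=6,                 # MB per request
    output_path="s3://example-bucket/fraud-batch-scores/",
)

transformer.transform(
    data="s3://example-bucket/reconciliation/transactions.csv",
    content_type="text/csv",
    split_type="Line",             # one transaction per line
)
transformer.wait()
```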
The Caching Layer:
This felt obvious in retrospect. A Redis cache keyed on a hash of the transaction’s feature vector caught identical or near-identical transactions (common with subscription payments) and served cached predictions.
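A sketch of the lookup path (the key derivation and TTL here are illustrative; the important part is that the feature hashing is deterministic across serving nodes):

```python
import hashlib
import json
import redis

r = redis.Redis(host="cache.internal", port=6379)
CACHE_TTL_SECONDS = 6 * 60 * 60  # recurring charges repeat well within this

def cached_predict(features: dict, predict_fn):
    """Serve a cached fraud score for repeated feature vectors, else compute."""
    # Deterministic key: hash of the canonicalised feature vector.
    key = "fraud:" + hashlib.sha256(
        json.dumps(features, sort_keys=True).encode()).hexdigest()

    cached = r.get(key)
    if cached is not None:
        return float(cached)

    score = predict_fn(features)
    r.setex(key, CACHE_TTL_SECONDS, str(score))
    return score
```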
Cache hit rate: 23%. Monthly savings: $3,400.
Why didn’t anyone think of this earlier? Sometimes you’re so focused on the complex stuff that you miss what’s right in front of you.
Spot Instances (With a Safety Net):
Non-critical inference workloads moved to Spot Instances with a fallback to on-demand. Here’s the specific configuration that worked:
- Spot Instance: ml.g4dn.xlarge (60% of capacity)
- On-demand fallback: ml.g4dn.xlarge (40% of capacity)
- Spot interruption handling: 2-minute warning triggers request draining to the on-demand pool
Average spot savings: 67% on those instances.
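The interruption handling is less exotic than it sounds: EC2 publishes a spot interruption notice on the instance metadata endpoint roughly two minutes before reclaiming the node, and a small watcher triggers the drain. A sketch (IMDSv1-style for brevity, so no session token; `drain_to_on_demand()` is a stand-in for whatever pulls the node out of rotation):

```python
import time
import requests

# 404 here means no interruption is scheduled; 200 means ~2 minutes remain.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def drain_to_on_demand():
    """Stand-in: deregister this node so new requests go to the on-demand pool."""
    print("Interruption notice received; draining requests...")

def watch_for_interruption(poll_seconds=5):
    while True:
        resp = requests.get(SPOT_ACTION_URL, timeout=2)
        if resp.status_code == 200:
            drain_to_on_demand()
            break
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch_for_interruption()
```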
What Failed: Pruning Pitfalls and Compliance Gotchas
I promised you battle-tested results, and that includes the battles we lost. Trust me, there were a few.
Pruning Disaster:
Magnitude-based pruning to remove 50% of weights from the student model seemed promising. Papers suggested pruning up to 90% with minimal accuracy loss. Papers lied. Or rather, papers don’t operate in regulated environments.
At 50% sparsity, model accuracy dropped only 1.2%. Sounds acceptable, right? But the distribution of errors shifted. The model became significantly worse at detecting a specific fraud pattern involving international wire transfers. This particular pattern happened to be explicitly called out in compliance documentation.
Rollback happened after two days. Pruning might work for your use case, but test thoroughly across error types, not just aggregate metrics. That lesson came the hard way.
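For the record, the attempt followed the standard magnitude-pruning recipe from the TensorFlow Model Optimization toolkit, roughly like this (paths and schedule values are illustrative):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Ramp sparsity up to 50% over the fine-tuning run rather than all at once.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=10_000,
)

model = tf.keras.models.load_model("student_model/")  # placeholder path
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
# Fine-tuning needs this callback to advance the pruning schedule:
# pruned_model.fit(train_ds, epochs=2,
#                  callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```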
The Compliance Gotcha:
Regulators require model explainability for any decision affecting customers. The original ensemble had SHAP integration that produced compliant explanations. The distilled student model? SHAP values weren’t directly comparable.
Three weeks. Three weeks went into building a translation layer that mapped student model explanations to the format that compliance had approved. That wasn’t in the original timeline. Budget for regulatory surprises. You’ll thank me later.
Over-Aggressive Quantization:
Someone on the team read a paper about INT4 quantization. Don’t be us. INT4 introduced enough accuracy degradation that false positive rates spiked 340%. Customer complaints followed. Reversion happened within hours.
Where Everything Landed
Let me show you the results after four months:
Before:
- Total monthly ML infrastructure cost: $47,284
- Primary drivers: Inference compute (61%), logging (18%), training jobs (21%)
After:
- Total monthly ML infrastructure cost: $13,847
- Reduction: 70.7%
Breakdown of Savings:
| Technique | Monthly Savings |
|---|---|
| INT8 Quantization | $12,200 |
| Knowledge Distillation + Tiering | $14,200 |
| Batch Inference Migration | $4,100 |
| Inference Caching | $3,400 |
| Spot Instances | $2,800 |
| Logging Optimization | $1,700 |
Implementation Timeline:
- Weeks 1–2: Cost audit and attribution
- Weeks 3–5: Quantization testing and deployment
- Weeks 6–10: Knowledge distillation training and tiered architecture
- Weeks 11–14: Infrastructure optimizations
- Weeks 15–16: Monitoring, compliance fixes, stabilization
Your Decision Framework:
When I work with startups on slashing their AI costs, this prioritization matrix guides decisions:
| Technique | Implementation Effort | Risk Level | Typical Savings |
|---|---|---|---|
| Logging optimization | Low | Low | 10–20% |
| Inference caching | Low | Low | 5–15% |
| Batch inference separation | Medium | Low | 15–25% |
| Quantization | Medium | Medium | 30–50% |
| Spot Instances | Medium | Medium | 20–40% |
| Knowledge distillation | High | Medium–High | 40–60% |
| Pruning | High | High | Variable |
Start at the top. Work your way down. Test everything in shadow mode before production.
I’ll be honest with you. There’s no magic here. Just systematic measurement, incremental optimization, and the discipline to roll back when things break. That $33,000 monthly savings bought the startup eight additional months of runway, enough time to hit their growth targets and raise their Series A.
Your numbers will differ. Your models will behave differently. But the process? That’s reproducible. Start with the audit. You might be surprised by what you find.