Case Study

“What gets measured gets managed.”

— Peter Drucker

We mapped every dollar of a $100M+ cloud bill, found the waste, built the fix, and automated the whole thing.

Impact at a glance

$10M+ saved per year
Cost / $ revenue: $0.019 → $0.013
Cost / user / day: $3.8 → $2.8
99.97% peak-event SLA
0 capacity incidents during peak events
1,840 hrs engineering hours recovered / yr
Data Foundation

Decision Core sits where your data already lives.

No new pipelines. No rip-and-replace. Decision Core ingests signals from every tool your team already uses, processes them in real time, and surfaces every observable insight automatically.

[Diagram: Telemetry, Agreements & Contracts, Cloud Cost Data, and Technical Documentation flow into the Riklr Decision Core, which surfaces cost by service (Compute, Database, Cache, Network, Storage), unit cost trend (before vs after), demand signals against event windows, and infrastructure utilisation intensity per service.]

Savings Programme

Identified multiple levers projected to save ~$11M annually.

The engagement opened with a full opportunity mapping exercise. Every potential savings initiative across the stack was surfaced, modelled against 12 months of billing data, and ranked by savings realised per unit of engineering effort. Only the highest-leverage levers made the programme.
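As a sketch of the ranking mechanics, here is what that prioritisation reduces to in Python; every figure below is a hypothetical placeholder, not a number from the engagement.

```python
from dataclasses import dataclass

@dataclass
class Lever:
    name: str
    annual_savings_usd: float  # modelled against 12 months of billing data
    effort_weeks: float        # estimated engineering effort to ship

# All figures below are hypothetical placeholders for illustration.
levers = [
    Lever("Lever A", 5_000_000, 12),
    Lever("Lever B", 2_000_000, 4),
    Lever("Lever C", 600_000, 1),
]

# Rank by savings realised per unit of engineering effort.
for lever in sorted(levers, key=lambda l: l.annual_savings_usd / l.effort_weeks,
                    reverse=True):
    print(f"{lever.name}: ${lever.annual_savings_usd / lever.effort_weeks:,.0f} "
          f"per engineer-week")
```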

01

Prediction-based Scaling

Scale infrastructure ahead of demand, not in reaction to it. Decision Core ingests event schedules and live signals to provision capacity before traffic arrives, eliminating both standing over-provisioning and the cost of scaling lag (a sketch of the idea follows below).

$6.5M projected / yr (59% of total)
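A minimal sketch of that mechanism, assuming a published event schedule and a measured provisioning lead time; the 20-minute lead, per-instance capacity, and service name are illustrative, not production values.

```python
from datetime import datetime, timedelta, timezone

# Illustrative assumptions; the real lead time and per-instance capacity
# would come from measured provisioning and load-test data.
PROVISION_LEAD = timedelta(minutes=20)  # provision + boot + cache warm-up
USERS_PER_INSTANCE = 40_000

def plan_scale_actions(events, now):
    """Emit a target capacity for every event whose lead window has opened."""
    actions = []
    for event in events:
        scale_at = event["start"] - PROVISION_LEAD
        if scale_at <= now < event["start"]:
            instances = -(-event["expected_peak_users"] // USERS_PER_INSTANCE)  # ceil
            actions.append({"service": event["service"], "instances": instances})
    return actions

schedule = [{"service": "stream-edge",  # hypothetical service name
             "start": datetime(2024, 6, 1, 18, 0, tzinfo=timezone.utc),
             "expected_peak_users": 20_000_000}]
print(plan_scale_actions(schedule,
                         now=datetime(2024, 6, 1, 17, 45, tzinfo=timezone.utc)))
```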
02

Instance Reservation

Commit the right baseline of capacity at reserved rates. ML models predict sustained-use patterns with enough accuracy to shift workloads off on-demand pricing without sacrificing headroom.

$2.7M projected / yr (25% of total)
03

Additional Initiatives

A portfolio of targeted fixes surfaced during the diagnostic phase, individually smaller but collectively material.

Unused infra removal · Log retention reduction · Instance family upgrades

$1.8M projected / yr (16% of total)

Total projected savings: ~$11M / yr

Projections were validated against 12 months of billing data before engagement. Realised savings of $10M+ were delivered within the first operating year.

Lever 1: Prediction-based Scaling

Traffic spikes in a minute.
Infrastructure takes 15–20.

Reactive autoscaling was built for gradual load growth, not for a live sports platform where a single whistle can push concurrent users from 1M to 20M in under 60 seconds. By the time new instances are provisioned and warmed up, the match moment has already passed.

[Chart: concurrent users vs reactive infra capacity, −30 min to +30 min around t=0. Traffic jumps from the infra baseline to 20M users in under a minute; reactive capacity follows with a 15–20 min lag before settling at a new baseline.]

< 60 sec: time for traffic to spike to 20M concurrent users

15–20 min: time for new instances to provision, boot, and warm up

Every spike: reactive infra arrives late. The window has already moved on.

The Two-Mode Problem

Without prediction, the system kept falling into two bad operating modes.

Engineering teams were trapped between two equally unsustainable choices. Neither was a strategy. Both were symptoms of running infrastructure without any view of what demand would do next.

Mode 01

Over-provisioned for safety

Burned by past incidents, teams ran 2–3× expected peak headroom and held it permanently. Outside active match windows, the majority of every day, that capacity sat completely idle.

~50% avg idle

[Chart: provisioned capacity held flat, well above actual traffic outside match windows]

The waste wasn't visible on any single day. It accumulated invisibly across thousands of hours of off-peak idle time every year.
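A back-of-envelope illustration of how that accumulation works, using hypothetical rates rather than the engagement's billing figures:

```python
# Hypothetical rates for illustration only; the engagement worked from
# 12 months of actual billing data.
provisioned_cost_per_hour = 1_000  # $/hr for the permanently held headroom fleet
avg_idle_fraction = 0.50           # ~50% of that capacity idle on average
hours_per_year = 24 * 365

annual_idle_cost = provisioned_cost_per_hour * avg_idle_fraction * hours_per_year
print(f"~${annual_idle_cost:,.0f}/yr of waste that never shows up on any single day")
```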

Mode 02

Reactive, always 15 min late

The autoscaler fired on CPU and memory thresholds. By the time those signals appeared, new instances provisioned, and caches warmed, the match moment had already passed.

15 min too late

[Chart: traffic demand spikes; infra capacity ramps only after the peak has passed]

The autoscaler was designed for gradual ramp-ups, not a system where a referee's whistle can move a million concurrent users in under 60 seconds.

Both modes share the same root cause: infrastructure decisions made without any knowledge of what demand will do next. The fix wasn't faster alerting or a better autoscaler. It was replacing the reactive loop entirely with a prediction-first architecture.

Autonomous Control

After prediction came quiet, always-on execution.

Riklr agents continuously consumed prediction events, checked policy boundaries, and executed bounded scale actions in the background. The operating model became calmer precisely because the system stopped waiting for humans to catch up.

15–20 min: the old reactive lag, eliminated. The loop now acts 15–30 min ahead.

1. Observe. Watch live demand, event signals, and current stack posture continuously (inputs: event calendar, live demand, stack posture, policy windows).

2. Predict. Project demand over the next operating horizon (roughly T−30 to T+30) rather than reacting to the present minute.

3. Act. Issue scale-up or scale-down actions across apps, cache, and database within policy guardrails.

4. Stable. Return to low-friction monitoring with fewer manual escalations and less wasted capacity.
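In code, the loop reduces to something like the sketch below; the signals, predictor, policy, and executor objects are illustrative stand-ins for the Riklr agents, not their actual API.

```python
import time
from dataclasses import dataclass

@dataclass
class ScaleAction:
    service: str  # e.g. "apps", "cache", "db"
    delta: int    # instances to add (+) or remove (-)

def control_loop(signals, predictor, policy, executor,
                 horizon_min=30, tick_sec=60):
    """Prediction-first control loop: act on the forecast, not the present minute."""
    while True:
        state = signals.snapshot()                          # 1. Observe
        forecast = predictor.forecast(state, horizon_min)   # 2. Predict
        for action in policy.plan(state, forecast):         # 3. Act, within guardrails
            if policy.within_guardrails(action):
                executor.apply(action)
        time.sleep(tick_sec)                                # 4. Stable until next tick
```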
Lever 2: Instance Reservation

Reservations are a living portfolio, not a one-time procurement decision.

Every reserved instance carries an expiry date. As batches expire, total covered capacity drops, and anything uncovered defaults to on-demand rates at the worst possible moment. Decision Core tracks the full reservation portfolio, forecasts upcoming demand, and triggers renewals and resizes before each expiry window closes.
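A minimal sketch of that expiry-tracking logic, assuming a 30-day renewal lead and a forecast callable; the field names and figures are illustrative:

```python
from datetime import date, timedelta

RENEWAL_LEAD = timedelta(days=30)  # assumption: act a month before each expiry

def renewals_due(reservations, forecast_demand, today):
    """Flag reservations to renew, right-sized against the latest forecast."""
    due = []
    for res in reservations:
        if today >= res["expires"] - RENEWAL_LEAD:
            # Renew at the forecast quantity rather than blindly rolling over.
            due.append({"family": res["family"],
                        "renew_qty": forecast_demand(res["family"], res["expires"])})
    return due

reservations = [{"family": "m5.xlarge", "qty": 40, "expires": date(2024, 9, 1)}]
print(renewals_due(reservations,
                   forecast_demand=lambda family, when: 44,  # stand-in for ML forecast
                   today=date(2024, 8, 10)))
```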

Projected annual savings: $2.7M
From shifting baseline capacity to reserved pricing, right-sized continuously by ML forecast.

Rate reduction vs on-demand: ~30%
Reserved instances cost significantly less if you commit the right size at the right time.

Uncovered expiry windows: 0
The system renews before every expiry. No gap where capacity reverts to on-demand billing.

ML-Driven Reservation Portfolio

12 months of continuous portfolio management. Reservations are bought before peaks, renewed before expiry, and right-sized each cycle based on the latest demand forecast. The staircase pattern is the system working: each step up is a purchase, each step down an expiry.

8 portfolio events / yr
≤ 5% headroom over demand
0 uncovered windows
~30% rate saving on the reserved floor

[Chart: reserved instance count (0–100), Jan–Jan. Total reserved capacity steps through eight portfolio events (−6, +16, −8, +22, −10, +8, −14, −8), each +N a purchase and each −N an expiry, tracking actual average demand against the optimum reservation line.]

The key insight: the gap between reserved capacity and actual demand is intentionally tight: wide enough to absorb forecast error, narrow enough to avoid idle waste. Getting that gap right, across every renewal cycle, is the optimisation. A static floor set once a year cannot do this.
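As a sketch of that sizing rule: deriving the headroom from recent forecast error (a 95th-percentile assumption here) is what keeps the gap adaptive in a way a static annual floor cannot be.

```python
def reservation_target(demand_forecast, recent_forecast_errors, max_headroom=0.05):
    """Size the reserved floor: wide enough to absorb forecast error,
    never more than ~5% above forecast (the programme's headroom target)."""
    errors = sorted(recent_forecast_errors)
    # Assumption: cover the 95th percentile of recent relative forecast error.
    p95_error = errors[int(0.95 * (len(errors) - 1))]
    headroom = min(max(p95_error, 0.0), max_headroom)
    return round(demand_forecast * (1 + headroom))

# Forecast of 80 instances; recent relative errors mostly small under-forecasts.
print(reservation_target(80, [-0.02, 0.01, 0.03, 0.04, -0.01]))  # -> 82
```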

Key outcomes

$10M+ saved annually
99.97% peak-event SLA
0 capacity incidents
20× faster scaling response
1,840 hrs engineering time recovered / yr

Seeing the same patterns in your infrastructure?

We work with engineering and platform teams to instrument cloud spend, model demand ahead of events, and build the automation layer that removes manual toil. If any of this resonates, let’s talk.