Case Studies
Sports & Gaming · Media & Publishing

Two of India's largest digital platforms were hemorrhaging over $10 million annually. Not from a bad product, but from broken incident response.

Intelligent incident response that saves millions of dollars annually

Bharat Kumar
20x faster MTTR
99.95% uptime achieved
225x less noise
$10M+ saved annually

The Problem

3,000–9,000 alerts a day. Nobody trusted any of them. Millions lost in the noise.

Both platforms had monitoring. Dashboards, on-call rotations, PagerDuty, the works. But when something actually broke, the process fell apart.

Alerts fired in every direction. Context was missing. The people who could fix things were never the ones who got paged.

During active incidents, 50–70% of the entire SRE team was consumed by firefighting: triaging noise, hunting context, chasing phantom alerts. Management pressed for revenue impact numbers while engineers drowned in chaos, with no clear signal of what actually mattered.

Alert Volume, 24h Window

Critical incidents marked. Lost in the noise.

[Chart: alerts per 10-minute bucket over a 24-hour window for the Sports & Gaming and Media & Publishing platforms. Five critical incidents (payment timeouts, match data sync failure, CDN cache corruption, CMS pipeline stall, DB connection pool limit) were all missed in the noise, with resolution times between 1h 20m and 2h 30m.]
Sports & Gaming

A Leading Sports Gaming & Media Platform

Crores of concurrent users during IPL and major sporting events. Every second of downtime is lost revenue and eroded trust from millions of active players.

During a semi-final match, a payments microservice went down for 47 minutes. The on-call engineer found out from his PM, who heard from his VP, who read about it on Twitter. Not their monitoring stack.
Media & Publishing

A Major Digital News Publisher

24/7 breaking news operation serving millions of daily readers. When the site goes down during a developing story, readers leave and don't come back.

During election night coverage, the CMS publishing pipeline silently failed. Editors kept writing stories that never went live. Nobody noticed for 2 hours.
What We Plugged Into

Decision Core™. Every signal. One layer.

Riklr's Decision Core™ didn't replace their tools. It embedded into them.

We connected to the full engineering stack so that every signal, from every source, flows through a single intelligence layer.

All data in transit was encrypted, retained only as long as needed, and never left their compliance boundary.

Decision Core is built to meet the latest PII and data privacy standards out of the box.
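For illustration only, here is a minimal Python sketch of what that data-handling posture can look like in practice: PII masked before a signal is ingested, and signals dropped once the retention window lapses. The patterns and the 30-day window are assumptions for the example, not Riklr's actual policy.

import re
from datetime import datetime, timedelta, timezone

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")
RETENTION = timedelta(days=30)  # assumed window: retained only as long as needed

def redact(text: str) -> str:
    """Mask common PII patterns before a signal leaves the client boundary."""
    return PHONE.sub("[phone]", EMAIL.sub("[email]", text))

def is_expired(ingested_at: datetime, now: datetime | None = None) -> bool:
    """Signals older than the retention window are dropped, not archived."""
    now = now or datetime.now(timezone.utc)
    return now - ingested_at > RETENTION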

SDLC Platforms

GitHub

Code changes, PR activity, deploy triggers

Jira

Ticket creation, escalation history

CI/CD

Build failures, deploy events, rollbacks

PagerDuty

On-call schedules, escalation policies

Riklr

Intelligence Layer

Data & Log Sources

CloudWatch & Datadog

Infra metrics, APM, distributed tracing

Splunk & App Logs

Log aggregation, custom instrumentation

Zendesk / CS Portals

Customer-reported issues, support ticket spikes

Twitter / X

Social signals, public incident detection
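To make the "single intelligence layer" idea concrete, here is a minimal Python sketch of the normalisation step: events from the sources above are mapped into one common signal shape before anything downstream sees them. The payload keys and field names are illustrative assumptions, not Riklr's actual connector code.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Signal:
    source: str         # "datadog", "pagerduty", "zendesk", "twitter", ...
    service: str        # the service the signal points at
    severity: str       # raw severity as reported by the source
    message: str
    occurred_at: datetime

def from_monitoring(event: dict) -> Signal:
    # hypothetical monitoring-webhook payload shape
    return Signal(
        source=event.get("source", "datadog"),
        service=event.get("service", "unknown"),
        severity=event.get("severity", "info"),
        message=event.get("title", ""),
        occurred_at=datetime.fromtimestamp(event["timestamp"], tz=timezone.utc),
    )

def from_support(ticket: dict) -> Signal:
    # hypothetical support-ticket payload shape (Zendesk-style)
    return Signal(
        source="zendesk",
        service=ticket.get("product", "unknown"),
        severity="customer_reported",
        message=ticket.get("subject", ""),
        occurred_at=datetime.fromisoformat(ticket["created_at"]),
    )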

The Intelligent Response Engine

From signal to resolution. Seven steps. Zero panic.

The old model was reactive. Something breaks at 2am. SREs get paged into a storm of alerts.

They spend the first thirty minutes just figuring out which service is on fire and who owns it. Then the real scramble starts: hunting down a service owner who may or may not answer the phone.

Riklr replaced that entire chain with an intelligent, automated response engine.

1

Signal Detected

Unified signal ingestion across all connected sources: logs, metrics, deploys, tickets.

2

Classified & Prioritised

ML-driven severity scoring filters noise so only real incidents surface. A simplified sketch of this scoring-and-routing step follows the seven steps.

3

Right Team Alerted

Context-rich notification: what broke, where, likely root cause. Routed to the correct owner.

4

AI Voice Triage

An AI agent calls affected service owners simultaneously. It asks targeted, context-aware questions, collates answers in real time, and begins driving the runbook before a human war room is even assembled.

5

Agentic SOP & Escalation

The agent executes SOPs autonomously, manages the escalation chain, and pulls in backup owners when primary contacts are unreachable. No one falls through the cracks.

6

Incident Report Captured

The full incident timeline, decisions, and resolution steps are written up automatically. A draft report is generated and routed to the service owner for review and sign-off.

7

Repeat Pattern Flagged

System identifies prior occurrences and surfaces the previous resolution path.
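Here is a simplified Python sketch of steps 2, 3 and 5: score a signal, drop the noise, then walk the escalation chain until someone acknowledges. The weights, threshold and ownership map are illustrative assumptions; the production scoring described above is ML-driven, not hand-weighted.

OWNERS = {
    "payments": ["priya", "arjun"],        # primary owner first, then backups
    "cms-pipeline": ["dev", "meera"],
}

def severity_score(source: str, severity: str, message: str) -> float:
    """Toy stand-in for the ML severity model: weight a few obvious features."""
    score = 0.0
    if severity in ("error", "critical"):
        score += 0.5
    if source in ("zendesk", "twitter"):   # customers are already noticing
        score += 0.3
    if "timeout" in message.lower():
        score += 0.2
    return score

def route(service: str, source: str, severity: str, message: str,
          acknowledged: set[str]) -> str | None:
    """Drop noise; otherwise page the chain in order until someone acknowledges."""
    if severity_score(source, severity, message) < 0.5:
        return None                        # filtered out: never pages a human
    for owner in OWNERS.get(service, []):
        if owner in acknowledged:
            return owner
    return "incident-commander"            # nobody reachable: escalate

# e.g. route("payments", "zendesk", "error", "payment timeouts spiking", {"arjun"})
# returns "arjun": the primary didn't acknowledge, so the backup owns the incident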

SOP Evolution

Every fix makes the system smarter.

90% → 20%

Repeat incident rate: from 9 in 10 incidents being reruns, to just 1 in 5.

Every resolved incident feeds back into a living SOP library.

The next time something similar happens, the system already knows how it was fixed. It routes the responder straight to the proven resolution path.
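As a rough illustration, here is a minimal Python sketch of that lookup: each resolved incident is stored with a fingerprint, and a new incident is matched against the library so the responder starts from the proven fix. The token-overlap matching, thresholds, and names are assumptions for the example, not Riklr's implementation.

def fingerprint(summary: str) -> frozenset[str]:
    """Reduce an incident summary to a bag of lowercase tokens."""
    return frozenset(summary.lower().split())

SOP_LIBRARY: dict[frozenset[str], str] = {}   # fingerprint -> proven resolution path

def record_resolution(summary: str, resolution: str) -> None:
    """Called after sign-off on the incident report (step 6 above)."""
    SOP_LIBRARY[fingerprint(summary)] = resolution

def suggest_sop(summary: str, threshold: float = 0.6) -> str | None:
    """Return the closest prior resolution if the overlap is high enough."""
    new = fingerprint(summary)
    best, best_score = None, 0.0
    for seen, resolution in SOP_LIBRARY.items():
        overlap = len(new & seen) / max(len(new | seen), 1)   # Jaccard similarity
        if overlap > best_score:
            best, best_score = resolution, overlap
    return best if best_score >= threshold else None

record_resolution("payments db connection pool exhausted",
                  "Scale pool to 200, restart payments workers, verify p99.")
print(suggest_sop("db connection pool exhausted in payments"))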

1. Incident Occurs
2. Resolved & Recorded
3. SOP Auto-Updated
4. Next Occurrence Pre-empted
Continuous loop

Cross-System Visibility

Not a dashboard. A trust layer.

Twelve browser tabs and a prayer is not an incident response strategy. Engineers were switching contexts constantly, losing time to tool-switching instead of fixing things.

Riklr surfaces one prioritised signal: the thing that actually needs attention right now, with the full context to act on it.

Forward-Deployed Engineering

One size fits all doesn't apply here.

Each of these platforms had a different SDLC, a different data stack, and a different definition of what “visibility” meant to their teams. Pre-packaged dashboards weren't going to cut it.

Riklr's forward-deployed ML engineers and AI specialists embedded directly into each client's environment. They studied the pipelines, understood the workflows, and built agentic solutions on top of Decision Core™, not around it.

The magic wasn't the Decision Core™ alone. It was the tailored deployment.

Embedded expertise

ML engineers and AI specialists sat with their teams, not in a support queue. They learned how each client's stack behaved before they touched a single config.

Native to Decision Core™

Agentic workflows were built directly on Decision Core™. Not bridged. Not bolted on. Every customisation runs inside the same intelligence layer.

Signal over noise

Teams got the views, insights, and alerts built around their workflows. Not the defaults that come pre-checked and get ignored within a week.

Before

[Mockup: twelve disconnected views. CloudWatch CPU utilisation, Datadog p99 latency, GitHub deploys, Splunk error logs, Grafana RPS, Slack #oncall, a Jira sprint board, a PagerDuty queue with open P0s and P1s (api-prod CPU at 97%, auth latency at 2.1s, db replica lag), and a phone showing 152 missed calls from the VP of Engineering, the PM, and PagerDuty bots.]

After

[Mockup: a single Riklr view. Service health tiles (API healthy, CDN degraded, DB healthy, Auth healthy), 3 active incidents (down 40% from last week), and 30-day rolling uptime per service: API 99.98%, CDN 99.82%, DB 99.99%, Auth 99.95%.]

Results

Every metric improved. The systems felt it.

Faster resolution meant fewer cascading failures.

Less noise meant engineers focused on what actually mattered.

Higher uptime meant the platforms held under pressure. The business ran without interruption.

Mean Time to Resolution

Before: 4 hours
After: 12 minutes
20x faster

Repeat Incident Rate

Before: 90%
After: 20%
78% reduction

Platform Uptime

Before: 97.5%
After: 99.95%
+2.45 pp

Daily Actionable Alerts

Before: 3,000–9,000
After: 40–60
Up to 225x less noise

The $10M Breakdown

Where the money comes back from.

MTTR dropped 20x.

Revenue that would have bled out during peak-event outages stayed on the platform.

The SRE team, previously consumed by triage and war rooms, was redeployed to build.

Alert volume fell from thousands a day to 40–60 actionable signals. Engineers stopped chasing noise.

Redundant tooling and external incident response retainers were cut entirely.

Here's where that adds up to $10M+.

$10M+

Downtime Revenue Protected: $7M (69%)
Engineering Hours Recovered: $2.6M (25%)
Tooling & Process Savings: $0.6M (6%)

What This Proves

The problem isn't unique. The solution is proven.

Every large digital platform has the same broken incident response. Too many alerts, too little context, and far too slow to act. The tooling already exists. The data already exists. What has always been missing is the intelligence layer that ties it all together.

That is what Decision Core™ does. It sits on top of the observability, ticketing, deploy and communication tools that already run the business, reads every signal in context, and turns raw telemetry into decisions a responder can actually act on. Nothing gets ripped out. Nothing gets replaced. Existing tooling and data sources become sharper because an intelligence layer is finally making sense of them together.

But software alone did not deliver these results. Our forward-deployed engineers sat inside each client's on-call rotation, studied how their systems actually failed, and tuned Decision Core to their stack, their runbooks, and their people. The platform brought the intelligence. The FDE team brought the fit. Together they turned a generic capability into a deployment that felt native on day one.

And this is not just a lean-team problem. As engineering organisations scale, coordination overhead compounds. More services, more owners, more handoffs, and more places for an incident to fall through the cracks. Riklr scales with the team, keeping response sharp regardless of org size, because the intelligence layer absorbs the complexity that humans would otherwise have to carry.

“Before Riklr, our on-call engineers spent more time figuring out what broke than actually fixing it. Now the system hands them a diagnosis. War rooms are shorter. Postmortems write themselves. We've shipped more in the last two quarters than in the previous six.”

CTO, A Leading Sports Gaming & Media Platform
Working through the same problem? Reach out.