Case Studies
Sports & Gaming · Media & Publishing

Two of India's largest digital platforms were hemorrhaging over $10 million annually. Not from a bad product, but from broken incident response.

Intelligent incident response that saves millions of dollars annually

Bharat Kumar
20x faster MTTR
99.95% uptime achieved
225x less noise
$10M+ saved annually

The Problem

3,000–9,000 alerts a day. Nobody trusted any of them. Millions lost in the noise.

Both platforms had monitoring. Dashboards, on-call rotations, PagerDuty, the works. But when something actually broke, the process fell apart.

Alerts fired in every direction. Context was missing. The people who could fix things were never the ones who got paged.

During active incidents, 50–70% of the entire SRE team was consumed by firefighting: triaging noise, hunting context, chasing phantom alerts. Management pressed for revenue impact numbers while engineers drowned in chaos, with no clear signal of what actually mattered.

Alert Volume, 24h Window

Critical incidents marked. Lost in the noise.

[Chart: alerts per 10-minute bucket over a 24-hour window for the Sports & Gaming and Media & Publishing platforms. Five critical incidents (payment timeouts, match data sync failure, CDN cache corruption, CMS pipeline stall, DB connection pool limit) were all missed in the noise, with resolution times between 1h 20m and 2h 30m.]
Sports & Gaming

A Leading Sports Gaming & Media Platform

Crores of concurrent users during IPL and major sporting events. Every second of downtime is lost revenue and eroded trust from millions of active players.

During a semi-final match, a payments microservice went down for 47 minutes. The on-call engineer found out from his PM, who heard from his VP, who read about it on Twitter. Not their monitoring stack.
Media & Publishing

A Major Digital News Publisher

24/7 breaking news operation serving millions of daily readers. When the site goes down during a developing story, readers leave and don't come back.

During election night coverage, the CMS publishing pipeline silently failed. Editors kept writing stories that never went live. Nobody noticed for 2 hours.
What We Plugged Into

Decision Core™. Every signal. One layer.

Riklr's Decision Core™ didn't replace their tools. It embedded into them.

We connected to the full engineering stack so that every signal, from every source, flows through a single intelligence layer.

All data in transit was encrypted, retained only as long as needed, and never left their compliance boundary.

Decision Core is built to meet the latest PII and data privacy standards out of the box.
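For illustration only, here is a minimal Python sketch of what that data-handling posture can look like in practice: PII masked before a signal is ingested, and signals dropped once the retention window lapses. The patterns and the 30-day window are assumptions for the example, not Riklr's actual policy.

import re
from datetime import datetime, timedelta, timezone

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")
RETENTION = timedelta(days=30)  # assumed window: retained only as long as needed

def redact(text: str) -> str:
    """Mask common PII patterns before a signal leaves the client boundary."""
    return PHONE.sub("[phone]", EMAIL.sub("[email]", text))

def is_expired(ingested_at: datetime, now: datetime | None = None) -> bool:
    """Signals older than the retention window are dropped, not archived."""
    now = now or datetime.now(timezone.utc)
    return now - ingested_at > RETENTION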

SDLC Platforms

GitHub

Code changes, PR activity, deploy triggers

Jira

Ticket creation, escalation history

CI/CD

Build failures, deploy events, rollbacks

PagerDuty

On-call schedules, escalation policies

Riklr

Intelligence Layer

Data & Log Sources

CloudWatch & Datadog

Infra metrics, APM, distributed tracing

Splunk & App Logs

Log aggregation, custom instrumentation

Zendesk / CS Portals

Customer-reported issues, support ticket spikes

Twitter / X

Social signals, public incident detection
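To make the "single intelligence layer" idea concrete, here is a minimal Python sketch of the normalisation step: events from the sources above are mapped into one common signal shape before anything downstream sees them. The payload keys and field names are illustrative assumptions, not Riklr's actual connector code.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Signal:
    source: str         # "datadog", "pagerduty", "zendesk", "twitter", ...
    service: str        # the service the signal points at
    severity: str       # raw severity as reported by the source
    message: str
    occurred_at: datetime

def from_monitoring(event: dict) -> Signal:
    # hypothetical monitoring-webhook payload shape
    return Signal(
        source=event.get("source", "datadog"),
        service=event.get("service", "unknown"),
        severity=event.get("severity", "info"),
        message=event.get("title", ""),
        occurred_at=datetime.fromtimestamp(event["timestamp"], tz=timezone.utc),
    )

def from_support(ticket: dict) -> Signal:
    # hypothetical support-ticket payload shape (Zendesk-style)
    return Signal(
        source="zendesk",
        service=ticket.get("product", "unknown"),
        severity="customer_reported",
        message=ticket.get("subject", ""),
        occurred_at=datetime.fromisoformat(ticket["created_at"]),
    )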

The Intelligent Response Engine

From signal to resolution. Seven steps. Zero panic.

The old model was reactive. Something breaks at 2am. SREs get paged into a storm of alerts.

They spend the first thirty minutes just figuring out which service is on fire and who owns it. Then the real scramble starts: hunting down a service owner who may or may not answer the phone.

Riklr replaced that entire chain with an intelligent, automated response engine.

1

Signal Detected

Unified signal ingestion across all connected sources: logs, metrics, deploys, tickets.

2

Classified & Prioritised

ML-driven severity scoring filters noise so only real incidents surface. A simplified sketch of this scoring-and-routing step follows the seven steps.

3

Right Team Alerted

Context-rich notification: what broke, where, likely root cause. Routed to the correct owner.

4

AI Voice Triage

An AI agent calls affected service owners simultaneously. It asks targeted, context-aware questions, collates answers in real time, and begins driving the runbook before a human war room is even assembled.

5

Agentic SOP & Escalation

The agent executes SOPs autonomously, manages the escalation chain, and pulls in backup owners when primary contacts are unreachable. No one falls through the cracks.

6

Incident Report Captured

The full incident timeline, decisions, and resolution steps are written up automatically. A draft report is generated and routed to the service owner for review and sign-off.

7

Repeat Pattern Flagged

System identifies prior occurrences and surfaces the previous resolution path.
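Here is a simplified Python sketch of steps 2, 3 and 5: score a signal, drop the noise, then walk the escalation chain until someone acknowledges. The weights, threshold and ownership map are illustrative assumptions; the production scoring described above is ML-driven, not hand-weighted.

OWNERS = {
    "payments": ["priya", "arjun"],        # primary owner first, then backups
    "cms-pipeline": ["dev", "meera"],
}

def severity_score(source: str, severity: str, message: str) -> float:
    """Toy stand-in for the ML severity model: weight a few obvious features."""
    score = 0.0
    if severity in ("error", "critical"):
        score += 0.5
    if source in ("zendesk", "twitter"):   # customers are already noticing
        score += 0.3
    if "timeout" in message.lower():
        score += 0.2
    return score

def route(service: str, source: str, severity: str, message: str,
          acknowledged: set[str]) -> str | None:
    """Drop noise; otherwise page the chain in order until someone acknowledges."""
    if severity_score(source, severity, message) < 0.5:
        return None                        # filtered out: never pages a human
    for owner in OWNERS.get(service, []):
        if owner in acknowledged:
            return owner
    return "incident-commander"            # nobody reachable: escalate

# e.g. route("payments", "zendesk", "error", "payment timeouts spiking", {"arjun"})
# returns "arjun": the primary didn't acknowledge, so the backup owns the incident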

SOP Evolution

Every fix makes the system smarter.

90% → 20%

Repeat incident rate: from 9 in 10 incidents being reruns, to just 1 in 5.

Every resolved incident feeds back into a living SOP library.

The next time something similar happens, the system already knows how it was fixed. It routes the responder straight to the proven resolution path.
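As a rough illustration, here is a minimal Python sketch of that lookup: each resolved incident is stored with a fingerprint, and a new incident is matched against the library so the responder starts from the proven fix. The token-overlap matching, thresholds, and names are assumptions for the example, not Riklr's implementation.

def fingerprint(summary: str) -> frozenset[str]:
    """Reduce an incident summary to a bag of lowercase tokens."""
    return frozenset(summary.lower().split())

SOP_LIBRARY: dict[frozenset[str], str] = {}   # fingerprint -> proven resolution path

def record_resolution(summary: str, resolution: str) -> None:
    """Called after sign-off on the incident report (step 6 above)."""
    SOP_LIBRARY[fingerprint(summary)] = resolution

def suggest_sop(summary: str, threshold: float = 0.6) -> str | None:
    """Return the closest prior resolution if the overlap is high enough."""
    new = fingerprint(summary)
    best, best_score = None, 0.0
    for seen, resolution in SOP_LIBRARY.items():
        overlap = len(new & seen) / max(len(new | seen), 1)   # Jaccard similarity
        if overlap > best_score:
            best, best_score = resolution, overlap
    return best if best_score >= threshold else None

record_resolution("payments db connection pool exhausted",
                  "Scale pool to 200, restart payments workers, verify p99.")
print(suggest_sop("db connection pool exhausted in payments"))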

1. Incident Occurs
2. Resolved & Recorded
3. SOP Auto-Updated
4. Next Occurrence Pre-empted
Continuous loop

Cross-System Visibility

Not a dashboard. A trust layer.

Twelve browser tabs and a prayer is not an incident response strategy. Engineers were switching contexts constantly, losing time to tool-switching instead of fixing things.

Riklr surfaces one prioritised signal: the thing that actually needs attention right now, with the full context to act on it.

Forward-Deployed Engineering

One size fits all doesn't apply here.

Each of these platforms had a different SDLC, a different data stack, and a different definition of what “visibility” meant to their teams. Pre-packaged dashboards weren't going to cut it.

Riklr's forward-deployed ML engineers and AI specialists embedded directly into each client's environment. They studied the pipelines, understood the workflows, and built agentic solutions on top of Decision Core™, not around it.

The magic wasn't the Decision Core™ alone. It was the tailored deployment.

Embedded expertise

ML engineers and AI specialists sat with their teams, not in a support queue. They learned how each client's stack behaved before they touched a single config.

Native to Decision Core™

Agentic workflows were built directly on Decision Core™. Not bridged. Not bolted on. Every customisation runs inside the same intelligence layer.

Signal over noise

Teams got the views, insights, and alerts built around their workflows. Not the defaults that come pre-checked and get ignored within a week.

Before

[Mockup: twelve disconnected views. CloudWatch CPU utilisation, Datadog p99 latency, GitHub deploys, Splunk error logs, Grafana RPS, Slack #oncall, a Jira sprint board, a PagerDuty queue with open P0s and P1s (api-prod CPU at 97%, auth latency at 2.1s, db replica lag), and a phone showing 152 missed calls from the VP of Engineering, the PM, and PagerDuty bots.]

After

[Mockup: a single Riklr view. Service health tiles (API healthy, CDN degraded, DB healthy, Auth healthy), 3 active incidents (down 40% from last week), and 30-day rolling uptime per service: API 99.98%, CDN 99.82%, DB 99.99%, Auth 99.95%.]

Results

Every metric improved. The systems felt it.

Faster resolution meant fewer cascading failures.

Less noise meant engineers focused on what actually mattered.

Higher uptime meant the platforms held under pressure. The business ran without interruption.

Mean Time to Resolution

Before: 4 hours
After: 12 minutes
20x faster

Repeat Incident Rate

Before: 90%
After: 20%
78% reduction

Platform Uptime

Before: 97.5%
After: 99.95%
+2.45 pp

Daily Actionable Alerts

Before: 3,000–9,000
After: 40–60
Up to 225x less noise

The $10M Breakdown

Where the money comes back from.

MTTR dropped 20x.

Revenue that would have bled out during peak-event outages stayed on the platform.

The SRE team, previously consumed by triage and war rooms, was redeployed to build.

Alert volume fell from thousands a day to 40–60 actionable signals. Engineers stopped chasing noise.

Redundant tooling and external incident response retainers were cut entirely.

Here's where that adds up to $10M+.

$10M+

Downtime Revenue Protected: $7M (69%)
Engineering Hours Recovered: $2.6M (25%)
Tooling & Process Savings: $0.6M (6%)

What This Proves

The problem isn't unique. The solution is proven.

Every large digital platform has the same broken incident response. Too many alerts, too little context, and far too slow to act. The tooling already exists. The data already exists. What has always been missing is the intelligence layer that ties it all together.

That is what Decision Core™ does. It sits on top of the observability, ticketing, deploy and communication tools that already run the business, reads every signal in context, and turns raw telemetry into decisions a responder can actually act on. Nothing gets ripped out. Nothing gets replaced. Existing tooling and data sources become sharper because an intelligence layer is finally making sense of them together.

But software alone did not deliver these results. Our forward-deployed engineers sat inside each client's on-call rotation, studied how their systems actually failed, and tuned Decision Core to their stack, their runbooks, and their people. The platform brought the intelligence. The FDE team brought the fit. Together they turned a generic capability into a deployment that felt native on day one.

And this is not just a lean-team problem. As engineering organisations scale, coordination overhead compounds. More services, more owners, more handoffs, and more places for an incident to fall through the cracks. Riklr scales with the team, keeping response sharp regardless of org size, because the intelligence layer absorbs the complexity that humans would otherwise have to carry.

“Before Riklr, our on-call engineers spent more time figuring out what broke than actually fixing it. Now the system hands them a diagnosis. War rooms are shorter. Postmortems write themselves. We've shipped more in the last two quarters than in the previous six.”

CTO, A Leading Sports Gaming & Media Platform
Working through the same problem? Reach out.