Software Maintenance · OnCall · Engineering Culture · May 2026

Being OnCall.
Expect the unexpected,
while hoping for none.

OnCall is one of the most critical, most ad hoc, and most underappreciated functions in software engineering. This is an attempt to put structure around it: not to make it rigid, but to make it survivable. These are two cents from someone who has lived it. I hope they sharpen, even slightly, how you think about OnCall responsibilities.

Pradeep A. Dalvi
Computer Engineer · Distributed Systems & Payments
11 min read
24/7: Systems don’t observe business hours. Neither do the failures.
1st: Line of defence against production incidents. That’s you, the OnCall engineer.
BAU: The baseline you must know cold before you can recognise what isn’t it.
0: Times it is acceptable to skip a successful handover.

01

What OnCall Actually Is

OnCall is one of the most critical functions in keeping BAU (Business As Usual) activities running uninterrupted. It provides direct, unfiltered observability into the systems you work on, the kind that design documents and code reviews simply cannot replicate. It surfaces unanticipated gaps in well-designed, well-developed systems that only reveal themselves at scale and over time.

And yet it is one of the less appreciated, sometimes highly neglected activities among engineers, crowded out by the chase for new features and greenfield development while the ever-changing factors of time and scale quietly accumulate risk. Ageing hardware. Growing tech debt. Increasing volume. These do not file tickets. They wait for OnCall to meet them.

“OnCall is the window when one has to be mentally prepared to expect the unexpected,
while deep down hoping it will be uneventful.”

The gap between expecting the unexpected and being unprepared for it is what this article tries to close. It is not a rigid prescription — it is a skeleton. You will add flesh to it from your own systems, your own stack, your own hard-won BAU baselines.


02

The Daily Checklist

The checklist is not a form you fill when you go OnCall. It is a living document you maintain whether you are OnCall or not — built incrementally, continuously calibrated against what you have seen and what others have learned the hard way.

Four principles for a checklist that actually works

1
build it before you need it

Prepare the checklist of systems you are responsible for and keep adding metrics over time, whether you are OnCall or not. A checklist written during an incident is a wish list, not a tool.

2
learn from others’ incidents

Every post-mortem — yours or another team’s — is a free lesson in what your checklist is missing. Improve it periodically, not just after your own misses. Borrowed experience is cheaper than earned pain.

3
always choose to fall forward

Learn to identify gaps quickly and provide immediate mitigation over rollback. Rollback is not always the safest or fastest path. Forward mitigation with eyes open beats blind reversal.

4
divide metrics into focused areas

Most metrics will fall into one or more of four buckets. Structure your checklist around them — not by system, but by what the metric means when it moves.

01 / tech

Tech Metrics

Error rates, latency percentiles, queue depths, cache hit ratios, resource saturation. The health of the infrastructure beneath everything else.

02 / business

Business Metrics

Transaction success rates, conversion, throughput of critical user flows. Where engineering health translates directly into customer experience and revenue.

03 / growth

Growth Metrics

Volume trends, scale headroom, capacity runway. The signals that tell you whether today’s system can handle next month’s load.

04 / compliance

Compliance Metrics

Regulatory SLAs, audit trail completeness, data retention adherence. Non-negotiable, and rarely loud until they are already in breach.
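
A checklist organised by what a metric means, rather than by which system emits it, might be held in a structure like the one below. This is a minimal sketch in Python: the `ChecklistItem` type, the metric names, and the system names are all hypothetical and stand in for whatever your own stack exposes.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    TECH = "tech"
    BUSINESS = "business"
    GROWTH = "growth"
    COMPLIANCE = "compliance"


@dataclass(frozen=True)
class ChecklistItem:
    metric: str          # hypothetical metric name
    system: str          # hypothetical system that emits it
    buckets: frozenset   # a metric may fall into more than one bucket


def group_by_bucket(items):
    """Index metric names by bucket so a review walks meaning, not systems."""
    grouped = defaultdict(list)
    for item in items:
        for bucket in item.buckets:
            grouped[bucket].append(item.metric)
    return dict(grouped)


CHECKLIST = [
    ChecklistItem("p99_latency_ms", "payments-api", frozenset({Bucket.TECH})),
    ChecklistItem("txn_success_rate", "payments-api",
                  frozenset({Bucket.TECH, Bucket.BUSINESS})),
    ChecklistItem("daily_volume_trend", "ledger", frozenset({Bucket.GROWTH})),
    ChecklistItem("audit_log_completeness", "ledger",
                  frozenset({Bucket.COMPLIANCE})),
]

by_bucket = group_by_bucket(CHECKLIST)
```

Note that `txn_success_rate` appears under both the tech and the business bucket. That overlap is deliberate: the same number moving can mean two different things depending on who is reading it.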


03

BAU Activities

BAU — Business As Usual — is not background noise. It is the baseline that everything else is measured against. OnCall’s first responsibility is to ensure BAU runs, runs correctly, and runs on time. This means owning three categories of routine work:

tech jobs
Periodic Tech Jobs

Reconciliations, workflow terminal state management, data consistency checks, scheduled maintenance tasks. These are not glamorous. They are, however, completely non-negotiable from the customer experience point of view. Periodic tech jobs build trust with customers, maintain end-user relationships, and effectively build the brand over time. Taking them lightly is completely disastrous in the long term — the damage accumulates invisibly until it becomes visible all at once, and usually at the worst possible moment.

business jobs
Business-Related Jobs

Reporting pipelines, SLA-bound processing windows, partner and vendor reconciliations. These tie your systems’ runtime behaviour directly to commercial commitments. Failures here are felt by people outside engineering before anyone files a ticket.

compliance jobs
Compliance-Related Jobs

Regulatory reporting, audit log completeness, data lifecycle enforcement. These run on external clocks, not engineering ones. A missed compliance window does not have a mitigation path. It has a regulator.

The BAU principle

You cannot detect an anomaly until you know what normal looks like. Owning BAU is not a lesser task than responding to incidents. It is the precondition for recognising them.
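
The three BAU questions (did it run, did it run correctly, did it run on time) can be turned into a small status check. The sketch below assumes a hypothetical `JobRun` record with a deadline and a finish time; the job names and timestamps are invented for illustration and do not come from any real scheduler.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class JobRun:
    name: str                        # hypothetical job name
    category: str                    # "tech", "business" or "compliance"
    deadline: datetime               # when the run must have finished
    finished_at: Optional[datetime]  # None if the run has not finished yet
    succeeded: bool = False


def bau_status(run: JobRun, now: datetime) -> str:
    """Classify a run: pending, missed, failed, late or ok."""
    if run.finished_at is None:
        return "missed" if now > run.deadline else "pending"
    if not run.succeeded:
        return "failed"
    return "late" if run.finished_at > run.deadline else "ok"


now = datetime(2026, 5, 4, 9, 0)
runs = [
    JobRun("ledger_reconciliation", "tech",
           datetime(2026, 5, 4, 6, 0), datetime(2026, 5, 4, 5, 40), True),
    JobRun("partner_settlement_report", "business",
           datetime(2026, 5, 4, 8, 0), datetime(2026, 5, 4, 8, 25), True),
    JobRun("regulatory_daily_filing", "compliance",
           datetime(2026, 5, 4, 7, 0), None),
]

report = {run.name: bau_status(run, now) for run in runs}
needs_attention = [name for name, status in report.items()
                   if status not in ("ok", "pending")]
```

A run that succeeded but finished after its deadline is still flagged. For business and compliance jobs, "on time" is part of "correct".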


04

Anomalies

anomaly, n.
Something that deviates from what is standard, normal, or expected. To recognise something as an anomaly, one must first have an in-depth understanding of what BAU looks like. That is why the first two responsibilities, the checklist and BAU, need to be understood clearly before engaging with this one.

Anomalies arrive in the same three categories as BAU — because that is where normal lives, and where deviations from it first become visible.

T
tech anomalies

Unexpected error rate spikes, latency shifts, resource saturation events, job failures, dependency degradation. Often the first signal, because the system always tells you before the customer does — if you are watching.

B
business anomalies

Transaction success rates outside historical range, throughput drops, processing delays that breach SLAs. These are where tech anomalies become customer-visible. The gap between T and B is your detection window.

C
compliance anomalies

Regulatory reporting gaps, audit trail holes, retention breaches. The most expensive anomalies to resolve after the fact. Preventable almost entirely through consistent BAU monitoring.

The most effective anomaly detection is not reactive alerting — it is the accumulated intuition of an engineer who knows their system’s normal so well that something off feels wrong before the alert fires. That intuition is built through BAU ownership, not through incident response alone.

Questions to ask when something looks anomalous
  • Is this a deviation from normal, or is my normal baseline stale?
  • When did this start? Is there a correlated deployment, config change, or traffic event?
  • Is this isolated to one system, or spreading across a dependency boundary?
  • Is the impact currently tech-only, or has it reached business or compliance metrics?
  • Has this pattern appeared before? Is there a prior post-mortem or runbook entry?
  • What is the blast radius if this continues for another 30 minutes without intervention?
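
The first question, whether a reading is a deviation from normal, can be approximated by comparing it against a historical baseline. Below is a minimal sketch using a z-score. The threshold of 3.0 and the sample success rates are assumptions for illustration, and the check is only as good as the baseline fed into it: a stale baseline produces confident wrong answers.

```python
from statistics import mean, stdev


def z_score(history, current):
    """Distance of `current` from the baseline mean, in standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any movement at all is a deviation.
        return 0.0 if current == history[0] else float("inf")
    return abs(current - mean(history)) / sigma


def is_anomalous(history, current, threshold=3.0):
    """Flag a reading that sits more than `threshold` deviations from normal."""
    return z_score(history, current) > threshold


# Hypothetical daily transaction success rates (percent).
baseline = [99.1, 99.3, 99.2, 99.0, 99.4]

dip = is_anomalous(baseline, 97.5)      # far outside the historical range
normal = is_anomalous(baseline, 99.25)  # well within it
```

This is not a replacement for knowing the system. It is a way to write down what "normal" means so that the next engineer inherits it.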

05

Unanticipated Issues & Mitigation

Even with a complete checklist, a deep BAU baseline, and sharp anomaly detection — the unanticipated still arrives. That is the definition of OnCall. The goal is not to eliminate surprise. It is to reduce the time between surprise and controlled response.

Mitigation sequence — from first signal to resolution

1
detect & triage

Confirm the signal is real, not a monitoring artefact. Establish impact scope immediately: is this tech-only, business-impacting, or compliance-touching? The triage answer determines everything that follows.

2
communicate early

Notify stakeholders at the start of investigation, not after resolution. A short, factual early message (“investigating an anomaly in X, impact Y, update in Z mins”) costs thirty seconds and prevents thirty minutes of escalation noise.

3
immediate mitigation over rollback

Identify the fastest path to a stable state. This is frequently not a rollback. Always choose to fall forward — a targeted mitigation with known trade-offs beats an untested rollback that may introduce new unknowns.

4
drive the team, not just the ticket

OnCall’s primary responsibility during an incident is not to fix everything alone. It is to be the primary point of contact and to drive the team toward immediate mitigation steps — assigning, unblocking, and keeping the thread moving.

5
document action items before closing

Every unanticipated issue leaves behind pending action items: permanent fixes, runbook updates, alerting improvements, tech debt logged. Closing an incident without capturing these is how the same issue becomes next rotation’s problem.
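
Steps 1 and 2 of the sequence above lend themselves to a template, so that the early message is a matter of filling in three fields rather than composing prose under pressure. The sketch below is one possible shape: the `EarlyUpdate` type and the example system name are hypothetical, and the three impact scopes are the ones named in the triage step.

```python
from dataclasses import dataclass

# Impact scopes taken from the triage step.
IMPACT_SCOPES = ("tech-only", "business-impacting", "compliance-touching")


@dataclass
class EarlyUpdate:
    system: str            # what is being investigated
    impact: str            # one of IMPACT_SCOPES
    next_update_mins: int  # when stakeholders will hear from you again

    def __post_init__(self):
        if self.impact not in IMPACT_SCOPES:
            raise ValueError(f"impact must be one of {IMPACT_SCOPES}")
        if self.next_update_mins <= 0:
            raise ValueError("next_update_mins must be positive")

    def render(self) -> str:
        """Produce the short, factual message to send at the start."""
        return (
            f"Investigating an anomaly in {self.system}. "
            f"Impact: {self.impact}. "
            f"Next update in {self.next_update_mins} mins."
        )


message = EarlyUpdate("payments-api", "business-impacting", 15).render()
```

Forcing the impact field to be one of the three triage answers means the message cannot be sent until triage has actually produced an answer.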


06

Tech Debt & Improving Observability

OnCall is the best source of signal on where your system is quietly accumulating debt. Not the kind that shows up in a sprint retro — the kind that only reveals itself under load, at 3 AM, on a Friday. Every unanticipated issue is an X-ray of a place where the system’s design assumptions no longer match reality.

Tech debt surfaced by OnCall

Issues that appear periodically and are manually resolved each time are not BAU. They are tech debt wearing a BAU costume. If something can be automated, automate it. If it recurs on a cadence, it deserves a permanent fix. The classification matters: a periodic manual task is either accepted operational work or unresolved engineering debt. Name it correctly.
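
Spotting debt in a BAU costume can start with nothing more than counting. The sketch below assumes a rotation log of `(issue_signature, was_manual)` pairs; the signatures and the cut-off of three occurrences are hypothetical and should be tuned to your own rotation length.

```python
from collections import Counter


def recurring_manual_fixes(resolutions, min_occurrences=3):
    """Return issue signatures resolved by hand at least `min_occurrences` times."""
    counts = Counter(signature for signature, was_manual in resolutions
                     if was_manual)
    return {sig: n for sig, n in counts.items() if n >= min_occurrences}


# Hypothetical log for one rotation: (issue signature, resolved manually?)
rotation_log = [
    ("stuck_workflow_terminal_state", True),
    ("disk_cleanup_on_batch_host", True),
    ("stuck_workflow_terminal_state", True),
    ("cache_warmup_after_deploy", False),
    ("stuck_workflow_terminal_state", True),
]

debt_candidates = recurring_manual_fixes(rotation_log)
```

Anything this returns still needs a human decision: accepted operational work, or engineering debt. The count only makes sure the question gets asked.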

Improving metrics

After every non-trivial incident, ask: what metric would have told us this was coming ten minutes earlier? Add it. Observability is not a project with a completion date. It is a continuous refinement driven by everything the system has surprised you with.

Improving alerting

An alert that fires too late is a post-mortem input. An alert that fires too early is noise that erodes trust in the alerting system until engineers start ignoring it. After every incident, audit the alert that fired (or didn’t): was the threshold right? Was the right person notified? Was the runbook linked? Was the severity accurate?
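
The audit questions above can be captured as a small record per incident. This is a sketch under assumptions: the `AlertAudit` type and its field names are hypothetical, and it covers only late or missing alerts, not noisy ones, which need a different measure such as how often an alert fires without any action being taken.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AlertAudit:
    customer_impact_at: datetime
    alert_fired_at: Optional[datetime]  # None if no alert fired at all
    right_person_notified: bool
    runbook_linked: bool
    severity_accurate: bool

    def lead_time(self) -> Optional[timedelta]:
        """Positive if the alert beat customer impact, negative if it trailed."""
        if self.alert_fired_at is None:
            return None
        return self.customer_impact_at - self.alert_fired_at

    def findings(self) -> List[str]:
        """List every audit question this alert failed."""
        out = []
        lead = self.lead_time()
        if lead is None:
            out.append("no alert fired")
        elif lead < timedelta(0):
            out.append("alert fired after customer impact")
        if not self.right_person_notified:
            out.append("wrong person notified")
        if not self.runbook_linked:
            out.append("runbook not linked")
        if not self.severity_accurate:
            out.append("severity inaccurate")
        return out


# Hypothetical incident: impact began at 03:10, the alert fired at 03:18.
audit = AlertAudit(
    customer_impact_at=datetime(2026, 5, 8, 3, 10),
    alert_fired_at=datetime(2026, 5, 8, 3, 18),
    right_person_notified=True,
    runbook_linked=False,
    severity_accurate=True,
)
issues = audit.findings()
```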

Observability improvement questions after every incident
  • What metric would have given earlier warning? Does it exist? If not, add it.
  • Did the alert fire at the right time, or after the customer was already affected?
  • Was the alert severity accurate relative to the actual business impact?
  • Is the runbook linked from the alert? Is the runbook current?
  • Could this issue have been automated away? What is the effort vs. recurrence cost?
  • Is this tech debt? Log it. Name it. Prioritise it honestly against feature work.
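
The effort versus recurrence question in the list above is simple arithmetic once both sides are estimated. A rough sketch, assuming both figures are expressed in engineer-hours and ignoring the harder-to-price costs such as interrupted sleep or customer impact; the numbers in the example are invented:

```python
import math


def break_even_occurrences(automation_effort_hours, manual_hours_per_occurrence):
    """Recurrences after which automating is cheaper than fixing by hand."""
    if manual_hours_per_occurrence <= 0:
        raise ValueError("manual_hours_per_occurrence must be positive")
    return math.ceil(automation_effort_hours / manual_hours_per_occurrence)


# Hypothetical: 16 hours to automate, 1.5 hours per manual fix.
payback = break_even_occurrences(16, 1.5)
```

If the issue already recurs more often than the break-even count within a quarter, the honest prioritisation argument more or less writes itself.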

“Issues solved permanently improve the system.
Issues solved periodically improve the engineer.
Issues solved repeatedly improve nobody.”


07

The Handover

One of the most crucial — and most frequently underestimated — aspects of OnCall duties is the handover. A successful handover is not a formality. It is the final act of ownership. It is what transforms your OnCall rotation from an isolated experience into institutional memory.

Handover component | What to include | Why it matters
BAU status | State of all periodic jobs, any deviations resolved or ongoing | Incoming OnCall must know the baseline before they can detect the anomaly
Open anomalies | Active investigations, suspected causes, current mitigations in place | Prevents duplicate investigation and cold-start delay mid-incident
Pending action items | Items that need follow-up: permanent fixes, runbook updates, escalations | Ensures issues that occurred on your watch don’t silently disappear
Tech debt logged | Any recurring issues or gaps identified during the rotation | Converts OnCall signal into engineering backlog before it is forgotten
Observability changes | New alerts added, thresholds tuned, dashboards updated | Incoming OnCall should know what has changed in their monitoring landscape
System mood | Brief qualitative assessment: quiet, elevated, watchful | No dashboard captures the gut feel of an engineer who has just spent 12 hours watching a system
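
The six handover components can be made hard to skip by giving the handover note a fixed shape. The sketch below is one hypothetical way to do that; the `Handover` type is an assumption, not an established format. It distinguishes "nothing to report" (an empty list) from "not filled in" (`None`), because an empty section and a forgotten section look identical in free text.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Handover:
    bau_status: Optional[str] = None
    open_anomalies: Optional[List[str]] = None
    pending_action_items: Optional[List[str]] = None
    tech_debt_logged: Optional[List[str]] = None
    observability_changes: Optional[List[str]] = None
    system_mood: Optional[str] = None  # e.g. "quiet", "elevated", "watchful"

    def missing(self) -> List[str]:
        """Names of components never filled in; an empty list counts as filled."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) in (None, "")]


# A draft with three sections still untouched.
draft = Handover(
    bau_status="all periodic jobs green",
    open_anomalies=[],  # explicitly nothing open
    system_mood="quiet",
)
not_done = draft.missing()
```

A handover is ready to send when `missing()` returns an empty list, which is a low bar but a real one.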

While some OnCall duties can be offloaded to a dedicated team, being the primary point of contact cannot be delegated away. That means not just staying attentive to unexpected activity, but actively driving the team toward immediate mitigation. That responsibility is yours for the duration of the rotation, and it transfers cleanly only through a thorough handover.

The OnCall loop
  • Checklist → know your baseline before anything else.
  • BAU → keep the normal running so you can recognise the abnormal.
  • Anomalies → detect deviations early; investigate with structure.
  • Unanticipated issues → triage fast, communicate early, fall forward.
  • Tech debt & observability → convert every surprise into a permanent improvement.
  • Handover → transfer state, not just the pager.

OnCall is not a tax on engineering time. It is the feedback loop that makes systems — and engineers — better. The checklist, the BAU discipline, the anomaly literacy, the mitigation instinct, the observability habit, the clean handover: none of these are glamorous. All of them compound.

And finally — hope for a smooth and uneventful OnCall. 🤞

If you lead engineering teams, run distributed systems at scale, or think seriously about engineering culture and maintenance practices — I’d love to connect.

linkedin.com/in/pradeep  ·  Pune, India