What OnCall Actually Is
OnCall is one of the most critical functions for keeping BAU activities running uninterrupted. It provides direct, unfiltered observability into the systems you work on, the kind that design documents and code reviews cannot replicate. It surfaces gaps in well-designed, well-developed systems that only reveal themselves at scale and over time.
And yet it is one of the less appreciated, sometimes badly neglected activities among engineers, crowded out by the chase for new features and greenfield development, while time and scale quietly accumulate risk. Aging hardware. Growing tech debt. Increasing volume. These do not file tickets. They wait for OnCall to meet them.
“OnCall is the window when one has to be mentally prepared to expect the unexpected,
while deep down hoping it will be uneventful.”
The gap between expecting the unexpected and being unprepared for it is what this article tries to close. It is not a rigid prescription — it is a skeleton. You will add flesh to it from your own systems, your own stack, your own hard-won BAU baselines.
The Daily Checklist
The checklist is not a form you fill in when you go OnCall. It is a living document you maintain whether you are OnCall or not: built incrementally and continuously calibrated against what you have seen and what others have learned the hard way.
Four principles for a checklist that actually works
Prepare the checklist of systems you are responsible for and keep adding metrics over time, whether you are OnCall or not. A checklist written during an incident is a wish list, not a tool.
Every post-mortem — yours or another team’s — is a free lesson in what your checklist is missing. Improve it periodically, not just after your own misses. Borrowed experience is cheaper than earned pain.
Learn to identify gaps quickly and to favour immediate mitigation over rollback. Rollback is not always the safest or fastest path. Forward mitigation with eyes open beats blind reversal.
Most metrics will fall into one or more of four buckets. Structure your checklist around them — not by system, but by what the metric means when it moves.
Tech Metrics
Error rates, latency percentiles, queue depths, cache hit ratios, resource saturation. The health of the infrastructure beneath everything else.
Business Metrics
Transaction success rates, conversion, throughput of critical user flows. Where engineering health translates directly into customer experience and revenue.
Growth Metrics
Volume trends, scale headroom, capacity runway. The signals that tell you whether today’s system can handle next month’s load.
Compliance Metrics
Regulatory SLAs, audit trail completeness, data retention adherence. Non-negotiable, and rarely loud until they are already in breach.
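One way to structure a checklist around the four buckets is sketched below. It is a minimal illustration, not a prescribed format: the metric names, system names, and the `ChecklistItem` and `group_by_bucket` helpers are all hypothetical. The point is that each metric carries its bucket or buckets, so the checklist can be read by meaning rather than by system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Bucket(Enum):
    TECH = "tech"
    BUSINESS = "business"
    GROWTH = "growth"
    COMPLIANCE = "compliance"


@dataclass(frozen=True)
class ChecklistItem:
    metric: str                  # hypothetical metric name
    system: str                  # hypothetical system that emits it
    buckets: FrozenSet[Bucket]   # a metric may sit in more than one bucket


def group_by_bucket(items):
    """Index checklist metrics by bucket instead of by system."""
    grouped = {bucket: [] for bucket in Bucket}
    for item in items:
        for bucket in item.buckets:
            grouped[bucket].append(item.metric)
    return grouped


# Illustrative entries only; replace them with metrics from your own systems.
CHECKLIST = [
    ChecklistItem("p99_latency_ms", "payments-api", frozenset({Bucket.TECH})),
    ChecklistItem("txn_success_rate", "payments-api",
                  frozenset({Bucket.TECH, Bucket.BUSINESS})),
    ChecklistItem("daily_volume", "ledger", frozenset({Bucket.GROWTH})),
    ChecklistItem("audit_log_gap_count", "ledger",
                  frozenset({Bucket.COMPLIANCE})),
]

by_bucket = group_by_bucket(CHECKLIST)
```

A metric such as the transaction success rate appears under both the tech and business buckets here, which matches the "one or more" wording above.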
BAU Activities
BAU (Business As Usual) is not background noise. It is the baseline that everything else is measured against. OnCall’s first responsibility is to ensure BAU runs, runs correctly, and runs on time. This means owning three categories of routine work: tech, business, and compliance.
You cannot detect an anomaly until you know what normal looks like. Owning BAU is not a lesser task than responding to incidents. It is the precondition for recognising them.
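Knowing what normal looks like can be made concrete with a simple baseline band. The sketch below assumes a short history of samples for one metric and uses the median plus or minus a multiple of the median absolute deviation. That choice of statistic, the `tolerance` default, and the sample values are assumptions for illustration; your own baselines may need seasonality or per-hour profiles.

```python
from statistics import median


def baseline_band(history, tolerance=3.0):
    """Return (low, high) around the median using the median absolute deviation."""
    centre = median(history)
    mad = median(abs(x - centre) for x in history)
    # If every sample is identical, mad is 0 and the band collapses to a point.
    return centre - tolerance * mad, centre + tolerance * mad


def is_normal(value, history, tolerance=3.0):
    """True when value falls inside the baseline band derived from history."""
    low, high = baseline_band(history, tolerance)
    return low <= value <= high


# Made-up p99 latency samples (ms) from previous quiet days.
history = [120, 118, 125, 122, 119, 121, 123]

band = baseline_band(history)
quiet_reading = is_normal(124, history)
loud_reading = is_normal(180, history)
```

The median-based band is less sensitive to a single past outlier than a mean and standard deviation would be, which is why it is used in this sketch.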
Anomalies
Anomalies arrive in the same three categories as BAU — because that is where normal lives, and where deviations from it first become visible.
Tech Anomalies
Unexpected error rate spikes, latency shifts, resource saturation events, job failures, dependency degradation. Often the first signal, because the system always tells you before the customer does, if you are watching.
Business Anomalies
Transaction success rates outside historical range, throughput drops, processing delays that breach SLAs. These are where tech anomalies become customer-visible. The gap between a tech anomaly and the business anomaly it leads to is your detection window.
Compliance Anomalies
Regulatory reporting gaps, audit trail holes, retention breaches. The most expensive anomalies to resolve after the fact. Preventable almost entirely through consistent BAU monitoring.
The most effective anomaly detection is not reactive alerting — it is the accumulated intuition of an engineer who knows their system’s normal so well that something off feels wrong before the alert fires. That intuition is built through BAU ownership, not through incident response alone.
- Is this a deviation from normal, or is my normal baseline stale?
- When did this start? Is there a correlated deployment, config change, or traffic event?
- Is this isolated to one system, or spreading across a dependency boundary?
- Is the impact currently tech-only, or has it reached business or compliance metrics?
- Has this pattern appeared before? Is there a prior post-mortem or runbook entry?
- What is the blast radius if this continues for another 30 minutes without intervention?
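
The second question in the list, whether a deployment, config change, or traffic event correlates with the start of the anomaly, lends itself to a small helper. The sketch below assumes you can export change events with timestamps from your own tooling. The `ChangeEvent` type, the 60-minute lookback, and the timeline are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class ChangeEvent(NamedTuple):
    at: datetime
    kind: str          # e.g. "deploy", "config", "traffic"
    description: str


def changes_before(anomaly_start, events, lookback=timedelta(minutes=60)):
    """Return events inside the lookback window before the anomaly, newest first."""
    window_start = anomaly_start - lookback
    in_window = [e for e in events if window_start <= e.at <= anomaly_start]
    return sorted(in_window, key=lambda e: e.at, reverse=True)


# Made-up timeline for illustration.
anomaly_start = datetime(2024, 1, 5, 3, 10)
events = [
    ChangeEvent(datetime(2024, 1, 4, 18, 0), "deploy", "ledger v41"),
    ChangeEvent(datetime(2024, 1, 5, 2, 40), "config", "pool size 50 -> 20"),
    ChangeEvent(datetime(2024, 1, 5, 3, 5), "deploy", "payments-api v112"),
    ChangeEvent(datetime(2024, 1, 5, 3, 30), "deploy", "search v9"),
]

suspects = changes_before(anomaly_start, events)
```

The result is a list of suspects, not a verdict. A change that landed just before the anomaly is worth checking first, but proximity in time does not establish cause.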
Unanticipated Issues & Mitigation
Even with a complete checklist, a deep BAU baseline, and sharp anomaly detection — the unanticipated still arrives. That is the definition of OnCall. The goal is not to eliminate surprise. It is to reduce the time between surprise and controlled response.
Mitigation sequence — from first signal to resolution
Confirm the signal is real, not a monitoring artefact. Establish impact scope immediately: is this tech-only, business-impacting, or compliance-touching? The triage answer determines everything that follows.
Notify stakeholders at the start of investigation, not after resolution. A short, factual early message — “investigating an anomaly in X, impact Y, update in Z mins” — costs thirty seconds and prevents thirty minutes of escalation noise.
Identify the fastest path to a stable state. This is frequently not a rollback. Prefer to fall forward where you can: a targeted mitigation with known trade-offs beats an untested rollback that may introduce new unknowns.
OnCall’s main responsibility during an incident is not to fix everything alone. It is to be the primary point of contact and to drive the team toward immediate mitigation steps: assigning, unblocking, and keeping the thread moving.
Every unanticipated issue leaves behind pending action items: permanent fixes, runbook updates, alerting improvements, tech debt logged. Closing an incident without capturing these is how the same issue becomes next rotation’s problem.
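The triage and early-notification steps in the sequence above can be sketched as two small functions. The names `Scope`, `triage_scope`, and `early_update` are hypothetical. The rule that compliance impact outranks business impact is an assumption made for this example, not something the sequence prescribes.

```python
from enum import Enum


class Scope(Enum):
    TECH = "tech-only"
    BUSINESS = "business-impacting"
    COMPLIANCE = "compliance-touching"


def triage_scope(business_breached, compliance_breached):
    """Pick the widest impact scope seen so far (ordering is an assumption)."""
    if compliance_breached:
        return Scope.COMPLIANCE
    if business_breached:
        return Scope.BUSINESS
    return Scope.TECH


def early_update(system, scope, next_update_mins):
    """Format the short, factual message sent at the start of investigation."""
    return (f"Investigating an anomaly in {system}, impact {scope.value}, "
            f"update in {next_update_mins} mins.")


# "payments-api" is a placeholder system name.
message = early_update("payments-api", triage_scope(True, False), 15)
```

Keeping the message to three fields (what, impact, when the next update comes) is what makes it cheap enough to send before the investigation has produced anything.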
Tech Debt & Improving Observability
OnCall is the best source of signal on where your system is quietly accumulating debt. Not the kind that shows up in a sprint retro — the kind that only reveals itself under load, at 3 AM, on a Friday. Every unanticipated issue is an X-ray of a place where the system’s design assumptions no longer match reality.
Tech debt surfaced by OnCall
Issues that appear periodically and are manually resolved each time are not BAU. They are tech debt wearing a BAU costume. If something can be automated, automate it. If it recurs on a cadence, it deserves a permanent fix. The classification matters: a periodic manual task is either accepted operational work or unresolved engineering debt. Name it correctly.
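Naming a recurring manual task correctly is easier with a rough number attached. The sketch below is simple arithmetic under stated assumptions: the effort and frequency figures are invented, and the helper names are hypothetical. It only compares engineer hours and ignores incident risk, which in practice often matters more.

```python
def recurrence_cost_hours(occurrences_per_month, hours_per_occurrence, months):
    """Total manual effort spent on a recurring issue over a period."""
    return occurrences_per_month * hours_per_occurrence * months


def break_even_months(fix_effort_hours, occurrences_per_month, hours_per_occurrence):
    """Months after which a permanent fix has paid for itself, or None if it never recurs."""
    monthly_cost = occurrences_per_month * hours_per_occurrence
    if monthly_cost <= 0:
        return None
    return fix_effort_hours / monthly_cost


# Invented figures: a job that needs a 2.5 hour manual fix about 4 times a month,
# against an estimated 40 hours to automate it.
yearly_cost = recurrence_cost_hours(4, 2.5, 12)
payback = break_even_months(40, 4, 2.5)
```

With these invented figures the manual work costs 120 hours a year and the fix pays for itself in 4 months, which is the kind of statement that helps when prioritising against feature work.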
Improving metrics
After every non-trivial incident, ask: what metric would have told us this was coming ten minutes earlier? Add it. Observability is not a project with a completion date. It is a continuous refinement driven by everything the system has surprised you with.
Improving alerting
An alert that fires too late is a post-mortem input. An alert that fires too early is noise that erodes trust in the alerting system until engineers start ignoring it. After every incident, audit the alert that fired (or didn’t): was the threshold right? Was the right person notified? Was the runbook linked? Was the severity accurate?
- What metric would have given earlier warning? Does it exist? If not, add it.
- Did the alert fire at the right time, or after the customer was already affected?
- Was the alert severity accurate relative to the actual business impact?
- Is the runbook linked from the alert? Is the runbook current?
- Could this issue have been automated away? What is the effort vs. recurrence cost?
- Is this tech debt? Log it. Name it. Prioritise it honestly against feature work.
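
The question of whether an alert fired before or after customers were affected can be checked by replaying the incident's samples against candidate thresholds. The sketch below assumes you have timestamped samples of the metric and a known time when customer impact began. The sample values, thresholds, and timestamps are made up, and `first_breach` and `lead_time` are hypothetical helpers.

```python
from datetime import datetime, timedelta


def first_breach(samples, threshold):
    """Return the timestamp of the first sample at or above threshold, else None."""
    for at, value in samples:
        if value >= threshold:
            return at
    return None


def lead_time(samples, threshold, impact_start):
    """Warning time before customer impact; a negative value means the alert was late."""
    breached_at = first_breach(samples, threshold)
    if breached_at is None:
        return None
    return impact_start - breached_at


# Made-up error-rate samples (%), one every 5 minutes.
start = datetime(2024, 1, 5, 3, 0)
values = [0.4, 0.6, 1.2, 2.5, 5.1, 9.0]
samples = [(start + timedelta(minutes=5 * i), v) for i, v in enumerate(values)]
impact_start = start + timedelta(minutes=18)

current = lead_time(samples, 5.0, impact_start)    # threshold in place today
candidate = lead_time(samples, 1.0, impact_start)  # proposed lower threshold
```

In this invented timeline the existing threshold fires 2 minutes after impact, while the lower one would have given 8 minutes of warning. Before adopting a lower threshold, replay it against quiet periods as well, since a threshold that fires too early becomes noise.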
“Issues solved permanently improve the system.
Issues solved periodically improve the engineer.
Issues solved repeatedly improve nobody.”
The Handover
One of the most crucial — and most frequently underestimated — aspects of OnCall duties is the handover. A successful handover is not a formality. It is the final act of ownership. It is what transforms your OnCall rotation from an isolated experience into institutional memory.
| Handover component | What to include | Why it matters |
|---|---|---|
| BAU status | State of all periodic jobs, any deviations resolved or ongoing | Incoming OnCall must know the baseline before they can detect the anomaly |
| Open anomalies | Active investigations, suspected causes, current mitigations in place | Prevents duplicate investigation and cold-start delay mid-incident |
| Pending action items | Items that need follow-up: permanent fixes, runbook updates, escalations | Ensures issues that occurred on your watch don’t silently disappear |
| Tech debt logged | Any recurring issues or gaps identified during the rotation | Converts OnCall signal into engineering backlog before it is forgotten |
| Observability changes | New alerts added, thresholds tuned, dashboards updated | Incoming OnCall should know what has changed in their monitoring landscape |
| System mood | Brief qualitative assessment: quiet, elevated, watchful | No dashboard captures the gut feel of an engineer who has just spent 12 hours watching a system |
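
The handover components in the table can be captured in a small structure so that no section is silently skipped. This is a sketch under assumptions: the `HandoverNote` class and its field names simply mirror the table rows, and the rendered text format is arbitrary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoverNote:
    bau_status: str
    system_mood: str   # e.g. "quiet", "elevated", "watchful"
    open_anomalies: List[str] = field(default_factory=list)
    pending_actions: List[str] = field(default_factory=list)
    tech_debt: List[str] = field(default_factory=list)
    observability_changes: List[str] = field(default_factory=list)

    def render(self):
        """Render every section, writing 'none' where there is nothing to report."""
        sections = [
            ("BAU status", [self.bau_status]),
            ("Open anomalies", self.open_anomalies),
            ("Pending action items", self.pending_actions),
            ("Tech debt logged", self.tech_debt),
            ("Observability changes", self.observability_changes),
            ("System mood", [self.system_mood]),
        ]
        lines = []
        for title, entries in sections:
            lines.append(f"{title}:")
            # An explicit "none" shows the section was considered, not forgotten.
            lines.extend(f"  - {entry}" for entry in (entries or ["none"]))
        return "\n".join(lines)


# Placeholder content for illustration.
note = HandoverNote(
    bau_status="All periodic jobs completed on time",
    system_mood="watchful",
    open_anomalies=["Elevated p99 latency on payments-api, cause not yet confirmed"],
)
text = note.render()
```

Making BAU status and system mood required fields, with the rest defaulting to empty lists, reflects the idea that the incoming engineer always needs the baseline even when nothing else happened.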
While some OnCall duties can be offloaded to a dedicated team, the role of primary point of contact cannot be delegated away. That role is more than attentiveness to unexpected activity; it means actively driving the team toward immediate mitigation. The responsibility is yours for the duration of the rotation, and it transfers cleanly only through a thorough handover.
- Checklist → know your baseline before anything else.
- BAU → keep the normal running so you can recognise the abnormal.
- Anomalies → detect deviations early; investigate with structure.
- Unanticipated issues → triage fast, communicate early, fall forward.
- Tech debt & observability → convert every surprise into a permanent improvement.
- Handover → transfer state, not just the pager.
OnCall is not a tax on engineering time. It is the feedback loop that makes systems — and engineers — better. The checklist, the BAU discipline, the anomaly literacy, the mitigation instinct, the observability habit, the clean handover: none of these are glamorous. All of them compound.
If you lead engineering teams, run distributed systems at scale, or think seriously about engineering culture and maintenance practices — I’d love to connect.
linkedin.com/in/pradeep · Pune, India