When I joined as the founding engineer of the card payments team, one number haunted every metrics review: debit card success rates were stuck below 50%. Credit cards sailed above 70%. The gap was real. The explanation, repeated with confidence across the industry, was:
"Debit card users have lower intent. They're less tech-savvy. It's a cohort problem."
It was a tidy story. Everyone believed it. And I thought it deserved a hard look.
The data that didn't fit
We had recently integrated a payment processor that gave us something rare: granular, step-level visibility into every card transaction — not just aggregate failure rates, but exactly where in the 3DS flow each transaction died. Not one pipeline metric. Individual bank, individual network, individual step.
One bank kept appearing. A major Indian bank (well known for its lunch breaks) contributed over 50% of our debit card volume and showed catastrophically poor success rates across every card network it issued on. The cohort theory said: their users aren't completing the flow. The data said: something else is going on.
I looked at in-app drop-offs at the Pay button click. Users were abandoning right after hitting "Pay", before the OTP screen even appeared. That's not intent failure. That's the screen not loading fast enough to hold their attention. The cohort story was a partial truth being used as the complete truth.
The culprit: a silent, unpredictable delay
The 3DS authentication flow triggered by a Pay button click involves two sequential steps before the OTP page is shown to the customer. Both run synchronously within the same user-facing CTA interaction, and for every other bank and network they were fast and deterministic.
The bank's ACS was slow. Not broken — slow. And unpredictably so. The architecture wasn't built to absorb that. I confirmed this by introducing two separate monitoring pools with config-driven bank-level filters and watching failure rates during peak transaction windows. The pattern was unmistakable once I looked at the right grain.
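As a rough illustration of the monitoring-pool idea, here is a minimal Python sketch. It is not the production setup: the pool names, the `BANK_A` code, the `Txn` fields and both function names are hypothetical, chosen only to show how a config-driven bank filter can split traffic into pools and expose failure rates per pool and hour.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical config: pool name -> issuer banks routed into that pool.
# Bank codes are placeholders, not real identifiers.
POOL_CONFIG = {
    "watched": {"BANK_A"},
}
DEFAULT_POOL = "baseline"


@dataclass
class Txn:
    bank: str
    hour: int       # hour of day, 0-23
    success: bool


def assign_pool(txn, config=POOL_CONFIG):
    """Return the first pool whose bank filter matches, else the default pool."""
    for pool, banks in config.items():
        if txn.bank in banks:
            return pool
    return DEFAULT_POOL


def failure_rate_by_pool_and_hour(txns, config=POOL_CONFIG):
    """Map (pool, hour) -> failure rate in [0, 1]."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for t in txns:
        key = (assign_pool(t, config), t.hour)
        totals[key] += 1
        if not t.success:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}
```

Because the filter lives in config rather than code, moving another bank into the watched pool is a config change, and comparing the two pools hour by hour is what makes a peak-window pattern visible.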
This issue had been invisible because the existing system-level metrics were aggregated at the payment processor level, not at the individual bank level. Aggregate monitoring is a lagging indicator. When failure concentrates in a subset of your traffic, it disappears into the noise of your averages.
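To make the averaging effect concrete, here is a small Python sketch with made-up numbers (not our real data). Two hypothetical banks, `BANK_A` and `BANK_B`, each contribute half the volume; the `success_rates` helper and the transaction dicts are assumptions for illustration only.

```python
from collections import Counter


def success_rates(txns, key=lambda t: "all"):
    """Success rate per segment; `key` picks the grain to slice by."""
    total, ok = Counter(), Counter()
    for t in txns:
        k = key(t)
        total[k] += 1
        ok[k] += t["success"]  # True counts as 1, False as 0
    return {k: ok[k] / total[k] for k in total}


# Illustrative traffic: BANK_A succeeds 20 times out of 100, BANK_B 75 out of 100.
txns = (
    [{"bank": "BANK_A", "success": True}] * 20
    + [{"bank": "BANK_A", "success": False}] * 80
    + [{"bank": "BANK_B", "success": True}] * 75
    + [{"bank": "BANK_B", "success": False}] * 25
)

overall = success_rates(txns)                            # one blended number
by_bank = success_rates(txns, key=lambda t: t["bank"])   # one number per bank
```

In this toy data the blended rate is 47.5%, which reads like a uniformly weak cohort. Sliced by bank, it is one issuer at 20% and another at 75%, which is a very different problem.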
Three fixes. One right answer.
Once the root cause was clear, the design space was narrow. Three approaches, each with a fundamentally different tradeoff between engineering effort and customer experience:
The third option required the most engineering work. It was also the only one that treated the customer's experience as a non-negotiable design constraint rather than a casualty of infrastructure limitations.
What this really taught me
The cohort theory wasn't wrong — it was incomplete. Intent does vary by user segment. But intent cannot explain a systemic infrastructure failure wearing a user-behaviour mask. The danger isn't the wrong answer; it's the confident wrong answer that shuts down the inquiry.
The architectural principle: Aggregate monitoring is a lagging indicator. When failure is concentrated in a subset of your traffic — one bank, one network, one time window — average metrics will smooth it into background noise. You need the ability to slice at the right grain, instrument at the right layer, and build the observability before you need the insight.
I went on to build a Success Rate routing engine using a Decision Tree with fixed entropy and hierarchy — dynamically selecting the optimal payment gateway per transaction based on live SR signals. That system consistently outperformed internal ML models on success rate.
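For readers who want a feel for the idea, here is a heavily simplified Python sketch of routing on live success rates through a fixed hierarchy. It is an assumption-laden illustration, not the engine described above: the `Stats` shape, the `MIN_SAMPLES` threshold, the three-level hierarchy (bank plus network, then network, then global) and the `pick_gateway` name are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical live stats: segment key -> gateway -> (successes, attempts).
Stats = Dict[Tuple[str, ...], Dict[str, Tuple[int, int]]]

MIN_SAMPLES = 50  # assumed threshold below which a segment's SR is not trusted


def pick_gateway(bank: str, network: str, stats: Stats,
                 min_samples: int = MIN_SAMPLES) -> Optional[str]:
    """Walk a fixed hierarchy from most to least specific segment and return
    the gateway with the best success rate at the first level that has
    enough data. Returns None if no level qualifies."""
    hierarchy = [(bank, network), (network,), ()]
    for segment in hierarchy:
        candidates = {
            gateway: ok / attempts
            for gateway, (ok, attempts) in stats.get(segment, {}).items()
            if attempts >= min_samples
        }
        if candidates:
            return max(candidates, key=candidates.get)
    return None
```

The fixed ordering is the point of the sketch: the lookup always prefers the most specific segment with enough evidence and only then falls back to broader ones, so every routing decision is easy to trace.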
But none of it would have been possible without first asking the right question: is this actually a user problem, or are we looking at the wrong data?
In payments engineering, that question is the job.
If you work on payment software, fintech infrastructure, or distributed systems at scale — I'd love to connect. The conversation is always more interesting than the metrics.
linkedin.com/in/pradeep · Pune, India