Payments Engineering · Systems Architecture · May 2026

"Debit card users just don't complete transactions."
I didn't buy it.

How a single observation — a silent delay on one bank's OTP server — upended an industry-wide assumption and took card payment success rates from below 50% to above 70%.

Pradeep A. Dalvi
Computer Engineer · Distributed Systems & Payments
8 min read
<50%: Debit card success rate before the fix
70%+: Success rate after the redesign
30%: Transactions silently delayed by one bank's ACS

When I joined as the founding engineer of the card payments team, one number haunted every metrics review: debit card success rates were stuck below 50%. Credit cards sailed above 70%. The gap was real. The explanation, repeated with confidence across the industry, was:

"Debit card users have lower intent. They're less tech-savvy. It's a cohort problem."

It was a tidy story. Everyone believed it. And I thought it deserved a hard look.

The data that didn't fit

We had recently integrated a payment processor that gave us something rare: granular, step-level visibility into every card transaction — not just aggregate failure rates, but exactly where in the 3DS flow each transaction died. Not one pipeline metric. Individual bank, individual network, individual step.

One bank kept appearing. A major Indian bank, well known for its lunch breaks, contributed over 50% of our debit card volume and showed catastrophically poor success rates across every card network it issued on. The cohort theory said: their users aren't completing the flow. The data said: something else is going on.

"Every metric I trusted pointed to a systemic failure being buried under a user-behaviour excuse. In payments, that's expensive silence."

I looked at app drop-offs at the Pay button click stage. Users were abandoning right after hitting "Pay", before the OTP screen even appeared. That's not intent failure. That's the screen not loading fast enough to hold their attention. The cohort story was a partial truth being used as the complete truth.

The culprit: a silent, unpredictable delay

The 3DS authentication flow under a Pay button click involves two sequential steps before the OTP page is shown to the customer. These run synchronously under the same user-facing CTA interaction — and for every other bank and network, they were fast and deterministic.

3DS authentication flow — under Pay button click
1. PayerAuthEnrolmentCheck
   Verify the card is enrolled for 3DS. The payment gateway initiates this check; the issuer bank returns enrollment status. Fast and deterministic for all banks.

2. Submit PayerAuthReq → trigger OTP (the culprit)
   The bank's Access Control Server (ACS) fires the OTP to the cardholder's mobile and generates the authentication page. For this one bank, this step took dramatically longer than expected, unpredictably, over 30% of the time. The delay was invisible in aggregate monitoring.

3. PayerAuthRes
   Validate the OTP entered by the user and return the authentication result. Users who reach this step complete it normally.
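
To make the "slow, not broken" distinction concrete, here is a minimal sketch of timing each synchronous step per bank. It is not the production code: `run_pay_click_steps`, `StepTiming`, and the step callables are hypothetical stand-ins for whatever the gateway integration actually calls.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepTiming:
    bank: str
    step: str
    seconds: float


def run_pay_click_steps(
    bank: str,
    steps: List[Tuple[str, Callable[[], None]]],
    clock: Callable[[], float] = time.perf_counter,
) -> List[StepTiming]:
    """Run the steps in order and record how long each one blocks."""
    timings = []
    for name, call in steps:
        start = clock()
        call()  # synchronous: the user is waiting on this call
        timings.append(StepTiming(bank, name, clock() - start))
    return timings


def slowest_step(timings: List[StepTiming]) -> StepTiming:
    """Return the step that held the Pay click the longest."""
    return max(timings, key=lambda t: t.seconds)
```

Tagging every timing with the bank is the point of the sketch: a step that is fast for most issuers and slow for one only shows up when the duration is recorded alongside the issuer.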

The bank's ACS was slow. Not broken — slow. And unpredictably so. The architecture wasn't built to absorb that. I confirmed this by introducing two separate monitoring pools with config-driven bank-level filters and watching failure rates during peak transaction windows. The pattern was unmistakable once I looked at the right grain.
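
One plausible reading of "two monitoring pools with config-driven bank-level filters" is sketched below. The config shape, the pool names, and `failure_rate_by_pool` are assumptions made for illustration, not the actual setup.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical config: issuers listed here are tracked in their own pool.
MONITORING_CONFIG = {"watched_banks": ["BANK_A"]}


@dataclass(frozen=True)
class Txn:
    bank: str
    succeeded: bool


def pool_for(txn: Txn, config: dict) -> str:
    """Assign a transaction to a pool using the bank-level filter."""
    return "watched" if txn.bank in config["watched_banks"] else "baseline"


def failure_rate_by_pool(txns: Iterable[Txn], config: dict) -> Dict[str, float]:
    """Failure rate per pool; only pools that saw traffic are returned."""
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for txn in txns:
        pool = pool_for(txn, config)
        attempts[pool] += 1
        if not txn.succeeded:
            failures[pool] += 1
    return {pool: failures[pool] / attempts[pool] for pool in attempts}
```

Because the filter lives in config, a suspect bank can be moved into the watched pool without a code change, and the two failure rates can be compared over the same time window.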

This issue had been invisible because existing systemic metrics were focused at the payment processor level — not at the individual bank level. Aggregate monitoring is a lagging indicator. When failure concentrates in a subset of your traffic, it disappears into the noise of your averages.
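
The effect of grain can be shown with a small sketch. The records below are made-up illustrative numbers, not data from this incident, and the field names (`processor`, `bank`, `success`) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by(records: Iterable[dict], grain: Tuple[str, ...]) -> Dict[tuple, float]:
    """Success rate per group, where a group is the tuple of `grain` fields."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        key = tuple(rec[field] for field in grain)
        total[key] += 1
        if rec["success"]:
            ok[key] += 1
    return {key: ok[key] / total[key] for key in total}


# Illustrative only: one processor, two banks, ten attempts each.
records = (
    [{"processor": "P1", "bank": "BANK_A", "success": i < 4} for i in range(10)]
    + [{"processor": "P1", "bank": "BANK_B", "success": i < 8} for i in range(10)]
)

processor_view = success_rate_by(records, ("processor",))
bank_view = success_rate_by(records, ("processor", "bank"))
```

At the processor grain the sample shows a single blended rate of 0.6. At the bank grain the same records split into 0.4 and 0.8, which is the kind of gap a processor-level dashboard cannot show.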

Three fixes. One right answer.

Once the root cause was clear, the design space was narrow. Three approaches, each with a fundamentally different tradeoff between engineering effort and customer experience:

Option 1 (skipped): Increase timeouts globally
Pros: Quick config change. No code changes required. Ships in hours.
Cons: Inflates thread-pool consumption across all banks. Every merchant absorbs the worst offender's latency. Customer stares at a spinner, and CTA conversion drops.

Option 2 (skipped): Offload steps 1 & 2 after Pay click
Pros: Better thread efficiency. Steps process asynchronously without blocking the primary transaction path.
Cons: Customer still stares at a blank or static screen. Blank screen anxiety still triggers drop-offs. Doesn't solve the perceived experience.

Option 3 (chosen): Branded loader page with self-submitting form
Pros: Customer sees branded progress animation while steps 1 & 2 process in the background. Zero changes to intermediate systems or thread pools. Maximum perceived responsiveness.
Cons: Significant design changes required. Requires hosting a static loader page per transaction context. Implemented in two days.
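
A minimal sketch of what a loader page with a self-submitting form could look like, rendered server-side. The endpoint `/3ds/start`, the `txn_id` field, and `render_loader_page` are hypothetical; the real page would carry the branded animation and whatever fields the gateway requires.

```python
from html import escape
from typing import Dict


def render_loader_page(action_url: str, fields: Dict[str, str], brand_name: str) -> str:
    """Return a static page that posts its hidden form as soon as it loads."""
    # Escape every name and value so transaction data cannot break out of the markup.
    inputs = "\n".join(
        f'      <input type="hidden" name="{escape(k, quote=True)}" value="{escape(v, quote=True)}">'
        for k, v in fields.items()
    )
    return f"""<!doctype html>
<html>
  <head><title>{escape(brand_name)}: securing your payment</title></head>
  <body onload="document.forms[0].submit()">
    <!-- Placeholder for the branded progress animation -->
    <p>Connecting to your bank...</p>
    <form method="post" action="{escape(action_url, quote=True)}">
{inputs}
      <noscript><button type="submit">Continue</button></noscript>
    </form>
  </body>
</html>"""


page = render_loader_page("/3ds/start", {"txn_id": "txn-0001"}, "Example Pay")
```

The idea under this sketch is that the Pay click returns this page immediately, and the browser keeps showing it while the form post waits on steps 1 and 2. The slow ACS call is still slow, but the customer is looking at a progress screen instead of a frozen one.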

The third option required the most engineering work. It was also the only one that treated the customer's experience as a non-negotiable design constraint rather than a casualty of infrastructure limitations.

Result

Debit card success rates for this bank went from below 50% to above 70% — pulling the entire gateway's aggregate success rate up significantly. The payment processor noticed. They came to our business team repeatedly, asking what our "secret ingredient" was. Other large merchants on the same processor, with the same bank, were still stuck with the same poor rates.

What this really taught me

The cohort theory wasn't wrong — it was incomplete. Intent does vary by user segment. But intent cannot explain a systemic infrastructure failure wearing a user-behaviour mask. The danger isn't the wrong answer; it's the confident wrong answer that shuts down the inquiry.

The architectural principle: Aggregate monitoring is a lagging indicator. When failure is concentrated in a subset of your traffic — one bank, one network, one time window — average metrics will smooth it into background noise. You need the ability to slice at the right grain, instrument at the right layer, and build the observability before you need the insight.

I went on to build a Success Rate routing engine using a Decision Tree with fixed entropy and hierarchy — dynamically selecting the optimal payment gateway per transaction based on live SR signals. That system consistently outperformed internal ML models on success rate.
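
The routing engine itself is not described here in detail, so the following is only a loose sketch of one way a fixed hierarchy over live success-rate signals might work. It does not model the "fixed entropy" aspect, and `pick_gateway`, the stats layout, and `MIN_ATTEMPTS` are all assumptions.

```python
from typing import Dict, Tuple

# stats[level_key][gateway] = (successes, attempts) over a recent window.
Stats = Dict[Tuple[str, ...], Dict[str, Tuple[int, int]]]

MIN_ATTEMPTS = 50  # hypothetical floor below which a level is considered too noisy


def pick_gateway(
    bank: str,
    network: str,
    stats: Stats,
    default_gateway: str,
    min_attempts: int = MIN_ATTEMPTS,
) -> str:
    """Walk a fixed hierarchy from most to least specific and pick the best gateway."""
    for key in ((bank, network), (bank,), ()):
        rates = {
            gateway: successes / attempts
            for gateway, (successes, attempts) in stats.get(key, {}).items()
            if attempts >= min_attempts
        }
        if rates:
            return max(rates, key=rates.get)
    return default_gateway
```

The hierarchy is fixed in advance rather than learned: the most specific level with enough recent traffic decides, and sparse levels fall through to broader ones.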

But none of it would have been possible without first asking the right question: is this actually a user problem, or are we looking at the wrong data?

In payments engineering, that question is the job.

If you work on payment software, fintech infrastructure, or distributed systems at scale — I'd love to connect. The conversation is always more interesting than the metrics.

linkedin.com/in/pradeep  ·  Pune, India