When Test Automation Creates False Confidence

High automation coverage does not mean high release confidence. In some BFSI organisations it means the opposite — a team that trusts its green dashboards so completely that it stops asking the questions that would reveal the real risk.

The counterintuitive problem with good automation numbers

Test automation is one of the most significant investments a BFSI engineering team makes. Done well, it accelerates release cycles, reduces regression burden, and builds genuine confidence that the system behaves as expected. Done poorly — or done well against the wrong things — it creates something more dangerous than no automation at all: a false sense of security.

I have reviewed automation suites at banks, insurers, and financial technology firms that had coverage percentages in the 80s and 90s. In several cases, those teams had experienced more production incidents after building the automation than before it. Not because the automation was badly written. Because nobody had asked whether they were automating the right things.

Coverage is a metric of volume. It tells you how much of your codebase is exercised by automated tests, or how many of your test cases have been automated. It tells you nothing about whether that coverage corresponds to the journeys that actually cause incidents when they fail in production.

“85% automation coverage of the wrong journeys delivers less release confidence than 40% coverage of the right ones. The number is real. What it measures is not.”

The five signals that your automation is creating false confidence

These are not theoretical failure modes. They are patterns I have observed repeatedly across BFSI automation programmes, often identified only after a production incident has already occurred.

Your flaky test rate is above 5% and people have stopped caring. A flaky test, one that passes and fails intermittently without a code change, is not a minor inconvenience. It is a signal that your automation has become unreliable as evidence. When engineers start re-running failing tests without investigating why they failed, the automation has lost its function. It is producing noise, not signal. At a 10% flaky test rate, red results are so routine that engineers learn to treat any failure as probably flaky, and a genuine defect is dismissed along with the noise before anyone has confirmed what it is. This is how production incidents get missed.
Your critical user journeys are not explicitly identified and prioritised in your coverage strategy. Most automation coverage grows organically: developers write unit tests for the code they build, and QA engineers automate the test cases that are easiest to automate. The result is dense coverage of low-risk areas and thin or absent coverage of high-risk journeys. In a retail bank, the payment confirmation flow, the account balance calculation, and the overdraft limit check are high-risk journeys. If you cannot point to explicit automation coverage of these specific journeys by name, your coverage percentage is not telling you what you think it is.
Your automation suite has not been fully executed in the past five business days. An automation suite that exists but is not regularly executed is not a test suite — it is a historical record of tests that once ran. Codebases change. Environments change. A suite that was reliable six months ago and has not been run since may have accumulated silent failures that nobody has seen. If the last full execution was more than a week ago, the green dashboard showing the results of that run is not current evidence. It is a photograph of a past state.
Your automation validates what the system does rather than what it should do. This is the subtlest and most common form of false confidence. When automation is written against existing system behaviour rather than against specified requirements, it tends to pass whenever the system behaves consistently — even if the consistent behaviour is wrong. A payment rounding error that has existed since the system was built will pass every automated test that was written to validate the current behaviour. The test is not detecting defects. It is detecting stasis.
Nobody can explain why any specific test is in the regression suite. In a healthy automation programme, every test exists for a reason — a risk it mitigates, a journey it validates, a defect class it prevents recurring. In a suite that has grown organically over years, tests accumulate without owners, without clear purpose, and without retirement criteria. A regression suite of 3,000 tests where nobody can explain the rationale for a third of them is not a quality asset. It is technical debt wearing the costume of quality assurance.
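
The fourth signal is easiest to see in a small example. The sketch below is illustrative only: `payment_fee`, the 1.25% rate, and the round-half-up requirement are assumptions invented for the example, not taken from any real system. It shows a fee calculation that truncates to the cent, one check whose expected value was copied from current behaviour, and one whose expected value was worked out by hand from the assumed requirement.

```python
from decimal import Decimal, ROUND_DOWN

CENT = Decimal("0.01")


def payment_fee(amount: Decimal, rate: Decimal) -> Decimal:
    """Hypothetical legacy implementation: truncates to the cent.

    The assumed requirement is round-half-up, so this is a long-standing defect.
    """
    return (amount * rate).quantize(CENT, rounding=ROUND_DOWN)


def check_against_current_behaviour() -> bool:
    # Expected value copied from what the system returned when the test was
    # written. It keeps passing for as long as the defect stays in place.
    return payment_fee(Decimal("10.20"), Decimal("0.0125")) == Decimal("0.12")


def check_against_specification() -> bool:
    # Expected value derived by hand from the assumed requirement:
    # 10.20 * 0.0125 = 0.1275, which rounds half up to 0.13.
    return payment_fee(Decimal("10.20"), Decimal("0.0125")) == Decimal("0.13")
```

Against the same code, the first check passes and the second fails. Only the second is evidence about correctness; the first is evidence that nothing has changed.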

What good looks like

An automation suite that builds genuine release confidence has three characteristics: it is built against explicitly prioritised risk — the journeys that matter most to the business are covered first and most deeply; it is maintained as a living asset — flaky tests are investigated and resolved, not bypassed; and it is interpreted by humans, not trusted blindly — a green result is the start of the release confidence conversation, not the end of it.

The coverage metric worth tracking instead

Replace overall automation coverage percentage with a single, more revealing metric: critical journey automation coverage.

Start by listing the ten to twenty user journeys that, if they failed in production, would cause the most significant business impact. In a retail bank, this might be: payment initiation, payment confirmation, account balance display, overdraft calculation, direct debit processing, standing order execution, and account opening. In an insurer: policy issuance, claims submission, premium calculation, and renewal processing.

For each journey, ask: do we have automated regression coverage that would detect a meaningful failure in this journey before it reaches production? Not coverage of the code that implements the journey — coverage of the journey itself, end to end, at the level of granularity that would catch the failures that actually matter.

Track this number separately from overall coverage. Report it to leadership separately. A team with 40% overall coverage but 95% critical journey coverage is in a better position than a team with 85% overall coverage and 60% critical journey coverage. The number that matters is the one that correlates with production incidents, not the one that looks most impressive on a dashboard.
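
The calculation itself is simple once the journey list exists. As a minimal sketch, assume a hand-maintained mapping from each critical journey to the end-to-end regression tests that exercise it; the function name, the mapping, and the test identifiers below are all hypothetical.

```python
def critical_journey_coverage(critical_journeys, e2e_tests_by_journey):
    """Return (coverage, gaps) for a list of critical journeys.

    A journey counts as covered only if at least one end-to-end regression
    test is mapped to it. Whether those tests are adequate is a separate,
    human judgement that this number does not capture.
    """
    if not critical_journeys:
        raise ValueError("List the critical journeys before measuring coverage")
    gaps = [j for j in critical_journeys if not e2e_tests_by_journey.get(j)]
    covered = len(critical_journeys) - len(gaps)
    return covered / len(critical_journeys), gaps


# Illustrative data only: journey names come from the retail bank example,
# test identifiers are made up.
journeys = [
    "payment initiation",
    "payment confirmation",
    "account balance display",
    "overdraft calculation",
]
tests = {
    "payment initiation": ["e2e_pay_001"],
    "payment confirmation": ["e2e_pay_002", "e2e_pay_003"],
    "account balance display": ["e2e_bal_001"],
    "overdraft calculation": [],
}
coverage, gaps = critical_journey_coverage(journeys, tests)
print(f"Critical journey coverage: {coverage:.0%}; uncovered: {gaps}")
```

Returning the list of uncovered journeys alongside the percentage is deliberate: the names are what leadership can act on.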

AI-assisted coverage analysis changes what is possible

Identifying coverage gaps in a large automation suite used to require manual review — hours of examining test inventories against risk registers. AI-assisted coverage analysis can accelerate this significantly, mapping existing test coverage against identified risk areas and surfacing gaps that would take days to find manually.

The output still requires senior engineering interpretation. AI can identify that the payment confirmation journey has only three automated tests with no negative path coverage. It cannot tell you whether three tests are adequate, or which negative paths are most likely to cause a production incident in your specific environment. That judgement remains the work of an experienced engineer who understands the system, the risk profile, and the regulatory context.

AI accelerates the evidence gathering. It does not replace the assessment.
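
To make the division of labour concrete, here is a sketch of the kind of evidence summary such an analysis might hand to an engineer. It is not an AI technique; it assumes the hard part, mapping each test to a journey and labelling it as a positive or negative path, has already been done by a tool or by hand. The tuple layout and the function name are assumptions for illustration.

```python
from collections import Counter


def journey_evidence(tests):
    """Summarise test counts per journey.

    tests: iterable of (test_id, journey, path), where path is "positive"
    or "negative". Returns {journey: {"total": n, "negative": m}}.
    """
    counts = {}
    for _test_id, journey, path in tests:
        counts.setdefault(journey, Counter())[path] += 1
    return {
        journey: {"total": sum(c.values()), "negative": c["negative"]}
        for journey, c in counts.items()
    }
```

A result such as `{"total": 3, "negative": 0}` for payment confirmation is the evidence. Deciding whether it is acceptable is the assessment.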

What to do this week

List your ten highest-risk user journeys. If you cannot list them in fifteen minutes, that is itself a finding. The inability to name the journeys that matter most is a governance gap.
Check your flaky test rate over the last 30 days. If it is above 5%, investigate the top five flakiest tests this week. Do not re-run them. Understand them.
Find out when your full regression suite last ran. If it was more than five business days ago, run it now and treat the results as a fresh baseline, not a confirmation of prior results.
Ask someone who did not build the suite to explain why the last ten tests in the inventory exist. If they cannot, you have found your starting point for a coverage review.
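
The second and third checks can usually be scripted from CI history. The sketch below rests on stated assumptions: run history is available as `(test_name, commit_sha, passed)` tuples, a test counts as flaky when it has both passed and failed on the same commit, the flaky rate is the share of tests that are flaky, and business days are Monday to Friday with public holidays ignored. Your CI tooling may define these differently.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta


def flaky_report(runs, top=5):
    """Return (flaky_rate, flakiest) from run history.

    flakiest is a list of (test_name, n), where n is the number of commits
    on which that test produced both a pass and a fail.
    """
    outcomes = defaultdict(set)
    tests = set()
    for test, commit, passed in runs:
        tests.add(test)
        outcomes[(test, commit)].add(bool(passed))
    mixed = Counter(t for (t, _), seen in outcomes.items() if len(seen) == 2)
    rate = len(mixed) / len(tests) if tests else 0.0
    return rate, mixed.most_common(top)


def business_days_since(last_run: date, today: date) -> int:
    """Count Monday-to-Friday days after last_run, up to and including today."""
    days = 0
    current = last_run
    while current < today:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def suite_is_stale(last_run: date, today: date, limit: int = 5) -> bool:
    """True if the last full run is more than `limit` business days old."""
    return business_days_since(last_run, today) > limit
```

The report gives you the list of tests to investigate, not a verdict. A test that fails on one commit and passes on the next is not counted as flaky here, because a code change sits between the two results.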

Anthony Adeloye

Founder & Principal Consultant, CalyTeQ

Anthony has 28 years of quality engineering experience across Banking, Financial Services, Insurance, Government, and Enterprise technology. He has reviewed and rebuilt automation suites at Deutsche Bank, Commerzbank, Fujitsu, EY, Sky UK, and IBM, including programmes where high coverage numbers were masking significant gaps in release readiness.

Automation should build confidence.
Not disguise risk.