Why Most Software Releases Fail


The organisations experiencing the most production incidents are not the ones doing the least testing. They are often the ones doing the most, but measuring the wrong things. The gap between QA activity and release confidence is where incidents are born.

The testing paradox

Over my career I have seen testing volumes increase dramatically (more automation, more tooling, more dashboards), while the frequency of significant production incidents has not fallen at the same rate. In some BFSI (banking, financial services, and insurance) organisations, it has risen.

This is not a coincidence. It is a structural problem. And it is one that senior technology leaders are increasingly aware of, even if they cannot always name it precisely. The symptom is a familiar one: your QA team delivers a green dashboard before every release, and yet your on-call engineers are still firefighting at 2am.

The cause is almost never a lack of testing. It is a lack of genuine confidence in what the testing means.

“More testing does not equal more confidence. It can equal more data, which is a very different thing.”

What “release confidence” actually means

When a BFSI board or a CTO asks “are we ready to release?”, they are not asking for a test results summary. They are asking a specific risk question: what is the probability that this release causes a significant incident, and what is the business impact if it does?

Most QA processes are not designed to answer that question. They are designed to answer a different one: have we completed the agreed testing activities? That is a compliance question, not a risk question. Answering it correctly tells you whether the team did what it said it would do. It tells you almost nothing about whether the system is safe to release.

The distinction matters enormously in regulated financial services environments, where the cost of a Sev-1 production incident is not just technical. It is operational, reputational, regulatory, and in some cases existential.

The five failure modes I see repeatedly

Across decades of quality engineering work at Deutsche Bank, Commerzbank, Fujitsu, EY, Sky UK, and the Scottish Government, the same patterns recur. BFSI production incidents are almost always traceable to one or more of these five structural failures:

Automation coverage that misses the critical path. High automation coverage percentages that do not correspond to the highest-risk user journeys. A system with 85% automation coverage might have zero automated coverage of the payment confirmation flow that processes 60% of transaction value.
Performance assumptions that have never been validated. NFRs defined at the start of a programme and never formally tested against the production configuration. The test environment behaved correctly. Production, with real data volumes and real network conditions, did not.
Governance that produces evidence but not assurance. Sign-off processes that confirm checklists were completed, not that the system is ready. The distinction is subtle but the consequences are not.
Defect triage that accepts too much risk. Severity-3 defects accepted for release because “we’ll fix them in the next sprint.” Three of those accepted Sev-3s interact in production in a way nobody anticipated, and the result is a Sev-1.
The trust gap between delivery and leadership. Delivery teams who believe the system is not ready but lack the evidence framework to make that case to a programme director under deadline pressure. The release proceeds. The incident follows.
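
The first failure mode can be made concrete with a short sketch. It assumes a hypothetical inventory of user journeys, each with a share of transaction value and counts of total and automated test cases. The journey names, the figures, and the value-weighted formula are illustrative assumptions chosen to mirror the 85% / 60% example above, not data from a real system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Journey:
    """A user journey in a hypothetical test inventory (illustrative only)."""
    name: str
    value_share: float     # fraction of total transaction value (0.0 to 1.0)
    total_cases: int       # test cases defined for this journey
    automated_cases: int   # of those, how many are automated

    @property
    def automation_ratio(self) -> float:
        # A journey with no test cases at all counts as uncovered.
        if self.total_cases == 0:
            return 0.0
        return self.automated_cases / self.total_cases


def raw_coverage(journeys: Sequence[Journey]) -> float:
    """Automated test cases as a share of all test cases (the usual dashboard number)."""
    total = sum(j.total_cases for j in journeys)
    automated = sum(j.automated_cases for j in journeys)
    return automated / total if total else 0.0


def value_weighted_coverage(journeys: Sequence[Journey]) -> float:
    """Automation ratio of each journey, weighted by the transaction value it carries."""
    return sum(j.value_share * j.automation_ratio for j in journeys)


# Hypothetical figures: 85 of 100 test cases are automated,
# but none of them touch the journey carrying 60% of transaction value.
JOURNEYS = [
    Journey("payment_confirmation", value_share=0.60, total_cases=15, automated_cases=0),
    Journey("balance_enquiry", value_share=0.25, total_cases=45, automated_cases=45),
    Journey("profile_update", value_share=0.15, total_cases=40, automated_cases=40),
]

print(f"raw coverage:            {raw_coverage(JOURNEYS):.0%}")             # 85%
print(f"value-weighted coverage: {value_weighted_coverage(JOURNEYS):.0%}")  # 40%
```

The same test suite reports 85% on one measure and 40% on the other. Weighting by transaction value is only one possible choice; regulatory exposure or customer impact could serve as the weight instead.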

The pattern in numbers

DORA research consistently shows that organisations in the bottom quartile of software delivery performance have a change failure rate of 46%: nearly half of all production changes cause incidents requiring hotfixes or rollbacks. Elite performers sit at below 5%. The gap is not testing volume. It is the quality of the risk assessment that happens before each release.
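
For readers who want to track this figure themselves, change failure rate is simply the share of production changes that went on to need a hotfix or rollback. The sketch below is a minimal illustration; the `Change` record, its field names, and the sample data are hypothetical and would map onto whatever a change management or deployment system actually records.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Change:
    """One production change (hypothetical record shape)."""
    change_id: str
    caused_incident: bool  # True if a hotfix or rollback was needed afterwards


def change_failure_rate(changes: Sequence[Change]) -> float:
    """Failed changes divided by all changes in the period."""
    if not changes:
        raise ValueError("no changes recorded for this period")
    failed = sum(1 for c in changes if c.caused_incident)
    return failed / len(changes)


# Illustrative data only: 50 changes in a period, 23 of which needed remediation.
changes = [Change(f"CHG-{i:03d}", caused_incident=(i < 23)) for i in range(50)]
rate = change_failure_rate(changes)
print(f"change failure rate: {rate:.0%}")  # 46%
```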

Why AI makes this problem harder before it makes it easier

The emergence of AI-assisted testing tools has added a new dimension to this challenge. AI can generate test cases at a speed and volume that human testers cannot match. It can identify coverage gaps, suggest edge cases, and analyse test results with unprecedented speed.

All of this is genuinely useful. But it compounds the existing problem if the underlying framework is wrong. More AI-generated test cases, applied against a system without a robust risk-prioritised coverage strategy, produce more data that still does not answer the release confidence question.

The value of AI in quality engineering is not in generating volume. It is in accelerating the analysis that a senior engineer then interprets. The interpretation remains a human responsibility: the judgement about what the evidence means, and what risk it represents to a specific organisation releasing a specific system into a specific regulatory environment.

This is not a popular thing to say in an industry that is deeply excited about AI. It is, however, accurate.

“AI in quality engineering accelerates the collection of evidence. It does not replace the judgement required to interpret it at board level.”

What needs to change

The shift required is not primarily a technical one. It is a framing shift: from quality as a delivery activity to quality as a risk management function. Three changes make the most practical difference:

Replace coverage metrics with confidence metrics. Stop reporting automation coverage as a percentage of test cases. Start reporting it as a percentage of the highest-risk user journeys that have validated, reliable automated coverage. One number is a compliance metric. The other is a release risk indicator.
Formalise the Go / Go with Conditions / No-Go decision. Every production release should have a documented risk position, not just a sign-off checklist. The Go with Conditions model is particularly valuable: it allows releases to proceed with explicit risk acceptance and defined compensating controls, rather than the binary pressure of go-or-delay.
Introduce independent assurance at the release gate. The delivery team that built the system cannot provide independent assurance that the system is ready. This is not a criticism of delivery teams: it is a structural reality. The same independence principle that governs financial audits applies to high-stakes software releases in regulated environments.
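
The Go / Go with Conditions / No-Go decision can be recorded as structured data rather than as a checklist tick. The sketch below is one possible shape under stated assumptions, not a prescribed model: the `OpenRisk` fields and the rule that every open risk needs both a named acceptor and a compensating control are illustrative choices.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Decision(Enum):
    GO = "Go"
    GO_WITH_CONDITIONS = "Go with Conditions"
    NO_GO = "No-Go"


@dataclass(frozen=True)
class OpenRisk:
    """A known risk still open at the release gate (hypothetical record shape)."""
    description: str
    accepted_by: Optional[str] = None           # named owner who explicitly accepted the risk
    compensating_control: Optional[str] = None  # e.g. feature flag, extra monitoring

    @property
    def is_conditioned(self) -> bool:
        # A risk only supports a conditional release if it has BOTH
        # an explicit acceptance and a defined compensating control.
        return bool(self.accepted_by) and bool(self.compensating_control)


def release_decision(open_risks: Sequence[OpenRisk]) -> Decision:
    """Derive a documented risk position from the open risks (assumed rule set)."""
    if not open_risks:
        return Decision.GO
    if all(risk.is_conditioned for risk in open_risks):
        return Decision.GO_WITH_CONDITIONS
    return Decision.NO_GO


# Illustrative usage with made-up risks.
risks = [
    OpenRisk(
        description="Sev-3: statement export times out for very large accounts",
        accepted_by="Programme Director",
        compensating_control="Export disabled behind a feature flag for affected accounts",
    ),
    OpenRisk(description="Sev-3: intermittent retry on payment confirmation"),
]
print(release_decision(risks).value)      # No-Go: the second risk has no acceptance or control
print(release_decision(risks[:1]).value)  # Go with Conditions
```

The point of the structure is that a risk nobody has explicitly accepted cannot pass quietly as a condition; it forces a No-Go until someone owns it.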

The commercial case is straightforward

A Sev-1 production incident in a retail banking environment carries, conservatively, between £15,000 and £150,000 in direct costs: engineering time, incident management, regulatory reporting, and customer remediation. The reputational cost is harder to quantify and typically larger.

A structured release risk baseline engagement costs a fraction of a single incident. An Executive Release Assurance assessment for a major release costs less than the overtime bill for the incident response team if that release fails.

The question is not whether senior quality engineering advisory is affordable. It is whether the cost of not having it is acceptable. In most cases, for most BFSI organisations, it is not.

Anthony Adeloye

Founder & Principal Consultant, CalyTeQ

Anthony Adeloye is the Founder and Principal Consultant of CalyTeQ, a specialist quality engineering consultancy serving BFSI organisations. He has held senior roles at Deutsche Bank, Commerzbank, Fujitsu, EY, Sky UK, and IBM, leading performance engineering, test automation, and quality strategy programmes across the UK, Germany, and Luxembourg. CalyTeQ is his advisory practice, built to bring senior-level quality engineering expertise to BFSI organisations without the overhead of a large firm.
