The organisations experiencing the most production incidents are not the ones doing the least testing. They are often the ones doing the most — but measuring the wrong things. The gap between QA activity and release confidence is where incidents are born.
I have been working in software quality engineering since 1998. In that time, testing volumes have increased dramatically (more automation, more tooling, more dashboards), yet the frequency of significant production incidents has not fallen in step. In some BFSI organisations, it has increased.
This is not a coincidence. It is a structural problem. And it is one that senior technology leaders are increasingly aware of, even if they cannot always name it precisely. The symptom is a familiar one: your QA team delivers a green dashboard before every release, and yet your on-call engineers are still firefighting at 2am.
The cause is almost never a lack of testing. It is a lack of genuine confidence in what the testing means.
“More testing does not equal more confidence. It can equal more data — which is a very different thing.”
When a BFSI board or a CTO asks “are we ready to release?”, they are not asking for a test results summary. They are asking a specific risk question: what is the probability that this release causes a significant incident, and what is the business impact if it does?
Most QA processes are not designed to answer that question. They are designed to answer a different one: have we completed the agreed testing activities? That is a compliance question, not a risk question. Answering it correctly tells you whether the team did what it said it would do. It tells you almost nothing about whether the system is safe to release.
The distinction matters enormously in regulated financial services environments, where the cost of a Sev1 production incident is not just technical. It is operational, reputational, regulatory, and — in extreme cases — existential.
Across 28 years of quality engineering work at Deutsche Bank, Commerzbank, Fujitsu, EY, Sky UK, and the Scottish Government, the same patterns recur. BFSI production incidents are almost always traceable to one or more of these five structural failures:
DORA research consistently shows that organisations in the bottom quartile of software delivery performance have a change failure rate of 46% — nearly half of all production changes cause incidents requiring hotfixes or rollbacks. Elite performers sit at below 5%. The gap is not testing volume. It is the quality of the risk assessment that happens before each release.
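For readers who want to see how the metric is derived, the following is a minimal sketch of a change failure rate calculation. The `Deployment` record, its field names, and the sample history are hypothetical and exist only for illustration; a real calculation would draw on your own deployment and incident records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    release_id: str           # hypothetical identifier
    needed_remediation: bool  # True if a hotfix or rollback followed


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of production deployments that needed a hotfix or rollback."""
    if not deployments:
        raise ValueError("no deployments to measure")
    failed = sum(1 for d in deployments if d.needed_remediation)
    return failed / len(deployments)


# Illustrative data only: 20 releases, 3 of which needed remediation.
history = [
    Deployment(f"R{i}", needed_remediation=(i in {4, 11, 17}))
    for i in range(20)
]
rate = change_failure_rate(history)
print(f"Change failure rate: {rate:.0%}")
```

Note what the calculation does not take as input: the number of test cases executed. The metric is driven entirely by what happened after release.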
The emergence of AI-assisted testing tools has added a new dimension to this challenge. AI can generate test cases at a speed and volume that human testers cannot match. It can identify coverage gaps, suggest edge cases, and analyse test results with unprecedented speed.
All of this is genuinely useful. But it compounds the existing problem if the underlying framework is wrong. More AI-generated test cases, applied against a system without a robust risk-prioritised coverage strategy, produce more data that still does not answer the release confidence question.
The value of AI in quality engineering is not in generating volume. It is in accelerating the analysis that a senior engineer then interprets. The interpretation — the judgement about what the evidence means, and what risk it represents to a specific organisation releasing a specific system into a specific regulatory environment — remains a human responsibility.
This is not a popular thing to say in an industry that is deeply excited about AI. It is, however, accurate.
“AI in quality engineering accelerates the collection of evidence. It does not replace the judgement required to interpret it at board level.”
The shift required is not primarily a technical one. It is a framing shift — from quality as a delivery activity to quality as a risk management function. Three changes make the most practical difference:
A Sev1 production incident in a retail banking environment costs, conservatively, between £15,000 and £150,000 in direct costs — engineering time, incident management, regulatory reporting, customer remediation. The reputational cost is harder to quantify and typically larger.
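One way to turn that range into a per-release figure is to weight it by the estimated probability of an incident. The sketch below is a rough illustration, not a pricing model: the cost bounds come from the paragraph above, while the function name and the two probabilities are assumptions chosen purely for the example. It covers direct costs only and ignores reputational impact.

```python
# Direct-cost range for one Sev1 incident, taken from the text above (GBP).
SEV1_COST_LOW = 15_000
SEV1_COST_HIGH = 150_000


def expected_incident_cost(p_incident: float) -> tuple[float, float]:
    """Return the (low, high) expected direct cost of one release."""
    if not 0.0 <= p_incident <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return p_incident * SEV1_COST_LOW, p_incident * SEV1_COST_HIGH


# Hypothetical probabilities, chosen only for illustration.
for label, p in [("higher-risk release", 0.20), ("lower-risk release", 0.02)]:
    low, high = expected_incident_cost(p)
    print(f"{label}: GBP {low:,.0f} to GBP {high:,.0f} expected direct cost")
```

The arithmetic is trivial. The hard part, and the part that requires judgement, is arriving at a defensible probability for a specific release.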
A structured release risk baseline engagement costs a fraction of a single incident. An Executive Release Assurance assessment for a major release costs less than the overtime bill for the incident response team if that release fails.
The question is not whether senior quality engineering advisory is affordable. It is whether the cost of not having it is acceptable. In most cases, for most BFSI organisations, it is not.
Anthony has 28 years of quality engineering experience across Banking, Financial Services, Insurance, Government, and Enterprise technology. He has held senior roles at Deutsche Bank, Commerzbank, Fujitsu, EY, Sky UK, and IBM, leading performance engineering, test automation, and quality strategy programmes across the UK, Germany, and Luxembourg. CalyTeQ is his advisory practice — built to bring senior-level quality engineering expertise to BFSI organisations without the overhead of a large firm.
A structured quality baseline or Executive Release Assurance engagement gives you a clear, evidence-based picture of where you stand and what to do about it.