Fixing bugs in production: is it that expensive any more?

You’ve most likely seen a variant of this chart before:

[Chart: bug fix costs]

I hadn’t seen it for a while, until yesterday, but it’s an old favourite of test managers/test consultants to justify a lot of testing before releasing to production.

But I question whether it’s that accurate anymore.

Sure, in the good old days of having a production release once or twice a year, it cost orders of magnitude more to fix a bug in production. But does it really cost that much more in the present age of continuous delivery/continuous deployment, where we release into production every fortnight/week/day?

If the timeline on the chart above is a year then of course bugs will cost more to fix, because presumably, if the project took a year to start with, you don’t have a very rapid software development process. And there are more likely to be requirements ‘bugs’ in production, because an awful lot will have changed in the year that the requirement was being developed. Hence along came agile with its smaller iterations and frequent releases.

Mission critical systems aside, most web or other software applications we build today can be easily repaired.

Big waterfall projects, like building a plane, are bound to fail. The Boeing 787 Dreamliner was an epic fail. Not only did it suffer five delays and arrive many years late, it also had two major lithium-ion battery faults in its first 52,000 hours of flying, which caused months of grounding and have no doubt affected future sales, causing millions of dollars in damages. But it seems to have been well tested:

“To evaluate the effect of cell venting resulting from an internal short circuit, Boeing performed testing that involved puncturing a cell with a nail to induce an internal short circuit. This test resulted in cell venting with smoke but no fire. In addition, to assess the likelihood of occurrence of cell venting, Boeing acquired information from other companies about their experience using similar lithium-ion battery cells. On the basis of this information, Boeing assessed that the likelihood of occurrence of cell venting would be about one in 10 million flight hours.”

NTSB Interim Report DCA13IA037 pp.32-33

After months of grounding, retesting, and completely redesigning the battery system, the cause of the original battery failures is still unknown. If they can’t work out what the problem is after it has occurred twice in production, it’s not likely it could have been found or resolved in initial testing.

But most of us don’t work on such mission critical systems anyway.

And production fixes can be very easy.

Take this very different example: I provide support for a production script that uploads a bunch of files to a server. There was a recent issue where a file name had an apostrophe in it, which meant the file was skipped when it should have been uploaded.

Upon finding out about the problem I immediately looked at my unit tests. Did I have a unit test with a file name with an apostrophe? No, I didn’t. I wrote a quick unit test and it failed, as expected. I made a quick change to the regular expression constant that matches file names so that it includes an apostrophe, then reran the unit test, which passed. Yippee. I quickly reran all the other unit and integration tests and all passed, meaning I could confidently package and release the script. All of this was done in a few minutes.
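To make that workflow concrete, here is a minimal sketch in Python of what such a fix might look like. The real script and its regular expression aren't shown here, so the names OLD_FILE_NAME_PATTERN, FILE_NAME_PATTERN and should_upload, and the patterns themselves, are made up purely for illustration.

```python
import re
import unittest

# Hypothetical patterns: the real script's regular expression is not shown,
# so both of these are assumptions made for illustration only.
OLD_FILE_NAME_PATTERN = re.compile(r"[\w\- .]+")   # before the fix: no apostrophe
FILE_NAME_PATTERN = re.compile(r"[\w\-' .]+")      # after the fix: apostrophe allowed


def should_upload(file_name: str, pattern: re.Pattern = FILE_NAME_PATTERN) -> bool:
    """Return True when the whole file name matches the allowed pattern."""
    return pattern.fullmatch(file_name) is not None


class FileNameTests(unittest.TestCase):
    def test_plain_name_is_uploaded(self):
        self.assertTrue(should_upload("report-2013.txt"))

    def test_name_with_apostrophe_is_uploaded(self):
        # The new regression test: written first, it fails against the old pattern.
        self.assertTrue(should_upload("bob's notes.txt"))

    def test_old_pattern_skipped_apostrophe(self):
        # Documents the production bug: the old pattern rejects the same name.
        self.assertFalse(should_upload("bob's notes.txt", OLD_FILE_NAME_PATTERN))

    def test_name_with_slash_is_still_skipped(self):
        self.assertFalse(should_upload("bad/name.txt"))
```

Saved as a test module, this can be run with python -m unittest. The point is that the failing test comes first, the one-character change to the constant comes second, and the rest of the suite guards against the change breaking anything else.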

I could possibly have prevented this by doing more thorough testing to begin with, but I am pretty sure that would have taken more effort than it took me to fix the production bug by writing a test for it and repackaging the script. So for me there was no increase in cost whatsoever from finding that bug ‘late’.

Unless you’re working on mission critical software, shipping some bugs into production is almost always better than shipping no software at all. If you work on very small, frequent deployments into production, the cost of fixing bugs once they have gone live will only be marginally greater than the cost of trying to find every bug before you ship. Ironically, the longer you spend making sure your requirements are 100% correct and everything is 100% tested, the more likely your software is to be out of date, and hence incorrect, once you finally go live.