Fixing bugs in production: is it that expensive any more?

You’ve most likely seen a variant of this chart before:

[Chart: the relative cost of fixing a bug, rising from requirements through to production]

I hadn’t seen it for a while until yesterday, but it’s an old favourite of test managers and test consultants for justifying a lot of testing before releasing to production.

But I question whether it’s that accurate anymore.

Sure, in the good old days of having a production release once or twice a year it cost orders of magnitude more to fix a bug in production, but does it really cost that much more in the present age of continuous delivery/continuous deployment, where we release into production every fortnight/week/day?

If the timeline on the chart above is a year then of course bugs will cost more to fix, because presumably, if the project took a year to start with, you don’t have a very rapid software development process. And there are more likely to be requirements ‘bugs’ in production, because an awful lot happened during the year the requirements were being developed. Hence along came agile with its smaller iterations and frequent releases.

Mission critical systems aside, most web or other software applications we build today can be easily repaired.

Big waterfall projects, like building a plane, are bound to fail. The Boeing 787 Dreamliner was an epic fail. Not only did it suffer five delays and arrive years late, it had two major lithium-ion battery faults in its first 52,000 hours of flying, which caused months of grounding, no doubt affected future sales, and caused millions of dollars in damages. But it seems to have been well tested:

“To evaluate the effect of cell venting resulting from an internal short circuit, Boeing performed testing that involved puncturing a cell with a nail to induce an internal short circuit. This test resulted in cell venting with smoke but no fire. In addition, to assess the likelihood of occurrence of cell venting, Boeing acquired information from other companies about their experience using similar lithium-ion battery cells. On the basis of this information, Boeing assessed that the likelihood of occurrence of cell venting would be about one in 10 million flight hours.”

NTSB Interim Report DCA13IA037 pp.32-33

After months of grounding, retesting, and completely redesigning the battery system, the cause of the original battery failures is still unknown. If they can’t work out what the problem is after it has occurred twice in production, it’s not likely it could have been found or resolved in initial testing.

But most of us don’t work on such mission critical systems anyway.

And production fixes can be very easy.

Take this very different example: I provide support for a production script that uploads a bunch of files to a server. There was a recent issue where a file name had an apostrophe in it, which meant the file was skipped when it should have been uploaded.

Upon finding out about the problem I immediately looked at my unit tests. Did I have a unit test with a file name containing an apostrophe? No, I didn’t. I wrote a quick unit test – it failed, as expected. I made a quick change to the regular expression constant that matches file names to include an apostrophe, and reran the unit test, which passed. Yippee. I quickly reran all the other unit and integration tests and all passed, meaning I could confidently package and release the script. All of this was done in a few minutes.
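To make that concrete, here is a minimal sketch of what such a fix and its test might look like. The original script isn’t shown in the post, so the constant name FILE_NAME_PATTERN, the is_uploadable helper, the example file names, and the choice of Python are all assumptions for illustration only:

```python
import re
import unittest

# Hypothetical regex constant from the upload script. The original pattern
# did not allow apostrophes, so a file like "o'reilly_report.pdf" was skipped;
# the fix is simply adding the apostrophe to the character class.
FILE_NAME_PATTERN = re.compile(r"^[\w .'-]+\.[A-Za-z0-9]+$")


def is_uploadable(file_name):
    """Return True if the file name matches the pattern the uploader accepts."""
    return FILE_NAME_PATTERN.match(file_name) is not None


class FileNamePatternTest(unittest.TestCase):
    def test_file_name_with_apostrophe_is_uploadable(self):
        # The missing test: written first, it failed against the old pattern,
        # then passed once the apostrophe was added.
        self.assertTrue(is_uploadable("o'reilly_report.pdf"))

    def test_plain_file_name_is_uploadable(self):
        self.assertTrue(is_uploadable("quarterly report-2.csv"))


if __name__ == "__main__":
    unittest.main()
```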

I could possibly have prevented this by doing more thorough testing to begin with, but I am pretty sure that would have taken more effort than it did to fix the production bug by writing a test for it and repackaging the script. So for me there was no increase in cost whatsoever in finding that bug ‘late’.

Unless you’re working on mission critical software, shipping some bugs into production is almost always better than shipping no software at all. If you work on very small, frequent deployments into production, the cost of fixing bugs once they have gone live will only be marginally greater than trying to find every bug before you ship. The longer you spend making sure your requirements are 100% correct and everything is 100% tested, the more likely, ironically, your software is to be out of date, and hence incorrect, once you finally go live.

Author: Alister Scott

Alister is an Excellence Wrangler for Automattic.

6 thoughts on “Fixing bugs in production: is it that expensive any more?”

  1. Was it really that expensive before? The Leprechauns of Software Engineering casts some doubt on where the data for that curve came from.
    Agree with the main point of your post though :)


  2. I always interpreted the “cost of bug” graph as a mixture of the cost to fix and the loss caused by the bug (revenue, reputation, etc.), which is only relevant for PROD of course.
    Also, in the context of Continuous Delivery and trunk-based development, a severe PROD bug may interrupt normal development activity until the fix is released to PROD, hence incurring another cost hit.
    But overall I agree that nowadays production bugs are less costly and scary, because if you have done a good job upfront they are not severe, and good testing and deployment practices enable a safe fix to be pushed to PROD quickly.


    1. I originally thought that about the cost too, but reading more into the studies I found it was purely the cost of repair that was measured. That’s why I don’t agree with it.


  3. I agree with the thought that the cost of fixing bugs in production, especially on the web, is not an issue any more. Though the risk of fixing the bug still remains higher, as we are playing with a live production system, compared to if the same bug had been found in earlier rounds of testing.


  4. When I use this graph I do mention all related costs: investigation, re-assessing requirements, redesign, redevelopment, retesting, acceptance and implementation.
    Although continuous delivery makes it easier and faster to fix problems, it brings the risk that people don’t “think first” when designing and building an information system. And in my opinion it is still much more expensive to fix defects than to prevent them.
    So although continuous delivery does make fixing faster and thus a little cheaper, real improvement can only be achieved by making sure the quality of the application is sufficient, that is “fit for purpose” (not worse and not better).
    The quest for this optimum will probably keep us busy for the next couple of decades ;-) Continuous delivery is one of the achievements during this quest. And DevOps is a good help, since it stimulates collaboration, which is essential to reaching the quality level that is needed.

