Prioritising Test Reliability over Perfection

If you saw my talk at GTAC last year, 'your tests aren't flaky', then you're probably aware of my view that flaky tests are actually indicative of broader application and systems problems, which we should address rather than simply making our tests less flaky.

But what if you're in a situation where you work with a system whose reliability you can't feasibly improve? Say you've got a domains page that should show a list of available domains, but because it relies on an external third-party service, it sometimes just shows nothing.


My original response is that you do all you can to make that page reliably show the expected result, but what if you can't ensure that? Do you fail the running test?

Originally I thought it was best to fail the test so we could see when we had a problem with domains, but the problems are so intermittent that failing wouldn't be an accurate reflection of whether the domains page is working.

I’m not a huge believer in the ‘retry’ approach with automated tests; I was bitten by a rather nasty bug previously which we had masked by automatically re-running our automated tests.

But what if we did an intelligent retry? Assume you get to this page and search for a domain name: as a user (or manual tester), what would you do if you didn't see any results? Well, if I were testing it I'd note the issue, and then either refresh the page or try searching for something else. Why can't our automated tests do just that?
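Here's a rough sketch of that idea. I'm using Playwright-style syntax purely for illustration; the helper name, selectors and retry count are all made up rather than our actual code:

```typescript
import { Page } from '@playwright/test';

// Search for a domain and, if nothing shows up, do what a human tester
// would do: refresh the page and try again (up to maxRetries more times).
export async function searchDomainsWithRetry(
  page: Page,
  query: string,
  maxRetries = 2 // hypothetical default
): Promise<void> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    await page.fill('[data-testid="domain-search"]', query); // illustrative selector
    await page.press('[data-testid="domain-search"]', 'Enter');

    const results = page.locator('[data-testid="domain-result"]');
    try {
      // Give the listing a short window to appear before deciding it's empty.
      await results.first().waitFor({ state: 'visible', timeout: 5000 });
      return; // results appeared, the test can carry on
    } catch {
      // Nothing showed up: refresh and try again, just like a manual tester would.
      await page.reload();
    }
  }
  throw new Error(`No domain results for "${query}" after ${maxRetries + 1} attempts`);
}
```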

Here's what we've now implemented for known 'trouble spots' in our application:

  1. Document the issue (in GitHub, for example) so that it's known and you have a reference # to it
  2. In the automated test, when testing the functionality of a known trouble spot, write code that automatically retries a certain number of times (typically once or twice) to get the desired outcome (in the case above, a domains result listing), referencing the issue you've raised
  3. Whenever the code has to retry, automatically log this via an API to our real-time chat application (Slack), where we have a dedicated channel for real-time notifications from running automated tests, referencing the issue raised for it (see the sketch after this list)
  4. Every now and then, review Slack to see how many times the problem has occurred. If it hasn't been happening any more, you can close the issue (and remove the retry code); if it's increased, you can leave a message on the issue with some information about how often it's occurring (to hopefully increase its importance!)
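And here's the sketch I mentioned for steps 2 and 3: whenever a known trouble spot has to retry, post a note to Slack via an incoming webhook. The environment variable, issue number and message format below are placeholders, not our actual setup:

```typescript
// Post a retry notification to the dedicated Slack channel via an incoming
// webhook, referencing the issue raised for this trouble spot.
async function notifyRetry(issue: string, description: string): Promise<void> {
  const webhookUrl = process.env.SLACK_TEST_WEBHOOK_URL; // hypothetical env var
  if (!webhookUrl) {
    return; // e.g. a local run with no Slack configured
  }

  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `:warning: Retried known trouble spot ${issue}: ${description}`,
    }),
  });
}

// Called inside the retry loop from the earlier sketch, e.g.:
//   await notifyRetry('#1234', 'Domain search showed no results, refreshing and retrying');
```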

Our Outcome is Greener Builds

Implementing this in about 11 places in our automated end-to-end tests means we have much, much more reliable test results, so we know when we have a true issue rather than another recurrence of a known one. We can track the progress of application improvements to flakiness without this impacting the overall view of our application's health.

What are your thoughts? Have you tried something similar?

Author: Alister Scott

Alister is an Excellence Wrangler for Automattic.

3 thoughts on “Prioritising Test Reliability over Perfection”

  1. Makes sense. Done that in some cases, especially when the outcome is time dependent, which in turn depends on the load on the system at the time, leading to test failures.

    Trick for me was to figure out how many times to retry which is acceptable before failing the test.


    1. Thanks for your comment Anand.

      Trick for me was to figure out how many times to retry which is acceptable before failing the test.

      We typically just retry once in my trouble spots, but in one particular area we have a longer timeout also.

