AMA: why flaky tests for GTAC?

dmcnulla asks…

I enjoyed your GTAC presentation on “your tests are not flaky”. How did you pick your topic?

My response…

Firstly, thanks for the compliment; you’re very kind.

When I came up with the topic I was working on a system where we were practicing continuous delivery, doing frequent production releases. As we released more often, the business came to expect it, and the reliability of our automated tests became ever more important. We wouldn’t release on a failed build, since we were working on a high-volume eCommerce site where a small bug could cause an outage costing a very large amount of revenue. We didn’t have a team of testers to fall back on for manual regression testing, so we were 100% dependent on our automated tests.

Even though we were clever about building testability into our system, we still had too many full-stack automated tests that produced non-deterministic results.

I believe everyone looks at the same thing slightly differently, since each of us sees the world through a unique lens, and every lens differs to some degree:

“Each of us tends to think we see things as they are, that we are objective. But this is not the case. We see the world, not as it is, but as we are—or, as we are conditioned to see it. When we open our mouths to describe what we see, we in effect describe ourselves, our perceptions, our paradigms.”

~ Stephen R. Covey

As the people developing and maintaining the tests, we treated our non-deterministic tests as the tests’ fault. What we didn’t do was look through another lens and see that the fault could actually lie in the system as we had built it.

The aha! moment struck when we released a bad build to production: it had ‘passed’ automated QA only because someone had re-run our automated tests a number of times until they passed.

We were blinded by perceived ‘test flakiness’: we refused to believe our problems were something else, so I thought it would be a good topic to present. From the feedback I received both at and after the event, it seems I am very much not alone.


Running Automated Tests with A/B Testing

Like a lot of modern, data-driven sites, WordPress.com uses A/B testing extensively to introduce new features. These tests may be as simple as a label change or as complex as changing the entire sign-up flow, for example by offering a free trial.

Since I have been working on a set of automated end-to-end tests for WordPress.com, I have found A/B testing to be problematic for automated testing on this very fast-moving codebase, for two main reasons:

  1. Automated tests need to be deterministic: a randomised A/B experiment means the first test run may get an entirely different sign-up flow from the second, which is very hard to automate; and
  2. Automated tests need to know which experiments are running, otherwise they may randomly encounter unexpected behaviour.

What we need are two methods to deal with A/B tests when running automated tests:

  1. We need to be able to see which A/B tests are active and compare this to a known list of expected A/B tests, so that we don’t suddenly encounter unexpected or random behaviour in some of our test runs; and
  2. We need to be able to set the desired behaviour to the control group, so that our tests are deterministic.

Different sites conduct A/B testing using different tools and approaches; WordPress.com uses HTML5 local storage to set which A/B tests are active and which group the user belongs to.
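
For illustration, the stored entry might look something like this when inspected from a browser console. The ABTests key name comes from the snippets below; the test name and variation group here are hypothetical:

    // Read the raw entry: a JSON string mapping each active
    // A/B test to the group this browser has been assigned.
    window.localStorage.ABTests;
    // e.g. '{"flow":"freeTrial"}' (hypothetical test and variation group)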

Luckily, it’s easy to read and update local storage using WebDriver and JavaScript. Our approach is twofold (see the sketch after this list):

  1. Each time a page object is initialised, a call on the base page model checks which A/B tests are active, using something like return window.localStorage.ABTests; and compares this to the known list of A/B tests, which is checked in as a config item. The test fails if a new A/B test has been introduced that isn’t in the known list. This is better than not knowing about the A/B test and failing on some non-deterministic behaviour.
  2. When a new A/B test is introduced and we wish to ensure our automated tests always use the control group, we can set this using a similar method, window.localStorage.setItem('ABTests','{"flow":"default"}'); and then refresh the page.
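
Putting those two together, here’s a minimal sketch of both methods using the selenium-webdriver package for Node.js. The ABTests key and the '{"flow":"default"}' control value come from the snippets above; the knownABTests config list, the function names, and the signup URL are illustrative assumptions, not the actual WordPress.com test code:

    const { Builder } = require('selenium-webdriver');

    // The A/B tests we know about, checked in as a config item
    // alongside the test code (illustrative assumption).
    const knownABTests = ['flow'];

    // Method 1: called each time a page object is initialised.
    // Reads the active A/B tests from local storage and fails fast
    // if any active test isn't in the known list.
    async function assertNoUnknownABTests(driver) {
      const raw = await driver.executeScript('return window.localStorage.ABTests;');
      const active = Object.keys(JSON.parse(raw || '{}'));
      const unknown = active.filter((name) => !knownABTests.includes(name));
      if (unknown.length > 0) {
        throw new Error(`Unknown A/B test(s) active: ${unknown.join(', ')}`);
      }
    }

    // Method 2: pin the known tests to their control groups, then
    // refresh so the page picks up the overridden values.
    async function setControlGroups(driver) {
      await driver.executeScript(
        'window.localStorage.setItem(\'ABTests\', \'{"flow":"default"}\');'
      );
      await driver.navigate().refresh();
    }

    // Example usage against a hypothetical signup page.
    async function example() {
      const driver = await new Builder().forBrowser('chrome').build();
      try {
        await driver.get('https://wordpress.com/start');
        await setControlGroups(driver);
        await assertNoUnknownABTests(driver);
        // ...continue with the usual page object steps...
      } finally {
        await driver.quit();
      }
    }

Because the check lives on the base page model, every page object gets it for free, so a brand-new A/B test surfaces as a clear, consistent failure rather than as intermittent flakiness.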

Ideally we would know about and plan for every A/B test in our automated e2e tests, but since this isn’t possible, checking against known A/B tests and pinning the control groups means our automated tests are at least more consistent and deterministic, and fail faster and more predictably when a new A/B test has been introduced.

How do you deal with non-determinism with A/B tests?