This is a talk I delivered at the Google Test Automation Conference (GTAC) on Tuesday 10th November at Google in Cambridge, Massachusetts.
I am going to be using the F-word a lot in this talk. Like, a lot. I apologize in advance if I offend. You know the F-word, don’t you? Also known as the F-bomb? We test engineers use it all the time. Yes, that one: ‘flaky’.
So last year I went to GTAC in Kirkland and like a lot of other people there I was shocked by the language. Every second talk was dropping the highly offensive F word. Over lunch we were jokingly calling it ‘FlakyTestCon 14’.
There were talks about how to reduce it, ways to minimize it and what to do if you have a flaky test.
Flaky tests are something that most of us experience every day in our lives. They cause no end of despair. And we should be afraid- very afraid. The concept of flaky tests has the potential to bring our testing profession down.
There are lots of solutions proposed around test design and test resilience, but we continue to suffer from this problem. The solutions don’t work.
As the f-word gets dropped more and more, people start to doubt whether ANY tests are reliable and whether testing is worth it at all. What if testers are flaky like their tests? We need to kill the term flaky.
So, I got back to work after the conference and I was sitting at my desk. I overheard a developer dismiss a failed build as: ‘oh that’s just a flaky test’.
I was shocked. Shocked and offended. The f word. An excuse. An excuse to ignore testing. People were talking about our work and using the f-word!
I swore to myself from that moment forward I would not dismiss issues as flakiness. I would stop calling our tests flaky. But if our tests aren’t at fault: what is? Our applications.
You see: flakiness implies it’s the test’s problem, and that no one needs to worry about it from an application development point of view. But what if it’s not the test’s problem? What if it’s your app’s problem?
Developers had started using the word flaky to describe ANY failing test – imagine how useful a flaky smoke alarm would be. It would go off quite often and you’d be like, that smoke alarm is flaky so just ignore it. You may as well not have the smoke alarm.
You may as well ignore all testing. Why have testers? Why have test engineers? …and just like that, our profession came under threat.
So how do we get to the bottom of flaky? What if you had a toaster sitting directly under the “flaky” smoke detector? And it was on a high setting? Is your smoke alarm still flaky? Or is the thing you’re measuring not quite right?
We need to get to the bottom of it- we need to look at the whole story and stop using the f-word. Let’s look at our applications!
So, not too long after this happened, we had another issue with inconsistent (‘flaky’) tests at my work.
We had a bunch of acceptance tests that gave us really good confidence that our app was ready to release to production (which we did very frequently). We wouldn’t release our app unless they all passed.
We had 500 tests and ran them in parallel in about 50 groups of 10 tests each, which meant the entire suite could run in as little as 10 minutes.
We wanted to get a build released and about 5 of the 50 groups failed on the first run. So we re-ran the failed groups and, I think, 1 group failed again, so we re-ran that group and it passed. The build was green, and we released it to production. We didn’t ask why they’d failed in the first place.
Shortly after releasing to production we started seeing customers losing sessions/crashing. But all our tests passed!
We had a closer look into these and found an obscure caching bug where the display mode (desktop/mobile) was being incorrectly cached and displayed for the wrong session.
This was killing our users’ sessions and affecting our sales! It wasn’t immediately obvious as it wasn’t happening for every customer (only about 10%).
5 groups out of about 50 is also 10%. Our tests found this bug but we ignored it! We ignored it until it didn’t happen anymore (by luck), but our tests knew there was an issue.
Trust the tests. Don’t re-run blindly (till you get the response you want): find the issue and fix it. Kill the possibility of flakiness.
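The arithmetic behind that story can be sketched in a few lines. This is a rough model, not data from our build: it assumes each of the 50 groups fails independently with a 10% chance on every run, and the function names are made up for illustration.

```python
from typing import List


def expected_failures(groups: int, failure_rate: float, rounds: int) -> List[float]:
    """Expected number of still-failing groups after each round,
    when only the groups that failed are re-run."""
    return [groups * failure_rate ** n for n in range(1, rounds + 1)]


def chance_of_green(groups: int, failure_rate: float, attempts: int) -> float:
    """Probability that every group passes at least once within `attempts` tries."""
    per_group = 1 - failure_rate ** attempts
    return per_group ** groups


if __name__ == "__main__":
    # Assumed numbers: 50 groups, 10% intermittent failure rate per group.
    print(expected_failures(50, 0.1, 3))
    print(chance_of_green(50, 0.1, 1))  # green on the very first run
    print(chance_of_green(50, 0.1, 3))  # green after re-running failures up to 3 times
```

Under those assumptions the expected failures per round come out at roughly 5, then 0.5, then 0.05, which is about what we saw. A fully green first run would happen only about 0.5% of the time, yet re-running failures up to three times gets you to green roughly 95% of the time. Re-running doesn’t remove the bug; it just hides it.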
In another situation for an app I was working on we were running some parallel tests, because it’s efficient and replicates user reality, when we encountered ‘flakiness’ or inconsistency in our test results.
This app happened to have a ‘feature’ (which we didn’t know about) where a second login would, in some cases, destroy the session created by the first. If two tests happened to run at the same time, the second test would make the first one fail.
Spending time trying to make either test more resilient or reliable would have been fruitless: the problem was built into our application design.
The app should have been testable and shouldn’t have killed the original session. This wasn’t meant to happen.
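Here is a minimal sketch of how that kind of ‘feature’ makes parallel tests interfere with each other. Everything in it is hypothetical: `SessionStore`, the `single_session_per_user` flag and the shared `test-user` account are my assumptions for illustration, not the real app.

```python
import uuid


class SessionStore:
    """Toy in-memory session store (hypothetical, not the app from the talk)."""

    def __init__(self, single_session_per_user: bool) -> None:
        self.single_session_per_user = single_session_per_user
        self._sessions = {}  # token -> username

    def login(self, username: str) -> str:
        if self.single_session_per_user:
            # The hidden 'feature': a new login evicts existing sessions for that user.
            self._sessions = {
                token: user for token, user in self._sessions.items() if user != username
            }
        token = uuid.uuid4().hex
        self._sessions[token] = username
        return token

    def is_active(self, token: str) -> bool:
        return token in self._sessions


def both_sessions_survive(store: SessionStore) -> bool:
    """Simulate two overlapping tests that log in as the same (assumed) user."""
    first = store.login("test-user")
    second = store.login("test-user")  # second test starts while the first is mid-flight
    return store.is_active(first) and store.is_active(second)


if __name__ == "__main__":
    print(both_sessions_survive(SessionStore(single_session_per_user=True)))
    print(both_sessions_survive(SessionStore(single_session_per_user=False)))
```

With the flag on, the first test loses its session whenever the second one logs in during its run, so it fails only when the two happen to overlap. No amount of retry logic in the tests changes that; the behaviour lives in the app.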
Rather than go hunting for the flaky test, we refused to believe in flakiness, and we discovered application quirks disguised as flakiness.
Faith in testing was restored: the tests revealed the real story. Thank goodness it wasn’t a flaky monster under our bed. We shone our torch (our trusty tests) on it: it was just some dusty bits of our app design all along.
Nothing to be afraid of.
We had a single-page internal web app which used dialogs upon dialogs upon pop-ups, with async server calls everywhere. Our test would dismiss a dialog, and the app would miraculously reopen it when a callback was later received. This would make our tests very flaky!
We could make our tests resilient enough to expect a random pop-up at any moment and dismiss it (I’ve seen this done). But why write such complicated tests: what’s the real issue?
Why should we have highly resilient tests for a clunky app? Shouldn’t we look at the reason behind the flake?
We did this and, guess what, our tests were suddenly passing consistently.
Flakiness: 0, application testability: 1.
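To make the dialog problem concrete, here is a small simulation. It is only a sketch: the `Dialog` class and the `guard_late_callbacks` flag are invented for illustration, and ignoring late callbacks is just one plausible app-side fix, not necessarily the one we shipped.

```python
class Dialog:
    """Toy dialog whose async server callback can arrive after it was dismissed."""

    def __init__(self, guard_late_callbacks: bool) -> None:
        self.guard_late_callbacks = guard_late_callbacks
        self.is_open = True
        self.dismissed = False

    def dismiss(self) -> None:
        self.is_open = False
        self.dismissed = True

    def on_server_response(self) -> None:
        """Callback fired when an async server call completes."""
        if self.guard_late_callbacks and self.dismissed:
            return  # already closed by the user (or the test): don't resurrect it
        self.is_open = True


def dialog_stays_closed(guard_late_callbacks: bool) -> bool:
    dialog = Dialog(guard_late_callbacks)
    dialog.dismiss()
    dialog.on_server_response()  # the response arrives after the dismissal
    return not dialog.is_open


if __name__ == "__main__":
    print(dialog_stays_closed(guard_late_callbacks=False))
    print(dialog_stays_closed(guard_late_callbacks=True))
```

Without the guard, whether the test passes depends on when the server responds. With the guard in the app, the test can stay simple, and real users stop seeing dialogs come back from the dead too.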
We worked on an app to order pizza. It was a multi-step flow and each step relied on state from a previous step. If we were to write tests that included each step and any combination of steps, there would be lots of possibilities for flakiness and inconsistency.
We chose to write a test-specific controller (which was embedded as part of our application) to instantly display any page in the app with any state you require. No navigating pages. No constraints around time etc.
If you can set up a single URL with everything you need to test some functionality, this means your tests can be focused on testing, avoiding flakiness by avoiding navigation & layers of setup and options.
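A test-specific controller along these lines could be sketched as below. Treat it as an assumption-laden illustration: the `/__test__/render` path, the `page` and `state` query parameters, and both function names are hypothetical, not our actual implementation.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse


def build_test_url(base_url: str, page: str, state: dict) -> str:
    """Test side: build one URL asking the (hypothetical) controller for `page` with `state`."""
    query = urlencode({"page": page, "state": json.dumps(state, sort_keys=True)})
    return f"{base_url}/__test__/render?{query}"


def handle_test_request(url: str) -> dict:
    """App side: decode which page to show and which state to show it with."""
    params = parse_qs(urlparse(url).query)
    return {"page": params["page"][0], "state": json.loads(params["state"][0])}


if __name__ == "__main__":
    url = build_test_url(
        "http://localhost:8000",
        "payment",
        {"pizzas": ["margherita"], "delivery": True},
    )
    print(url)
    print(handle_test_request(url))
```

A test for the payment step can then open that one URL and start asserting straight away, with no clicking through earlier steps. An endpoint like this would normally be switched off in production builds.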
We built testability well and truly into our app. The integration of testing into system development saved us from ‘flakiness’ as well as complicated after-the-fact test writing.
Don’t be afraid to add testability-specific features to your app that don’t serve a functional purpose. I recently had to get new tyres on my car and realized that a lot of tyres have testability features called tread indicators. These don’t serve a functional purpose: the tyre doesn’t need them to operate, but they make it possible for anyone to quickly and easily test the tread of the tyre without any complicated measurement instruments.
Why don’t you consider investing in similar testability features for your apps?
These are two of the most annoying messages and pop-ups I see on Internet sites I use. These are particular to sites that don’t cater well for browser navigation. Tests for these sites can be flaky as it’s really hard to handle these pop-ups.
They also cause fear in our users- has it taken my money? Have I double booked? How long should I look at this screen before refreshing? Cue user panic.
What if we built our apps so we didn’t have these problems? Our tests wouldn’t need to cater for these and we’d have a better user experience.
How? Design and build your app with testability in mind: an integrated team of developers and testers throughout the whole process is ideal.
A testable app is a usable app. Usable apps are not flaky.
So you might be asking- what can I do to help kill flaky? We all have a role to play in restoring faith in our profession and pushing flaky into extinction.
1. don’t blindly re-run tests
- If you roll a dice enough times it’ll give you the number you want, but is that what happens when a real user rolls the dice? No, this isn’t realistic.
- We need to look at why it failed the first time: why didn’t you roll a 6 straight away? There’s a reason you didn’t roll a 6 (it’s a dice after all), so maybe you need to use a different approach altogether.
2. use ‘flaky’ tests as insights into your apps
- Flaky tests are not useless: they’re telling you something, you just need to work out what that something is.
- Be the ‘test whisperer’ and decode it
- You’ll be rewarded with a secret about your app or an area for enhancement
3. build testability into your apps
- Flakiness comes from ‘after-the-fact testing’. We are not an afterthought: testers and test engineers should be part of application design teams, and we need to consider and build testing and testability into our application designs
- Efficiency, effectiveness and testing confidence come from this strong base
I’ll finish with a final message to flaky tests:
“What I do have are a very particular set of skills, skills I have acquired over a very long testing career. Skills that make me a nightmare for flaky tests like you. I will look for you, I will find you, and I will kill you”
~ Liam Neeson, Test Engineer