This is an approximate transcript of the talk I delivered at TestBash in Sydney on Friday 19th October 2018.
Today I’d like to share the story of how we built up automated end-to-end testing at WordPress.com since I started at Automattic over 3 years ago.
I once read that it’s important to have three categories of hobbies for a meaningful life.
The first group of hobbies you should have is about making money and providing you income. This is important as we all need income to survive. My hobby around this is software testing and software quality – the reason I’m talking here today. I studied software engineering at university and have been interested in software quality ever since.
The second group of hobbies is about providing a creative outlet. We’re all creative individuals and it’s important we have ways to express our creativity. For me this has been about writing and blogging – I’ve been blogging for over a decade on both a personal and professional level. I’m also interested in street art and photography.
The final group of hobbies is about leading a healthy life: things that will keep you active and fit. For me this started out a few years ago as bush-walking/hiking, which has rapidly escalated into hiking to as many peaks as I can find – I’ve already done many of the highest peaks in South East Queensland.
Whilst the previous diagram showed these hobby groupings as clearly defined, in reality this isn’t the case – the lines are blurry and things overlap quite a bit. For example, my creative blogging has led directly to professional opportunities that I wouldn’t have otherwise had. Starting as a blogger on WordPress.com over a decade ago led me to working there full time today.
I work for Automattic. Automattic is an interesting company in that we’re fully distributed, meaning we have no offices – people can choose to live and work anywhere they want in the world. We currently have a diverse workforce of over 800 people in 69 countries speaking over 60 languages. We set our own schedules and manage our own work.
We create and support a variety of products including WordPress.com, WooCommerce, Jetpack, Simplenote and Gravatar.
WordPress.com is my primary focus as software quality practice lead: I lead a team called Flow Patrol, which looks after testing tools and technology, and a group of excellence wranglers embedded in product teams across Automattic.
A couple of years ago I read this book by Mark Manson. Has anyone here read it? I imagine so, since it’s a NY Times best seller. One of the key concepts – and it’s not unique to this book, it’s common across books on life – is the backwards law, or law of reversed effort: constantly wanting and desiring a positive experience is itself a negative experience, while acceptance of problems or less-than-ideal circumstances is in itself an overwhelmingly positive experience. This can be applied to your life, your career, your job and even a project you’re working on – expecting any of those to be free of problems is going to make you unhappy in the long run.
Another concept in the book is that you can never rid yourself of problems – solving one problem simply creates a new different problem.
Imagine for a moment that you have a regression testing problem where it takes 2 people in your team 2 weeks to manually regression test your product. You can solve this problem by creating 2000 automated end to end tests that run through a web browser. You’ve now created a new problem of maintaining 2000 end to end tests.
This will be the theme of my talk today – solving problems and creating new problems.
When I started at Automattic on WordPress.com 3 years ago, teams were already practicing ‘continuous delivery’ – and had been for almost a decade, before it was called ‘continuous delivery’ or you could buy a book called Continuous Delivery. That is, individuals deployed their own code to production multiple times per day.
This was working well, however we had a problem: customer flows were breaking in production and we might not know about it right away. Whilst we had a strong culture of ‘dogfooding’ – that is, consuming our own software, which we use to communicate internally – we weren’t exercising the critical sign up flows targeting new customers ourselves, as signing up isn’t a candidate for constant dogfooding.
We decided to implement some basic e2e test scenarios which would run only in production – both after someone deploys a change, and a few times a day to cover situations where someone changes a server outside of a deployment.
This was great as it gave us confidence for the first time that our customer sign up flows were working in production every time we made a change. But it introduced some new problems.
We use A/B testing extensively at WordPress.com to deliver new features to customers with confidence that they will be beneficial to our goals. For example, we currently have an active A/B test in production where group A customers are shown a sign up process that asks for website details first and customer details last, while group B customers enter their customer details first and website details last. We assign a percentage of traffic to each variation (say 50/50) and measure the conversion rate of each; the better-performing flow then becomes the default sign up flow. But these tests cause problems for our e2e tests, which are programmed to expect a defined sign up flow – so in this example our tests would fail 50% of the time.
What we did was create a way to override these tests in the browser – so, for example, our tests always see sign up group A – and then make sure that authors of new A/B tests are aware of what’s needed to let our e2e tests know about upcoming A/B tests. We do this using a GitHub webhook that looks for changes in a pull request and reminds the author to let the tests know about the change.
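The override mechanism can be sketched roughly as follows. This is a minimal illustration, not Automattic’s actual implementation: the `ab_override_` localStorage key convention and the test names are assumptions.

```javascript
// The variations our e2e tests are written against. Any active A/B test
// missing from this map would make test behaviour nondeterministic –
// which is why the GitHub webhook reminds A/B test authors to add one.
const pinnedVariations = {
  signupFlowOrder: 'siteDetailsFirst', // always see "group A"
  checkoutButtonColor: 'control',
};

// Build the localStorage entries a test would set before visiting a page,
// assuming the app checks "ab_override_<testName>" before assigning a group.
function buildAbOverrides(pinned) {
  return Object.entries(pinned).map(([testName, variation]) => ({
    key: `ab_override_${testName}`,
    value: variation,
  }));
}

const overrides = buildAbOverrides(pinnedVariations);
```

In a real WebDriver-style test these entries would be written into the browser (e.g. via an injected script) before loading the sign up flow, so the test always lands in the expected variation.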
With our e2e tests running in production and taking into account all the active A/B tests, we now had a problem that we still had to revert and redeploy changes when we introduced a problem. And with the way the tests were running, the author of a change wasn’t always aware of the problem they had already deployed to production. Plus the tests were taking too long to run.
We introduced parallelism to our test execution. We originally started parallelising our tests across Docker containers, but these are expensive as we pay for each container. So we also introduced process parallelisation within a Docker container to maximise the number of tests run per paid container – the containers are very powerful, and in our testing we’ve been able to run up to 12 headless Chrome processes in one container simultaneously.
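In-container parallelisation essentially means partitioning spec files across worker processes. Here’s a small sketch of one way to do that; the file names and worker count are illustrative, not our real suite.

```javascript
// Distribute spec files round-robin across N workers so each worker's
// load stays roughly even. Each bucket would then be handed to its own
// headless Chrome process, e.g. via child_process.spawn('mocha', bucket).
function partitionSpecs(specs, workerCount) {
  const buckets = Array.from({ length: workerCount }, () => []);
  specs.forEach((spec, i) => buckets[i % workerCount].push(spec));
  return buckets;
}

const specs = ['signup.js', 'publish.js', 'comments.js', 'themes.js', 'media.js'];
const buckets = partitionSpecs(specs, 3);
```

With 12 Chrome processes per container, the same scheme just uses a `workerCount` of 12, so a single paid container does the work that would otherwise need a dozen.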
We also introduced canary tests. Canary tests are a small subset of all the e2e tests – the name comes from the saying “canary in the coal mine”. In our case we highlighted two e2e tests as our canaries: signing up for a single new site, and publishing a piece of new content. These canary tests run as soon as code is merged into our master branch, but before we deploy the change, with a direct ping to the author who merged the code.
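One common way to carve a canary subset out of a full suite is tagging. This is a sketch under that assumption – the `@canary` tag convention and test names here are hypothetical, not our actual metadata.

```javascript
// A full suite where each test carries tags; canaries are just the
// tests tagged '@canary'.
const allTests = [
  { name: 'Sign up for a new site', tags: ['@canary', '@signup'] },
  { name: 'Publish a new post', tags: ['@canary', '@editor'] },
  { name: 'Change site theme', tags: ['@themes'] },
];

// Select the subset to run on merge-to-master, before deploy.
function selectByTag(tests, tag) {
  return tests.filter((t) => t.tags.includes(tag));
}

const canaries = selectByTag(allTests, '@canary');
```

The benefit of tagging over a hard-coded list is that promoting or demoting a canary is a one-line change on the test itself.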
The problem we had at this point is that whilst the canaries would alert us to a problem before we deployed, the author would still have to revert their change and merge and deploy that – since we always keep production in sync with the master branch.
And we couldn’t utilise docker container parallelisation locally.
We came up with a way to launch “live branches” for every pull request. As soon as someone pushes a change to a branch on GitHub, a fully self-contained environment is spun up using Docker, fully accessible via a URL which is automatically added as a comment to the PR.
This not only allows us to manually test changes on various devices, but it also allows us to run e2e tests against PRs. We started by automatically running our two canary tests against each PR as part of the peer review process – preventing broken user experiences being merged into the master branch.
At this stage we found the canary tests wouldn’t find all the problems on a PR. Whilst we encourage small, low impact PRs, sometimes it’s necessary to create one with a broader impact, such as upgrading our version of React, which impacts how every screen in our app is rendered.
We added a label to all GitHub pull requests which optionally allows a full set of e2e tests (around 30) to run against a live branch and report back the results to the PR.
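The label check itself is simple: the CI job looks at the PR’s labels and picks a suite. A minimal sketch, assuming a hypothetical label name (ours may differ) and the label array shape GitHub’s API returns:

```javascript
// Decide which e2e suite a pull request gets. Canaries always run;
// the full ~30-test suite is opt-in via a GitHub label.
function suiteForPullRequest(labelNames) {
  return labelNames.includes('run-full-e2e-suite') ? 'full' : 'canary';
}

// GitHub webhook payloads deliver labels as objects; extract the names.
function labelNamesFromPayload(payload) {
  return payload.pull_request.labels.map((l) => l.name);
}

const payload = {
  pull_request: { labels: [{ name: 'framework' }, { name: 'run-full-e2e-suite' }] },
};
const suite = suiteForPullRequest(labelNamesFromPayload(payload));
```

Making the full suite opt-in keeps feedback fast on small PRs while giving broad-impact PRs (like a React upgrade) the deeper coverage they need.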
Even with the canary tests and the optional full suite we’d still find that our app could be broken in two problematic browsers: IE11, which is our only supported IE version, and Safari 10, which is problematic as it’s tied to older versions of iOS that Apple prevents certain older hardware from upgrading beyond.
Our existing e2e tests run on headless Chrome browsers on Linux, so they don’t find these issues.
We re-used the existing canary tests that automatically run against our live branches and extended them to also run against IE11 and Safari 10, reporting the results back in a nice clean format.
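Conceptually this is a browser matrix applied to the same canary specs. A rough sketch – the capability objects loosely follow Selenium’s desired-capabilities shape, but the exact values and any grid/service configuration here are assumptions:

```javascript
// The browsers each canary run targets. Headless Chrome is the default
// CI run; IE11 and Safari 10 cover the browsers that kept biting us.
const canaryTargets = [
  { browserName: 'chrome', headless: true },
  { browserName: 'internet explorer', version: '11' },
  { browserName: 'safari', version: '10' },
];

// Human-readable label used when reporting results back to the PR.
function describeTarget(target) {
  return target.version
    ? `${target.browserName} ${target.version}`
    : target.browserName;
}

const reportLabels = canaryTargets.map(describeTarget);
```

The same two canary specs then loop over `canaryTargets`, and each result is posted back to the PR under its `describeTarget` label.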
Our final problem I’ll talk about today was the hardest one by far to solve, or ‘upgrade’. With everything I’ve talked about today we still had a case of people merging changes and breaking the e2e tests.
About six months ago I wrote an internal blog post explaining the situation and desperately asking for suggestions on how we could get people to stop breaking the e2e tests.
There were plenty of responses as comments on the post. Lots of suggestions. But we’d already tried every one of them! A week or so passed, then someone left another comment.
This comment wasn’t a suggestion about what we could do. It was a narrative, it was a story.
It was a story about how this person had recently merged a change that broke the e2e tests. They realised and went about working on a fix. But shortly afterwards, someone from my team, Flow Patrol, had already fixed the test and checked in the change, and the tests were passing again. The story highlighted that this was a huge missed opportunity: we were too good at our job.
So from that day we decided on a new approach. If someone breaks an e2e test, instead of us immediately checking in a fix – as long as the test isn’t one of our two critical canary tests – we disable the test, so our suite stays green, and ping the author: “hey, we noticed your change broke this test – would you mind taking a look at updating it? We’re here to help as much as you like.”
Immediately we saw an increase in people being involved and engaged in the tests, and a lot of people told us the tests were much easier to maintain than they had thought – some even described them as fun!
Today we have a smaller Flow Patrol team that looks after testing tools and technology. We also make sure we have an e2e view of quality across our products. We also have testers embedded into product teams building quality into our products and processes.
Our e2e tests continue to be helpful – particularly around larger impact changes such as dependency upgrades. We have no manual regression testing.
Today I talked about how avoidance of problems is a negative experience in itself: happiness comes from solving problems – and so do more problems.
When solving problems, think about what you can do over what you ‘should’ do. People will say you should do this, or you shouldn’t have e2e tests. Instead, think about what you can do to solve your own problems.
Importantly, tools can’t solve problems you don’t have. A wifi kettle doesn’t solve a problem. We spent time investing in visual regression testing when we didn’t have a visual regression problem, and we didn’t find many bugs with it because there was no problem to find. We wasted time and effort on this. Focus on your problems.
Continuous delivery is a buzzword but if you want to continually merge and deploy software you can’t rely on any manual regression testing. To make sure you have a great end to end customer experience you may want to invest in some automated e2e tests.
When thinking about solutions think in AND not OR. For a long time we were trying to work out whether to have embedded testers OR centralised testers in Automattic. But why not both? Why can’t we have embedded testers AND have a small central team of testers? Why do you have to be a tester OR a developer? Why can’t you be a tester AND a developer?
<loud, thunderous applause>