
Why do we test software?

This is a short-form version of some thoughts I posted internally at Monzo.

And in particular, why is it something we should do, given all of the other things we must do?

I define correctness as the ability of software to meet the contract it promises to others, both in sickness and in health. Often, software does not meet such promises. Ensuring correctness is important: especially so at the lowest levels of the system. If we ensure the correctness of the systems below, we can build more powerful abstractions above. Good abstractions are incredibly powerful - it's the very reason we write banks in Go, not assembly.
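To make "the contract it promises others" concrete: in Go, a contract often lives in an interface and its doc comment. Here is a minimal sketch, assuming a hypothetical Store interface (not a real Monzo API) - the point is that the promises are written down, and so can be tested:

```go
package store

import (
	"context"
	"errors"
)

// ErrNotFound is returned by Get when no value has been Put for the key.
var ErrNotFound = errors.New("store: key not found")

// Store persists key/value pairs. Its contract, in sickness and in health:
//
//   - Put returns nil only once the value is durably written.
//   - Get returns the most recent value Put for key, or ErrNotFound.
//   - Both are safe to call concurrently.
//
// Code layered on top can rely on these promises without re-verifying them.
type Store interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) ([]byte, error)
}
```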

At Monzo, we do not build services: we build systems. A system at Monzo is a collection of distributed services performing some function, usually with side effects internally (like writing to a database) or externally (like sending a payment over Faster Payments). The network - the damn network - limits our ability to reason about such side effects: for example, a payment scheme can acknowledge our request and send the money, yet we never receive the response, because a hamster chewed through the internet cable at the exact moment the acknowledgement was sent back to us. We must build systems that are resilient to such problems. They are inevitable.
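A minimal sketch of that ambiguity in Go - the scheme endpoint and Idempotency-Key header here are hypothetical, not a real scheme API. When the request times out, we cannot know whether the money moved, so the operation must be made safe to retry:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// sendPayment submits a payment tagged with an idempotency key, so that
// retrying after a lost acknowledgement cannot move the money twice.
func sendPayment(idempotencyKey string, body []byte) error {
	client := &http.Client{Timeout: 5 * time.Second}

	req, err := http.NewRequest(http.MethodPost,
		"https://scheme.example.com/payments", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Idempotency-Key", idempotencyKey)

	resp, err := client.Do(req)
	if err != nil {
		// Ambiguous: the scheme may or may not have processed the payment.
		// Retrying is only safe because the key makes the request idempotent.
		return fmt.Errorf("payment outcome unknown, retry with same key: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 500 {
		return errors.New("scheme error: retry with the same key")
	}
	return nil
}

func main() {
	if err := sendPayment("payment-123", []byte(`{"amount": 100}`)); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```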

But thinking about the system as a whole is a fool's errand: it's simply too big, too complex, and changes too often to keep in your head. System complexity quickly overwhelms our ability to reason about the direct and second-order effects of changes, especially if those changes are large. Thus, when we make changes, we endorse shipping small and often, leading to small pull requests that alter a specific area of the code or system. And at some point, ensuring we can spot and quickly remediate when things go wrong becomes more important than proving correctness offline.

When talking about correctness, testing instantly comes to mind as a strategy to achieve it. Testing often means many different things to different people. In this document, testing is defined as an umbrella term for all of the strategies you can use to ensure correctness in your system. Unit testing, integration testing, smoke testing, randomised testing, fuzzing, Jepsen, coherence/repair services, etc. - all together form 'testing'. No single approach to testing is a panacea. For example: within distributed systems, testing code paths directly through unit tests does not suffice; the system as a whole must be tested. I assume all services that failed a Jepsen test passed their unit and integration tests.
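As one rung on that ladder, here is a sketch of a randomised test in Go: it checks an invariant over many generated inputs rather than a single hand-picked case. The Ledger type is a hypothetical stand-in for any stateful code:

```go
package ledger

import (
	"math/rand"
	"testing"
)

// Ledger is a toy account balance; Apply adds a (possibly negative) delta.
type Ledger struct{ balance int64 }

func (l *Ledger) Apply(delta int64) { l.balance += delta }

// TestLedgerBalanceInvariant applies random sequences of deltas and asserts
// the one property we promise: the balance is the sum of applied deltas.
func TestLedgerBalanceInvariant(t *testing.T) {
	rng := rand.New(rand.NewSource(42)) // fixed seed keeps failures reproducible

	for run := 0; run < 1000; run++ {
		var l Ledger
		var want int64
		for i := rng.Intn(100); i > 0; i-- {
			delta := rng.Int63n(2000) - 1000
			l.Apply(delta)
			want += delta
		}
		if l.balance != want {
			t.Fatalf("run %d: balance = %d, want %d", run, l.balance, want)
		}
	}
}
```

Go's standard library also ships testing/quick for this style of test; the idea scales up to fuzzing and, at the system level, to tools like Jepsen.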

But why test at all? The sceptic would say that time spent testing software is time we could spend writing new features - why not trade off correctness for growth, given that VC-backed companies are 'default-dead'?

Dan Luu summarises my thoughts well:

One of the famous Intel CPU bugs allegedly comes from not testing something because they "knew" it was correct. No matter how smart or knowledgeable, humans are incapable of reasoning correctly all of the time. [...] Testing may not be sufficient to find all bugs, but it can be sufficient to achieve better reliability than pretty much any software company cares to.

In summary, proving correctness in all parts of a complex system is near impossible, and we shouldn't aim for it. But in my experience, good strategies for ensuring correctness vastly increase the probability of achieving it. The effects of correctness compound over time: engineers feel more confident making changes, adding further testing becomes easier due to prior art, and so on. Poorly tested, flaky software has negative compounding effects: engineers become scared to touch things, old cruft lies around for too long because it's impossible to tell who depends on it, and knowledge calcifies in the veteran engineers who know what to worry about and what not to.

A focus on correctness, coupled with the ability to spot when things are not correct, begets a significant increase in medium- and long-term speed.