Why is building machine learning systems so hard?

Building machine learning systems in 2019 feels like stacking together Lego. Except, you have to construct all the Lego pieces from scratch. In the dark.

Setting the scene

This essay starts by defining what I mean by a machine learning system. I then discuss why machine learning systems have high essential complexity: complexity that cannot be reduced, and is inherent to the system. Given this complexity, I then provide arguments for dimensions that can increase or decrease system complexity. I conclude by musing about the future of building these systems in an applied context.

'Why does this guy feel qualified to talk about machine learning systems?', you might be thinking. I spent 4 years at Ravelin building a real-time fraud detection system with a bunch of very smart and friendly people: please go and work there if you're also smart and friendly. The product had machine learning models at the core, and was surrounded by a significant amount of infrastructure required to enable and support them. The system was complex, but was necessary to achieve our great performance. I saw first hand what is required for machine learning to succeed, and debugged the many strange ways that the system can fail. I now work at Monzo, where I work closely with the Machine Learning team.

My experience is building applied machine learning systems: systems that intervene in the real world to help make some decision and take some action. Academic developments are interesting, but I'm quite far away from that field. I am a pragmatic fellow, and my passion is in building things that provide value to people. Please view the rest of this essay with that lens. Let us begin.

What is a machine learning system?

Imagine a box, where some data goes in, and a prediction comes out. This is usually a probability of the data belonging to some class (classification), or some estimate of a continuous value (regression). This prediction can be used by itself, or as an input into another product or system. The 'system' is the model that generates the prediction, plus the layers of infrastructure on top needed to give the model its input data, interact with it, and make it useful and safe. The surrounding infrastructure varies in form: monitoring systems, feature serving services, A/B testing, services to serve the model, tools to introspect the model, shadow model systems to compare new models - the list goes on. The model itself usually forms a very small part of the system; say 5-10%.

OK, that's what a machine learning system is, but why would I use it? Given that current media discussion of artificial intelligence badly misrepresents what most machine learning systems in business do, I'll attempt to reify it:

Machine learning is ever-improving automation of some task through historical pattern recognition learnt from data, at next-to-zero marginal cost per execution of that task.

Let me unpick the parts of that statement:

The most valuable technology companies of the moment - Facebook, Google, et al - are machine learning companies at their core. Yet, these systems are complex to build, operate and understand. Why is that?

Deterministic vs. probabilistic

To build a machine learning system is to accept a quite different way of developing and running software. This way is more complex, and there's not a whole lot you can do about it.

Traditional software is deterministic. Think state machines, functions, unit tests. Given some input, we expect the same output, and we can prove when our algorithm is incorrect with a few simple tests. We built these things: we can usually debug them. There are obviously exceptions to this heuristic in complex systems - think CPUs, or distributed systems - but on the whole, the system is usually transparent.

Not so much with machine learning. These systems are probabilistic programs learnt from data. They may update their behaviour in realtime with new information. Given some input, we may expect different output on each subsequent execution. Take recommender systems. If I watch a single episode of Parks and Recreation, and re-watch it, the system will be more likely to recommend me more Parks and Recreation than it had been the first time.

We encourage the program to give the right answer by giving it examples of the patterns we wish to spot. But, we cannot demand that it does so. Our main lever of improvement is through training data. We can improve the quality of it through better labelling or cleaner data, we can get more of it somehow, or we can extract better signal from it through feature engineering. Bets made here are less certain about whether they will pay off. It can also be very expensive to acquire more or better data. For example, imagine building a system to predict whether a customer will default on a mortgage you give them. You need to lose tens of thousands of pounds to gain a single example of a customer that defaulted on their mortgage.

Deterministic systems are directed. Probabilistic systems are encouraged. This difference leads to a significant increase in irreducible system complexity. It becomes hard to reason about the worst case behaviour of your system (which is why machine learning models are almost always surrounded by rules). It is challenging to see where improvements in performance may come from, compared to running CPU profiles on a deterministic system. You become an engineer of a dataset, as opposed to an algorithm.
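To illustrate what 'surrounded by rules' can look like, here is a minimal sketch. The function name, the thresholds and the three outcomes are all hypothetical, chosen only for illustration; a real system would have many more rules. The point is that the deterministic rule bounds the worst case regardless of what the model says.

```python
def decide_transaction(amount: float, model_score: float,
                       hard_limit: float = 10_000.0,
                       score_threshold: float = 0.8) -> str:
    """Wrap a probabilistic score in deterministic rules (hypothetical values)."""
    # Rule first: very large transactions always go to manual review,
    # whatever the model thinks. This is easy to test and reason about.
    if amount > hard_limit:
        return "review"
    # Only then do we defer to the learnt program.
    if model_score >= score_threshold:
        return "block"
    return "allow"


print(decide_transaction(amount=50.0, model_score=0.95))      # block
print(decide_transaction(amount=50_000.0, model_score=0.01))  # review
```

The rule can be unit tested like any traditional software; the model score cannot.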

Let me underline the point: machine learning systems are programs learnt from data. If we learn the program from data, then the performance and behaviour of the program isn't totally within our control. The program comes with all the warts, inaccuracies and biases inherent in data. Never forget it.

The problem with building time machines

Any machine learning model requires some input data in order to generate a prediction. These might be images of cats, user content in text form, or numbers that you calculate such as how many purchases a user made in the previous 24 hours. The input data can be raw - for vision or language models - or a vector of numbers that represent some signal. The latter is more commonly used in 'business' applications of machine learning: businesses tend to have large amounts of well-structured data.

I believe the single biggest determinant of system complexity is the form of input data that your model requires.

Here follow some questions to ask of any system you build:

If you answered no to all questions, I am forever jealous. If you answered yes to all questions: congratulations! You're going to spend a long time building a mightily complex machine learning system. See you in a year or so.

System complexity increases when the infrastructure must construct the features. Dramatically so when these features contain aggregated, realtime data. But why, I hear you ask? Let's go through an example.

Imagine we're building a machine learning system to spot credit card fraudsters amongst genuine customers. First, we need to decide at which time point we want to extract a set of features for training. In this instance, we'll go with generating a feature vector every time the user attempted to make a transaction.

Second, we need to decide what features our model is going to use. We don't have the luxury of having models that can build their own features like language models. Thus, we're going to engineer features from actions that that customer has taken, like attempting a transaction, or adding a card to their account. These features will appear as a vector of numbers, and we'll train models on them. These features will appear in two categories:

In this toy example, we'll count the number of vowels in the user's email, and the number of cards the user added to their account in the 3 hours before the transaction's timestamp.

Unfortunately, to support the latter feature, we need to build a time machine. For each customer, and for each transaction, we want to emit a set of features that only takes into account the information we would have known at the timestamp of the transaction. If we use the information we know as of today, then this causes data leakage, and will ruin the performance of the model in production. Building this time machine is hard, and requires high quality data and extensive testing to ensure the system is correct.
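A minimal sketch of the two toy features, assuming card-added events are available as a plain list of timestamps (the function names and the example data are hypothetical). The important detail is the upper bound on the window: events at or after the transaction's timestamp must be excluded, otherwise the feature leaks the future.

```python
from datetime import datetime, timedelta

VOWELS = set("aeiou")


def count_vowels(email: str) -> int:
    """Static feature: needs no historical state."""
    return sum(1 for ch in email.lower() if ch in VOWELS)


def cards_added_before(card_added_at, as_of, window=timedelta(hours=3)):
    """Point-in-time feature: cards added in [as_of - window, as_of)."""
    return sum(1 for t in card_added_at if as_of - window <= t < as_of)


def feature_vector(email, card_added_at, transaction_at):
    return [count_vowels(email), cards_added_before(card_added_at, transaction_at)]


card_events = [
    datetime(2019, 1, 1, 9, 30),
    datetime(2019, 1, 1, 11, 30),
    datetime(2019, 1, 1, 13, 0),  # happened AFTER the transaction below
]
txn_time = datetime(2019, 1, 1, 12, 0)

features = feature_vector("alice@example.com", card_events, txn_time)
print(features)  # [7, 2]

# The leaky version: counting every card we know about today gives 3,
# a value the live system could never have seen at 12:00.
leaky_count = len(card_events)
```

This is trivial for one customer in memory. The hard part is doing it for every customer and every transaction, over late-arriving and corrected data, and getting the same answer offline as the live system computed at the time.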

You do have an escape hatch. You can avoid having to build this time machine by logging the features that your live system emits, and training on those. But, this kills development velocity. You need to write code to engineer a feature, deploy it, wait for a few weeks for enough training data to come through, and then train on the output. Your new feature may be useless, and it's taken you a month to figure that out. It also has other problems: your model may exploit the fact that your feature only started appearing a few weeks ago to exploit a time-dependent false pattern. Perhaps there's a lower rate of fraud in the past few months, and your model uses your new feature as a proxy to this.

Time snapshotted, aggregated features are very hard to get right, and introduce a large amount of complexity. More so when you need to calculate and serve them in realtime with total accuracy. Doing this correctly and efficiently was one of the most significant engineering challenges at Ravelin. Know what you're getting yourself into. Avoid historical state if at all possible, and prefer models that can engineer their own features.

Feedback loops

We generally build machine learning systems to intervene in the real world and make decisions on our behalf. If these decisions are useful, you should assume there will be a feedback loop: some side effect of making a decision that may have unintended consequences on the future performance of your system. In some machine learning systems, your future performance is a function of your present performance.

Imagine a spam detection system. The very point of its existence is to prevent spam. By intervening and blocking spam, the system sees fewer cases of spam. Hooray - the world is a marginally better place.

But, not so much for the next spam model, and by extension, you. If you train a new model, you're training it on the hardest cases of spam - the spam that the live system didn't manage to spot. Over time, this type of spam will represent an ever larger fraction of your training set, and your model will learn to spot this type of 'hard spam'. But, it will forget the easier type of spam that the current model is spotting, as the easy spam represents an ever smaller percentage of your training set. Your new model starts missing obvious cases, like Viagra adverts, or emails from Nigerian princes. Performance degrades.
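Some back-of-the-envelope arithmetic shows how quickly this skew appears. The numbers below are invented purely for illustration: suppose 90% of incoming spam is 'easy' and 10% is 'hard', and the live model catches 99% of the easy spam but only 20% of the hard spam.

```python
def surviving_mix(incoming, catch_rate):
    """Share of each spam type among the spam the live system missed."""
    missed = {kind: incoming[kind] * (1 - catch_rate[kind]) for kind in incoming}
    total = sum(missed.values())
    return {kind: amount / total for kind, amount in missed.items()}


# Hypothetical traffic mix and catch rates.
mix = surviving_mix(
    incoming={"easy": 0.9, "hard": 0.1},
    catch_rate={"easy": 0.99, "hard": 0.2},
)
print({kind: round(share, 2) for kind, share in mix.items()})
```

Under these assumptions, hard spam goes from a tenth of what arrives to roughly nine tenths of the labelled spam you get to train on.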

We've hit the problem of the counterfactual. If your system intervenes in the real world, it prevents itself from obtaining the true label, and biases the future training data. I describe this as 'eating your future training set'.

There are ways to adjust for these loops. When you think something definitely is spam, then don't intervene, with some low probability. You now have a counterfactual example: an example that shows you what your system said, and what actually happened. This gives you many benefits. Firstly, it allows you to estimate precision and recall on live data. Additionally, it gives you a source of unbiased training data. See this video describing how this is implemented at Stripe.
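A minimal sketch of this kind of holdout, under assumed names and values (the threshold, the 2% holdout rate and both function names are hypothetical, and this is not a description of how Stripe or anyone else implements it). Items the model would block are occasionally let through, so that their true label can be observed later.

```python
import random


def decide(score, threshold=0.9, holdout_rate=0.02, rng=random):
    """Return (blocked, in_holdout) for one item."""
    if score < threshold:
        return False, False  # model doesn't think it's spam: let it through
    if rng.random() < holdout_rate:
        return False, True   # would have blocked, but let through to learn the truth
    return True, False       # block as normal


def estimate_precision(holdout_labels):
    """holdout_labels: True where a held-out item really turned out to be spam."""
    if not holdout_labels:
        return None
    return sum(holdout_labels) / len(holdout_labels)


# Four held-out items, three of which were later confirmed as spam.
print(estimate_precision([True, True, False, True]))  # 0.75
```

Because the holdout is a random sample of everything the model wanted to block, the fraction that really was spam is an estimate of live precision. If the held-out examples are fed back into training, they would typically be weighted by `1 / holdout_rate` so that they stand in for the blocked items that were never labelled.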

Yet, this comes at the cost of more system complexity. More often than not, these systems aren't implemented due to complexity, or political reasons. You can only imagine the conversation with executives: 'anti-spam team, what do you mean you don't want to block all the spam?!'


I've come out of this essay sounding nihilistic. I'm not. I'm very bullish on the future of machine learning systems. It's early days yet. We have solved perhaps small single-digit percentages of the problems that could be done much better with machine learning systems. But, building machine learning systems in 2019 feels like writing JavaScript 10 years ago. There's so much still to build.

But, machine learning frameworks are widely available, I hear you cry. That helps a huge amount with getting off the ground with a set of high quality, performant models. I don't mean to belittle the contribution that these projects have made. I estimate scikit-learn alone has created tens of billions of dollars of economic value to the world. My point is that models are commoditised - your random forest is my random forest. These models form such a small part of the total system itself; the data and system as a whole is where you get an edge. There is a severe shortage of re-usable components in machine learning systems. I find myself building servers to serve models, registries to store them, tooling to introspect them and systems to monitor them wherever I go.

Part of the complexity is due to the fact that this is a new way of thinking about how we build software. We don't yet know what efficient software engineering practices look like for machine learning systems. How much can we take with us from how we build software today? What does change management look like for interdependent machine learning systems? How should a team collaborate on building a neural network? I have my opinions on these subjects, and leaders in the field like Andrej Karpathy have thought about this. Yet, we don't have best practices to follow as an industry.

Feature engineering infrastructure remains a massive source of complexity. Part of the reason that deep learning systems have been so successful is that they can do away with a huge swathe of it. For structured business data, I don't yet see a way forward to get rid of these systems that doesn't trade off significant performance.

Right now, it feels hard to build machine learning systems, and awfully hard to build certain types of them. Perhaps it's a truism: complex systems are generally hard to build. In any case, I hope best practices around building production machine learning systems become more well-defined, and more often discussed.