INTRODUCTION TO MACHINE LEARNING PIPELINES
In this blog, we will discuss what pipelines are and why they are a fundamental unit against which the value of your ML investment should be measured.
Thanks to headline-making publications from research bodies like DeepMind and successful applications of Machine Learning (ML) in the field, ML is no longer only a buzzword (which it still is) but also a real tool that companies use to supercharge their business processes. More and more industries now see that ML can be applied effectively in their domain, and job postings for data scientists now come from more than tech companies and banks: manufacturers, government agencies and publishers are hiring too.
Data science teams are built to bring value to companies, but many discover that this is not a guaranteed outcome. Bringing ML from PowerPoint presentations to Jupyter notebooks was a bit of a step, but getting from there to value is not an obvious transition. It requires taking a step back and thinking not of algorithms, not of models, or even of datasets, but of pipelines.
On Complexities of an ML development process
First, let us recap the complexities of ML development. We will briefly go over the fundamental limitations of computers and real-world data, then focus a bit more on the problems that can actually be addressed within one organization.
I’m sorry, Dave, I’m afraid I can’t do that
Computers are not particularly smart, at least not as smart as a person. Billions of operations per second do not make up for the fact that getting a computer to do what you want requires precise and correct instructions. In the context of ML, we can’t just load all the data that we have about our business, open a high-speed internet connection and wait for the ML model to ‘figure out how to reduce our costs’.
It is necessary to define a precise, measurable use-case that can be addressed with the available data. For example, if we have hourly data about the temperature of the lubricant on a production line, and a log of all the times the line has faulted, can we build a predictive algorithm that warns of a fault one hour in advance?
What’s more, even if we can, how are we going to lower costs once we have a prediction? ML is a powerful tool, but it requires strong human analytical skills to make it useful in an industrial setting.
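To make ‘precise and measurable’ a bit more concrete, here is a minimal sketch of how the lubricant example could be turned into a labelled dataset. The function name, the sample temperatures and the single fault time are all made up for illustration; the only idea taken from the text is ‘will the line fault within the next hour?’.

```python
from datetime import datetime, timedelta

def label_readings(readings, fault_times, horizon=timedelta(hours=1)):
    """Label each (timestamp, temperature) reading with 1 if a fault is
    logged within `horizon` after the reading, else 0."""
    labelled = []
    for ts, temp in readings:
        fault_ahead = any(ts < fault <= ts + horizon for fault in fault_times)
        labelled.append((ts, temp, int(fault_ahead)))
    return labelled

# Hypothetical hourly lubricant temperatures and one logged fault.
start = datetime(2021, 12, 2, 8, 0)
readings = [(start + timedelta(hours=h), t)
            for h, t in enumerate([61.0, 62.5, 70.1, 78.4])]
fault_times = [datetime(2021, 12, 2, 11, 30)]

labelled = label_readings(readings, fault_times)
labels = [label for _, _, label in labelled]
print(labels)
```

Only the 11:00 reading is followed by a fault within the hour, so it is the only positive example. Whether temperature actually carries enough signal to predict that label is exactly the question a human analyst still has to answer.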
Machine learning tutorials on the web often demonstrate the power of ML using artificially curated datasets, such as MNIST. These datasets were created specifically for training machine learning algorithms and are used as benchmarks, which is a marvellous invention. But sometimes they build unrealistic expectations about ML performance.
Real data is far from perfect. Even in the IoT space, where data collection is far removed from a person entering ‘02/12/21’ instead of the company standard ‘2021-12-02’, there are problems. Sensors have noise. Sensors are physical things: they get contaminated, they get disconnected, their power gets cut.
Data (almost) always needs further processing before it can be used for ML, sometimes with a great reduction in size. It is a somewhat obvious point, but perhaps the old mantra ‘a model is only as good as your data’ misses the human agency, and we can add ‘your dataset is only as good as your preparation of the data’.
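As a small illustration of that preparation step, the sketch below normalises the two date styles mentioned above and drops readings that are missing or physically implausible. The list of accepted formats, the day-first reading of ‘02/12/21’ and the temperature limits are assumptions made for this example, not rules from any particular system.

```python
from datetime import datetime

# Assumed formats: the company standard first, then a day-first manual entry.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%y")

def normalise_date(raw):
    """Return the date in ISO format, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(rows, low=-40.0, high=150.0):
    """Keep rows with a parseable date and a temperature inside assumed limits."""
    cleaned = []
    for raw_date, temp in rows:
        date = normalise_date(raw_date)
        if date is None or temp is None or not (low <= temp <= high):
            continue
        cleaned.append((date, temp))
    return cleaned

rows = [
    ("2021-12-02", 61.0),
    ("02/12/21", 62.5),      # manual entry in a different format
    ("2021-12-02", None),    # sensor disconnected
    ("2021-12-02", 9999.0),  # implausible spike
    ("yesterday", 60.0),     # unparseable date
]
cleaned = clean(rows)
print(cleaned)
```

Five raw rows go in and two usable rows come out, which is the ‘great reduction in size’ in miniature.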
Small Leap for the Stakeholder…
In the introduction, we stated that the leap from a model in a Jupyter notebook to a model bringing value is not easy to make. That may be counter-intuitive for the stakeholder: the model is there, the results are good, let’s use the results!
Well, suppose our model running inside a Python script can give helpful results, but we still run it manually. Relying on the discipline of the data scientist might seem ok (it is not) if you need to run the model, say, once a day. But what about IoT data, which streams at sub-second intervals, when the model needs to make a prediction every 10 seconds? Surely, there are automation tools to do this, a stakeholder thinks, and they are right.
But using these tools, and even more so using them correctly and effectively, takes a lot of effort, especially considering that many of the problems that apply to field data also apply to models in the field: things do go wrong.
Another somewhat obvious idea, represented in the diagram below, is that actual algorithm development is a small part of the ML development process. The bulk of the process is the engineering effort to build, deploy and monitor ML solutions, and data scientists alone cannot hold all the ends of these ropes.
Now that we are aware of these problems, let’s talk about pipelines.
Pipeline as a representation of the chain of events
When you hear the word ‘pipeline’ you might imagine water pipes running through your home or oil pipes running thousands of kilometres across countries. These images are not far off. Conceptually, we can define a pipeline as a sequence of connected data transformations.
For example, data is taken from a sensor into a database, then loaded by the code that trains a model on it. There we have a pipeline: Collect->Store->Train.
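In code, Collect->Store->Train can be sketched as three functions where each one consumes the output of the previous one. Everything here is a stand-in: the ‘sensor’ returns made-up numbers, the ‘database’ is a dictionary, and the ‘model’ is a straight line fitted by least squares.

```python
from functools import reduce

def collect():
    # Stand-in for reading a sensor: hypothetical (hour, temperature) pairs.
    return [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0)]

def store(readings):
    # Stand-in for a database: temperatures keyed by hour.
    return dict(readings)

def train(table):
    # Toy "model": least-squares line through the stored points.
    n = len(table)
    mean_x = sum(table.keys()) / n
    mean_y = sum(table.values()) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in table.items())
             / sum((x - mean_x) ** 2 for x in table.keys()))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def run_pipeline(source, *steps):
    # Feed the output of each step into the next one.
    return reduce(lambda data, step: step(data), steps, source())

model = run_pipeline(collect, store, train)
print(model(4))
```

The point is not the model but the shape: each stage only needs to know what the previous stage hands over.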
To talk more about ML pipelines and their properties, let’s introduce an analogy.
A water supply
Let’s imagine that we want to bring water from a water source (like a lake) to the consumers. Before I go into this analogy, let me apologize to any engineers in the field of water supply: I am not claiming to know how these systems work, but I think the way I am describing the system here is useful for understanding ML pipelines and can be understood by anyone. Back to our water supply.
We have our water source, then we pump the water through a debris filter, to clean out weeds and stuff like that. After that, the debris-free water goes through a finer filtering system, which takes out smaller unwanted particles. Then our clean water can be enriched with minerals that are naturally lacking in the area of supply. Finally, the clean, mineral-rich water goes through quality control to make sure its state is suitable for consumption.
That process can be directly mapped to an ML model training pipeline! We have our data store instead of the lake (yes, it can be a data lake, thanks for asking). Then in the data cleaning stage, we remove outliers and missing values (debris). By doing feature engineering we remove features that are not useful to our final goal (particles) and combine some other features to get denser, more useful information. (Note: although the last point has no corresponding process in our water pipeline, remember, it is just an analogy, and it can only go so far. So let’s focus on the bigger picture here.)
Next, by training a model on our data, we enhance the value of our data by generating predictions or insights (adding minerals). Finally, we evaluate the model to make sure it does indeed produce results of satisfactory quality (quality control).
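The mapping can be written down as four small functions, one per stage. This is only a sketch: the column names, the threshold ‘model’ and the 0.9 accuracy bar are invented for the example.

```python
def clean(rows):
    # Debris filter: drop rows with missing values.
    return [r for r in rows if all(v is not None for v in r.values())]

def engineer_features(rows):
    # Particle filter: keep only the column assumed to carry signal.
    return [{"temp": r["temp"], "fault": r["fault"]} for r in rows]

def train(rows):
    # Adding minerals: a toy threshold model halfway between the class means.
    ok = [r["temp"] for r in rows if r["fault"] == 0]
    bad = [r["temp"] for r in rows if r["fault"] == 1]
    threshold = (sum(ok) / len(ok) + sum(bad) / len(bad)) / 2
    return lambda r: int(r["temp"] > threshold)

def evaluate(model, rows, min_accuracy=0.9):
    # Quality control: accuracy must clear an assumed bar.
    accuracy = sum(model(r) == r["fault"] for r in rows) / len(rows)
    return accuracy, accuracy >= min_accuracy

raw = [
    {"temp": 60.0, "serial": "A1", "fault": 0},
    {"temp": 62.0, "serial": "A2", "fault": 0},
    {"temp": None, "serial": "A3", "fault": 0},
    {"temp": 80.0, "serial": "A4", "fault": 1},
    {"temp": 84.0, "serial": "A5", "fault": 1},
]
holdout = [{"temp": 61.0, "fault": 0}, {"temp": 83.0, "fault": 1}]

model = train(engineer_features(clean(raw)))
accuracy, passed = evaluate(model, holdout)
print(accuracy, passed)
```

Reading the last two lines from the inside out gives the same order as the water supply: lake, debris filter, particle filter, minerals, quality control.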
Clean water in every home
It seems that we are ready for the water to be delivered to households. But we don’t want to stop checking the quality of the resulting water.
While it might be obvious for the water supply, why would we want to keep checking the model once its performance was validated?
For precisely the same reason we would keep checking the water: the world is not static. The data store, much like a lake, is not a static entity. Information is constantly being loaded into it, this information comes from the real world, and the world changes.
Now that we think of it, perhaps it would be better to spot changes to the initial conditions (water/data) before the information is propagated through to the quality control. Perhaps, if we notice the change in the water quality earlier, we can put more advanced filters into our system, or change the type of mineral we are trying to add. Read: engineer new features, change model parameters. In the machine learning world that is called Data Drift Monitoring.
The model was trained on data ‘looking’ a certain way, and if the way the data ‘looks’ changes (drifts), it’s a warning that model degradation might soon be observed. So our pipelines will look like those in diagram two.
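One very simple way to check whether the data still ‘looks’ the same is to compare the mean of an incoming batch with the mean of the training data. The sketch below does exactly that. It is an assumption chosen for brevity rather than a recommendation: production monitoring tools typically compare whole distributions, and the threshold of 3 standard errors is arbitrary.

```python
from math import sqrt
from statistics import mean, stdev

def drift_score(reference, batch):
    """How many standard errors the batch mean sits from the reference mean."""
    standard_error = stdev(reference) / sqrt(len(batch))
    return abs(mean(batch) - mean(reference)) / standard_error

def has_drifted(reference, batch, threshold=3.0):
    # Assumed rule of thumb: flag anything beyond `threshold` standard errors.
    return drift_score(reference, batch) > threshold

# Hypothetical temperatures: what the model was trained on, and two new batches.
reference = [60.0, 61.0, 59.0, 60.5, 59.5, 60.0, 61.0, 59.0]
same_regime = [60.2, 59.8, 60.4, 59.6]
hotter_regime = [66.0, 67.5, 65.5, 66.5]

print(has_drifted(reference, same_regime))
print(has_drifted(reference, hotter_regime))
```

The first batch sits around the training mean and passes quietly, while the second batch is several degrees hotter and raises the flag before any prediction has had a chance to go wrong.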
Extending Existing Pipeline
Hopefully, by this time you recognize why pipelines might be such a useful way to think about machine learning systems. Instead of a piece of code floating in the cloud, an ML system becomes a system of data transformations.
Each of the steps is a separate process, running in its separate environment, connecting to the other pieces it needs to interact with.
By recognizing the information flow of a pipeline, we can think about adding complexity and check how effective our current arrangement is.
For example, if we wanted to check just how effective our particle filter (feature engineering) is, we could build a branch of the pipeline for the water (data) to bypass it and go straight into the addition of minerals (ML model training), and then evaluate the quality independently (diagram 3).
We can get another branch to test a different filter instead and compare the effectiveness of the two against each other and the benchmark of not having a filter at all and so on.
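The branching idea can be sketched by making the feature-engineering step a parameter and running the same training and evaluation behind each variant. The data is made up so that the first column carries the signal and the second is mostly noise, and the nearest-centroid ‘model’ is just a placeholder for whatever is actually being trained.

```python
def no_filter(row):
    return row            # bypass branch: features pass through untouched

def keep_first(row):
    return (row[0],)      # candidate filter A: first column only

def keep_second(row):
    return (row[1],)      # candidate filter B: second column only

def train(features, labels):
    # Toy nearest-centroid model: one mean vector per class.
    centroids = {}
    for label in set(labels):
        members = [f for f, l in zip(features, labels) if l == label]
        centroids[label] = tuple(sum(col) / len(members) for col in zip(*members))

    def predict(f):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])))
    return predict

def run_branch(feature_filter, train_rows, train_labels, test_rows, test_labels):
    # Same training and evaluation for every branch; only the filter differs.
    model = train([feature_filter(r) for r in train_rows], train_labels)
    hits = sum(model(feature_filter(r)) == y for r, y in zip(test_rows, test_labels))
    return hits / len(test_rows)

train_rows = [(60.0, 0.0), (62.0, 100.0), (80.0, 0.0), (82.0, -100.0)]
train_labels = [0, 0, 1, 1]
test_rows = [(61.0, -60.0), (81.0, 60.0), (60.0, 40.0), (82.0, -40.0)]
test_labels = [0, 1, 0, 1]

branches = {"no_filter": no_filter, "keep_first": keep_first, "keep_second": keep_second}
results = {name: run_branch(f, train_rows, train_labels, test_rows, test_labels)
           for name, f in branches.items()}
print(results)
```

On this toy data the unfiltered branch and filter B both score 0.5, while filter A scores 1.0, so the comparison against the ‘no filter’ benchmark tells us which filter earns its place.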
By constructing these workflows we can identify the interfaces between components and design our systems more easily while making them more robust and complex.
Now, playing devil’s advocate: while keeping the pipeline as a mental model, we could construct the whole pipeline inside one piece of software and run it in the cloud. Why should we bother separating the bits and having many processes run simultaneously? Wouldn’t it bloat the cost of development as well as increase the number of points of failure? I am glad you asked!
Workflow pipelines versus pipeline-in-code
There are a few arguments for separating ML workflows into separate components, i.e. pipeline steps. Below we will briefly discuss the main ones.
- Less coupling
In their seminal book, The Pragmatic Programmer, David Thomas and Andrew Hunt talk at length about the problems of coupling: when different parts of a system depend on each other, a change in one part requires significant changes in other parts as well.
If the whole pipeline is in one piece of code, all the different steps will be coupled. If one piece breaks, everything breaks. Also, if the interfaces between components are not clearly and rigorously defined (which they often are not in the race to deploy a workflow into production), fixing problems, introducing new features and auditing the program will require changing the code in many different places simultaneously. By having every module run in a separate environment, we can define a standard set of interfaces to connect them, so that changes or faults in one part of the workflow stay contained instead of rippling through the rest of the codebase.
In a distributed architecture it is much easier to add or remove functionality and extend the existing framework, simply by virtue of the decoupling. Data scientists don’t have to untangle every bit of code to add a new part; they can just define a new module and specify how it connects to the others. It might not be as easy as it sounds, but it is much easier and cheaper (when we count data scientists’ time) to extend existing workflows. So, to address the question above about the cost of development: yes, it will be higher initially, but in the long run you will find your system much more agile and better prepared for the challenges of an ever-changing landscape of ML and computing in general.
Moreover, if you think about onboarding new members to the team, instead of understanding a big complicated piece of code, they just have to learn the interfaces the team uses and can bring value much quicker.
Finally, having small components with a specific purpose and standard interfaces makes them reusable, which means your team can assemble new workflows from existing pieces which will reduce the problem of doing the same work twice as well as, again, accelerate the scaling of the ML system.
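To give a feel for what a ‘standard interface’ buys you, here is a deliberately tiny sketch. `Step` and `run_workflow` are hypothetical names, and the steps are plain in-process functions rather than separate environments, but the idea carries over: every component takes data in and hands data out, so two different workflows can be assembled from the same pieces.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass(frozen=True)
class Step:
    """One pipeline module: a name plus a function from input data to output data."""
    name: str
    run: Callable[[Any], Any]

def run_workflow(steps: List[Step], data: Any) -> Any:
    for step in steps:
        try:
            data = step.run(data)
        except Exception as exc:
            # The failing module is identified by name, which narrows the search.
            raise RuntimeError(f"step '{step.name}' failed") from exc
    return data

# Small single-purpose components (all hypothetical).
drop_missing = Step("drop_missing", lambda xs: [x for x in xs if x is not None])
to_celsius = Step("to_celsius", lambda xs: [(x - 32) * 5 / 9 for x in xs])
average = Step("average", lambda xs: sum(xs) / len(xs))
peak = Step("peak", max)

readings_f = [212.0, None, 32.0]
mean_c = run_workflow([drop_missing, to_celsius, average], readings_f)
peak_c = run_workflow([drop_missing, to_celsius, peak], readings_f)
print(mean_c, peak_c)
```

The second workflow reuses two of the three steps from the first one unchanged, and a new team member only needs to learn the `Step` contract to add a fourth.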
Back to ones and zeros
It was promised in the introduction that we would talk about pipelines that bring value to stakeholders, but so far we have only considered model training pipelines. We have held out on inference (when the trained algorithm does the job it is intended to do on unseen data) because that is the point where our water supply analogy breaks down.
Unlike water mineralization, model training does not yet provide the desired outcome. It merely prepares another data transformation step, which then has to be put to use.
Hopefully, by now we are all thinking in pipelines, so it should be easy to extend what we’ve seen before and work out what our inference pipeline will look like (see diagram 4).
The important thing to note here is that before inference the data has to come through all the same stages as before the training, which makes the concept of modular reusable components even more appealing.
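A minimal sketch of that reuse: one `preprocess` function serves both pipelines, so unseen data reaches the model in exactly the shape the model was trained on. The scaling by assumed sensor limits and the ‘above the training mean’ model are placeholders.

```python
def preprocess(readings):
    # Shared stage: drop missing values, then rescale to roughly [0, 1]
    # using assumed sensor limits of 0 and 100.
    return [r / 100.0 for r in readings if r is not None]

def training_pipeline(readings):
    features = preprocess(readings)
    threshold = sum(features) / len(features)  # toy "model": the training mean
    return lambda feature: feature > threshold

def inference_pipeline(model, readings):
    # Same preprocessing stage, applied to unseen data.
    return [model(f) for f in preprocess(readings)]

model = training_pipeline([40.0, None, 60.0])
flags = inference_pipeline(model, [45.0, 70.0, None])
print(flags)
```

If inference used its own slightly different copy of the preprocessing, the model would quietly receive inputs on a scale it has never seen, which is a hard kind of bug to notice.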
Moreover, an architecture like this allows us to automate dealing with data drift. When our monitoring component at the top senses a sufficient change in the data landscape, it can trigger the training pipeline, producing a new model.
This model will be more relevant to the current circumstances since its training set will feature more relevant data. After the new model is validated, we can deploy it into production for inference, all with zero human intervention.
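The loop described above (monitor, retrain, validate, deploy) can be sketched in a few lines. The function names, the drift threshold and the acceptance rule are all assumptions for illustration. A real system would run these as separate pipeline steps with far more careful validation.

```python
from statistics import mean

def train(data):
    # Toy model: always predict the mean of its training data.
    level = mean(data)
    return lambda: level

def validate(model, holdout, tolerance=2.0):
    # Assumed acceptance rule: prediction within `tolerance` of the holdout mean.
    return abs(model() - mean(holdout)) <= tolerance

def monitor_and_retrain(production_model, training_data, recent, drift_threshold=5.0):
    # Monitoring: has the recent data moved away from the training data?
    if abs(mean(recent) - mean(training_data)) <= drift_threshold:
        return production_model, "kept"
    # Retraining: fit on the first half of the recent data, validate on the rest.
    half = len(recent) // 2
    candidate = train(recent[:half])
    if validate(candidate, recent[half:]):
        return candidate, "redeployed"
    return production_model, "rejected"

training_data = [60.0, 61.0, 59.0, 60.0]
old_model = train(training_data)

stable_model, stable_status = monitor_and_retrain(
    old_model, training_data, [60.5, 59.5, 60.0, 61.0])
new_model, new_status = monitor_and_retrain(
    old_model, training_data, [70.0, 72.0, 71.0, 69.0])
print(stable_status, new_status, new_model())
```

Note the third outcome: a candidate that fails validation is rejected and the old model stays in production, which is what makes ‘zero human intervention’ something you can live with.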
For now, the word ‘explainability’ is underlined in red in most word processors, but that might change soon, since it is the name of a feature of machine learning workflows: how much of what is going on can be explained.
The paradigm of treating ML models as black boxes is going away and is being replaced by a more socially and ethically conscious search for explainability. The details of explaining how models work in different use cases will be the subject of another blog post. Here we merely want to point out that the explainability component will soon find its way into your ML workflow.
Adding explainability to the pipeline we have constructed so far is left as an exercise for the reader.
Challenges of building and maintaining pipelines
Pipelines are great, but they are not going to be a silver bullet for all your ML problems. They are merely a better mental and architectural model for your business.
Having said much about the advantages of distributed containerized workflows, it would be irresponsible not to mention the challenges that come with them.
Setting up the systems that would allow your data team to make effective use of pipelines is not a task for a small team of data scientists.
Of course, I don’t want to speak for all DS professionals, but DS roles do not usually entail dealing with cloud architecture, so data scientists often lack the experience and the interest. That is not what DS is. They want to find patterns, mix datasets and build models, as well they should, since without these activities your pipelines will just be pumping dirty water back and forth. You need to add data engineers (yes, plural) to your team to enable data scientists to do their work.
And even having a team of engineers does not guarantee acceleration of your process, because of the point we will discuss next.
Too many options
The market is hot with products and services that promise to solve some part of the ML workflow problem. There are multiple options for providing cloud infrastructure, managing compute resources, managing Docker images, testing, accelerating experimentation and so on. It is impossible to test, or even understand, the advantages and disadvantages of all the combinations of these systems.
Competition is great: the market will identify superior offerings and weed out less reliable, buggy, overdesigned products (well, in theory, at least). So no one is proposing legislative standardization. The truth of the matter, though, is that we can’t wait for this to happen. If you don’t want your organization to chase a train that left years ago, decisions need to be made, and they need to be made soon.
These decisions are not going to be perfect, but you need to avoid bad ones: realizing a mistake too late can cost months of development time, pushing you back in your pursuit of ML-driven value.
ML is a Capital Asset
Some organizations get by without complex architectures: their DS teams assemble what they can using local or small cloud servers, doing their best to bring value to the company. I applaud those teams, but my face is sad because there is a cap on what can be achieved that way. This approach is simply not scalable.
To uncover more of the data’s hidden treasures, more workflows need to run in production, and they should be monitored to prevent wrong insights from creeping into your decision making.
At the other end of the spectrum, companies invest heavily in the engineering side, hiring data engineers, programmers and cloud architects to build a new shiny system for ‘all things ML’.
What these companies sometimes lack is a focus on the actual data science and the data scientists. Did your data science team contribute to the design of the system? How long would it take for them to learn to use it? Can they break it? And, perhaps most importantly, how many hard problems could a bigger, stronger data team have cracked if the resources spent on engineering had been spent on data science or on improving data collection?
It is important to remember that when you introduce ML into your business decisions, the algorithms, the data and the people who develop those algorithms are capital assets. It is unnecessary to spend resources on building a bespoke workflow system when there are offerings from companies that did consult with data scientists, did make sure their systems are hard to break, and will allow your data team to do the things that actually supercharge your business.
If you are not a tech company, building a heavy system that you would have to support is a liability. ML algorithms trained on your data are assets, and it is better to focus on them.
Perhaps some of the thoughts in this article may seem to contradict each other. Let’s go over the main points.
- ML development is challenging because it is not trivial to define a problem that can be solved by ML with the available data, data quality is often an issue, and putting a model into production, where it can robustly deliver value, is harder than it seems.
- On the latter point, many organizations forget to put the required emphasis on the engineering and production parts, hence their ML never leaves the laptops of their data scientists.
- To address these complexities, it is worth looking at ML use cases not as abstract pieces of code doing magic calculations, but as a series of data transformations, configurable and extendible if done right: pipelines.
- The pipeline way of approaching ML makes the system easier to build, easier to maintain, easier to enhance and easier to understand for stakeholders on all levels.
- Pipeline systems, however, are not simple by themselves and require substantial thought and engineering effort to build and maintain.
- Investing heavily in building these systems in-house might draw focus away from the data science efforts that bring value to the company via ML, and introduce more problems than it solves.
- Investing in a ready-made solution from a company with real data science experience can help you get the best of both worlds. You will get a system that allows your data scientists to be productive, while support and system management are carried out by external professionals.