INTRODUCTION TO MACHINE LEARNING PIPELINES

On the complexities of the ML development process

First, let us recap the complexities of ML development. We will briefly go over the fundamental limitations of computers and real-world data, then focus on the problems that can actually be addressed within a single organization.

I’m sorry, Dave, I’m afraid I can’t do that

Computers are not particularly smart, not as smart as a person anyway. Billions of operations per second do not make up for the fact that getting a computer to do what you want requires precise and correct instructions. In the context of ML, we can’t just load all the data we have about our business, open a high-speed internet connection and wait for the ML model to ‘figure out how to reduce our costs’.
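To make that concrete, here is a minimal sketch of what a vague goal like ‘reduce our costs’ has to be turned into before a model can do anything with it: explicit inputs, an explicit target, an explicit learning task. The file and column names are hypothetical, for illustration only.

```python
# A model cannot act on "figure out how to reduce our costs".
# We must spell out a precise supervised task instead.
# All file and column names below are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("orders.csv")                # our business data, exported

features = df[["order_size", "distance_km"]]  # the exact inputs we allow
target = df["delivery_cost"]                  # the exact quantity to predict

model = GradientBoostingRegressor().fit(features, target)
# Only now can the model help: by predicting the cost of planned orders,
# not by inventing a cost-reduction strategy on its own.
```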

Field Data

Machine learning tutorials on the web demonstrate the power of ML using artificially curated datasets such as MNIST. These datasets were created specifically for training machine learning algorithms and are used as benchmarks, which is a marvellous invention. But they can build unrealistic expectations about how ML will perform on real-world, field data.
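To see the contrast, compare how a tutorial loads a benchmark dataset with what field data typically demands before any modelling can start. The benchmark loader below is real scikit-learn API; the field-data file and its columns are hypothetical.

```python
import pandas as pd
from sklearn.datasets import fetch_openml

# Tutorial world: a clean, labelled benchmark dataset in one call.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True)

# Field world: a hypothetical sensor export that needs work first.
raw = pd.read_csv("sensor_readings.csv")
raw["timestamp"] = pd.to_datetime(raw["timestamp"], errors="coerce")
raw = raw.dropna(subset=["timestamp", "reading"])   # missing values
raw = raw[raw["reading"].between(0, 1000)]          # impossible outliers
raw = raw.drop_duplicates(subset=["sensor_id", "timestamp"])
```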

Small Leap for the Stakeholder…

In the introduction, we stated that the leap from a model in a Jupyter notebook to a model bringing value is not easy to make. That may be counter-intuitive for the stakeholder: the model is there, the results are good, let’s use the results!

Pipeline as a representation of the chain of events

When you hear the word ‘pipeline’ you might imagine water pipes running through your home, or oil pipes running thousands of kilometres across countries. These images are not far off. We can conceptually define a pipeline as a connected sequence of data transformations.
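In code, that conceptual definition needs very little machinery. A minimal sketch, with toy steps invented for illustration, chains transformations so that each one consumes the output of the previous one:

```python
from typing import Callable

def make_pipeline(*steps: Callable) -> Callable:
    """Compose transformations into a single callable: a pipeline."""
    def run(data):
        for step in steps:
            data = step(data)   # each step feeds the next one
        return data
    return run

# Toy steps: drop missing readings, then rescale.
drop_missing = lambda rows: [r for r in rows if r is not None]
rescale = lambda rows: [r / 100 for r in rows]

pipeline = make_pipeline(drop_missing, rescale)
print(pipeline([120, None, 250]))   # [1.2, 2.5]
```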

A water supply

Clean water in every home

Extending an Existing Pipeline

Workflow pipelines versus pipeline-in-code

There are a few arguments for separating ML workflows out into dedicated components, i.e. pipeline steps, rather than keeping everything inside a single script. Below we will briefly discuss the main ones.
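As a rough sketch of the distinction: the pipeline-in-code half below is real scikit-learn API, while the workflow half only gestures at what an orchestrator does and is not any particular product’s interface.

```python
# Pipeline-in-code: all steps live inside one process and one object.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

in_code = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Workflow pipeline: each step is an independently runnable unit that
# an orchestrator schedules, retries and logs. Illustrative only.
def ingest():
    """Pull raw data from storage."""

def train():
    """Fit the model on the ingested data."""

def evaluate():
    """Score the trained model on held-out data."""

workflow = [ingest, train, evaluate]  # the orchestrator runs these in order
```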

Back to ones and zeros

It was promised in the introduction that we would talk about pipelines that bring value to the stakeholders, but so far, we have only considered model training pipelines. We have held out on inference (when the trained algorithm does the job it is intended to do on unseen data) because that is the point where our water supply analogy breaks down.
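A minimal sketch of that inference side, assuming the training pipeline persisted its fitted model with joblib (the file name and feature names are hypothetical):

```python
import joblib  # commonly used to persist scikit-learn models

# The training pipeline's last step saved the fitted model, e.g.:
# joblib.dump(model, "model.joblib")

# The inference side starts by loading it, then answers requests
# about unseen data. Feature names are hypothetical.
model = joblib.load("model.joblib")

def predict_cost(order_size: float, distance_km: float) -> float:
    return float(model.predict([[order_size, distance_km]])[0])
```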

Explainability

For now, the word ‘explainability’ is underlined in red in most word processors, but that might change soon, since it names an important property of machine learning workflows: how much of what is going on can be explained.
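One simple example of what ‘explaining’ can mean in practice is inspecting which inputs drive a model’s predictions. The sketch below uses a real scikit-learn dataset and the feature importances of a tree ensemble; it is just one of many explainability techniques.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank inputs by how much they influence the model's decisions.
ranked = sorted(zip(X.columns, model.feature_importances_),
                key=lambda pair: -pair[1])
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```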

Challenges of building and maintaining pipelines

Pipelines are great, but they are not going to be a silver bullet for all your ML problems. They are merely a better mental and architectural model for your business.

Engineering Overhead

Setting up the systems that would allow your data team to make effective use of pipelines is not a task for a small team of data scientists.

Too many options

The market is hot with products and services promising to solve some part of the ML workflow problem. There are multiple options for cloud providers, for managing compute resources, for managing Docker images, for testing, for accelerating experimentation, and so on. It is impossible to test, or even understand, the advantages and disadvantages of all the combinations of these systems.

ML is a Capital Asset

Some organizations get by without complex architectures: their DS teams assemble what they can using local machines or small cloud servers, doing their best to bring value to the company. I applaud those teams, but my face is sad, because there is a cap on what can be achieved that way. The approach simply does not scale.

Conclusion

Some of the thoughts in this article may seem to contradict each other, so let’s go over the main points.

  • Many organizations forget to put the required emphasis on the engineering and production parts of ML, and hence their models never leave the laptops of their data scientists.
  • To address these complexities, it is worth looking at ML use cases not as abstract pieces of code doing magic calculations, but as a series of data transformations that are configurable and extendable if done right: pipelines.
  • The pipeline way of going about ML makes the system easier to build, easier to maintain, easier to enhance, and easier for stakeholders at all levels to understand.
  • Pipeline systems, however, are not simple in themselves, and require substantial thought and engineering effort to build and maintain.
  • Investing heavily in building these systems in-house might draw focus away from the data science efforts that bring value to the company via ML, and introduce more problems than it solves.
  • Investing in a ready solution from a company with real data science experience can give you the best of both worlds: a system that allows your data scientists to be productive, while support and system management are carried out by external professionals.
