MLOps — When Training The Best Model Is Not Enough.

  • by Bartosz Blazejewski
  • 29 July 2021
  • 7 minutes

The job is done not when “the model is built”, but when “the client experienced predictions and was satisfied”

If you are reading this, chances are you have already realized that solving business problems with Machine Learning is not as straightforward as it may seem, and that it is not exclusively about models. According to Gartner’s predictions, “Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not scale in the organization”, and VentureBeat’s Transform 2019 predicted that 87% of AI projects will never make it into production. You can turn this 80%+ ML project failure rate around with an enhanced DevOps process and tools that help you with traceability, data management, and tailored infrastructure.
In recent months, you may have started hearing more and more about a new trend that helps businesses overcome these problems: MLOps. It is a concept that brings DevOps-like practices into the Machine Learning field, focusing on making the development, release, and maintenance of Machine Learning systems efficient and reliable. To better understand the meaning and sudden popularity of MLOps, it is worth briefly analyzing the bigger picture.

Why is MLOps becoming popular?

In recent years, we have seen tremendous technological growth. Big data is widely available, many advanced Machine Learning algorithms are implemented in popular Python (and not only Python) libraries, hardware is much cheaper and more specialized (e.g. TPUs for neural network training) than it was a few years ago, and access to on-demand computing has never been easier, thanks to cloud providers. Yet, as mentioned at the beginning, in the majority of cases models built by data scientists never make it into production.

Elements for ML systems. Adapted from Hidden Technical Debt in Machine Learning Systems.

While ML code lies at the heart of every AI solution, it is just a tiny piece of the entire system needed to reliably deliver high-quality Machine Learning solutions to production.

Let’s take a step back and think about what Software Development looked like a few years ago. I still remember big monolith applications, manual verification of what was about to be released, and some uncertainty about whether everything would go well during a mostly manual release, during which the system was unavailable. That has changed drastically. Now, with the rise of microservice-based architectures and cloud solutions, what often matters is the speed of development and the frequency of zero-downtime releases. In fact, many companies continuously release code to production throughout the day. This all became possible thanks to huge progress in the DevOps area and the availability of many mature open-source tools, such as Git, Docker, Jenkins, and Kubernetes to name a few, which can be easily adopted by any organization.

MLOps is about 10 years behind DevOps

Compared to Software Development, Machine Learning is not only about code but also about data. This adds a lot of complexity and creates a need for a separate set of tools and processes to make development, release, and maintenance easier. Applying Machine Learning to business problems is becoming the norm in software companies, which is why MLOps is emerging to define how to do it efficiently and effectively.

What exactly is MLOps?

The definition from Wikipedia describes it as a “Practice for collaboration and communication between data scientists and operation professionals to help manage production ML lifecycle”. At the same time, “Ops” stands for Operations, so what do Machine Learning Operations have to do with collaboration and communication between two groups of people? It’s easy to get confused here, so let’s try to explain it better.

When you want to apply Machine Learning to solve a business problem, you have to perform many Operations. Those Operations span people of different profiles. In a mature use case, it’s not just two groups; we can list the following profiles:

  • Subject matter experts, who provide business questions and goals, and can validate if the model performance aligns with the initial need
  • Data engineers, who make sure the necessary data are available for data scientists
  • Data scientists, who build the model that addresses business questions and goals, deliver operationalizable models and can assess model quality
  • Software engineers, who integrate machine learning models in the applications and systems
  • DevOps engineers, who integrate ML-based systems with CI/CD pipelines and set up monitoring
  • Model risk managers / auditors, who try to minimize the overall risk of the company using ML models in production and validate compliance with regulations before the ML model is released to production

Of course, in many cases the setup is simplified: you don’t need model risk management, or several functions are performed by the same person. In any case, even the most accurate ML model produced by a data scientist might not make a viable business impact if the software engineer made a small mistake while preparing input data for the model, or simply if the source data prepared by the data engineer became stale before the model was released. This means that the job is done not when “the model is built”, but when “the client experienced predictions and was satisfied”. There are so many scenarios in which things can go wrong that communication and collaboration between all the people executing the Operations is the key to success.

MLOps is about applying DevOps practices to Machine Learning

Compared to DevOps, MLOps introduces many new challenges, among which the most important are:

  • Keeping good communication and collaboration between more people of various profiles
  • Reproducibility of workflows and models (requires versioning of not only code but also of data and models)
  • Testing and validation strategy for models (besides statistical metrics, it’s best to follow Responsible AI practices to train a fair model with an appropriate level of interpretability)
  • Closing the feedback loop (to re-use information about good predictions and mistakes to further improve the model)
  • Model monitoring (to detect when a model’s performance degrades)
  • Continuous training (to react automatically to a model’s performance degradation)
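The reproducibility challenge above can be made concrete with a minimal, framework-free sketch: record a content hash of the exact code, data, and model artifacts behind each training run, so any past run can be traced back and compared artifact by artifact. In practice a platform such as MLflow or DVC handles this; the `fingerprint` and `record_run` names below are purely illustrative.

```python
import hashlib
import json

def fingerprint(payload: bytes) -> str:
    """Content hash used as an immutable version identifier."""
    return hashlib.sha256(payload).hexdigest()[:12]

def record_run(code: bytes, data: bytes, model: bytes, metrics: dict) -> dict:
    """Tie one training run to exact versions of code, data, and model.

    With this record stored centrally, any past run can be located,
    reproduced, and compared against another run.
    """
    return {
        "code_version": fingerprint(code),
        "data_version": fingerprint(data),
        "model_version": fingerprint(model),
        "metrics": metrics,
    }

run = record_run(
    code=b"def train(X, y): ...",
    data=b"age,income,label\n34,52000,1\n",
    model=b"<serialized model bytes>",
    metrics={"auc": 0.91},
)
print(json.dumps(run, indent=2))
```

The key design point is that versions are derived from content, not assigned by hand: if the training data changes by a single row, the run record changes too, which is exactly what makes workflows reproducible.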

Why is MLOps so important?

Just as DevOps drastically improved the software development process, MLOps is becoming a practice of similar impact for the development of ML-based systems. It can help you as an individual, and your organization, avoid common pitfalls in the complex process of releasing ML to production and maintaining it, so that you join the successful minority. It will consequently save you time and money. A correct implementation of MLOps will make the team’s work more productive, improve the speed with which you can deliver high-quality ML models, and make the optimization of business metrics a priority. Being able to iterate on models faster, you will become more competitive in your domain.
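The model-monitoring and continuous-training challenges listed earlier can be sketched as a simple sliding-window check on live predictions, assuming ground-truth labels eventually arrive through the feedback loop. The `ModelMonitor` class and its thresholds below are illustrative, not any specific library’s API.

```python
from collections import deque

class ModelMonitor:
    """Flags performance degradation over a sliding window of predictions."""

    def __init__(self, window: int = 200, min_accuracy: float = 0.85):
        # Keep only the most recent outcomes; old evidence ages out.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def observe(self, prediction, actual) -> None:
        # The feedback loop: ground-truth labels arriving after the prediction.
        self.outcomes.append(prediction == actual)

    @property
    def accuracy(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def needs_retraining(self) -> bool:
        # Only judge once the window holds enough evidence.
        return (len(self.outcomes) == self.outcomes.maxlen
                and self.accuracy < self.min_accuracy)

monitor = ModelMonitor(window=4, min_accuracy=0.75)
for pred, actual in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    monitor.observe(pred, actual)

if monitor.needs_retraining():
    print("accuracy dropped - trigger the continuous-training pipeline")
```

In a real system the `needs_retraining` signal would kick off an automated retraining pipeline rather than a print statement, closing the loop between monitoring and continuous training.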

How to start?

Building a product according to MLOps principles means that everyone involved can easily communicate and collaborate. This is easiest to achieve with cross-functional teams, well-defined processes, and centralized assets. The last point boils down to using the same tools that provide a holistic overview of the Machine Learning model lifecycle; simply put, the same platform.

Start by identifying your needs (part of Unit8’s webinar)

It is very important at the beginning of your journey to spend enough time identifying your exact needs. The MLOps implementation will differ depending on the size of the Data Science department, the needed frequency of model retraining, and compliance & regulatory requirements. In fact, even if you manage to identify your situation correctly, it might not be easy to move forward. Besides the non-trivial automation of model building, just selecting the appropriate platform is a very difficult task, considering the current state of the market. Big players like Uber or Facebook have built their own end-to-end platforms (Michelangelo and FBLearner), the main cloud providers offer their own solutions (Azure ML, AWS SageMaker, Vertex AI), many enterprise platforms target different user groups (e.g. DataRobot, H2O), and, last but not least, there is a growing number of open-source solutions that are starting to be used by technology companies, e.g. Kubeflow by Spotify or MLflow by Yotpo. It is easy to get lost deciding on a platform that will satisfy your needs and, as a consequence, to make a suboptimal decision that will impact your budget. If you need help with MLOps implementation or just advice on the platform (we have tested the majority of them), we are happy to help. Check out our website.

TL;DR

You need to bring DevOps-like practices to Machine Learning (MLOps) to improve your project’s chances of success.

  • ML code is just a tiny piece of the system that has to be built.
  • MLOps improves collaboration and communication between Data Scientists and other groups of people essential to run the project end-to-end.
  • The job is done not when “the model is built”, but when “the client experienced predictions and was satisfied”.
  • MLOps helps you as an individual and your organization avoid common pitfalls in the complex process of releasing ML to production and maintaining it.
  • Start by identifying your needs. The MLOps implementation will differ depending on the size of the Data Science department, needed frequency of model retraining, and compliance & regulatory requirements.

As a complementary resource, I also recommend watching the webinar, where I introduced MLOps by analyzing its various aspects with easy-to-follow examples.

Stay tuned for the next article, which will analyze the impact of MLOps on the architecture of ML systems.

Thanks to Marek Pasieka and Andreas Blum.