Book Review: Machine Learning Logistics: Model Management in the real world
I do not typically review books, but the following book had a profound impact on how I saw DevOps and Data science. It is a fantastic book and one I recommend to all. A few years ago I picked up Nathan Marz’s Big Data book, in which Marz introduces the concept of the Lambda Architecture. I have since used a variation of Marz’s design in most analytical projects. Lambda changed the way I approached big data. Rendezvous has had the same effect for Machine Learning.
You can download “Machine Learning Logistics” from MapR’s website (https://mapr.com/ebook/machine-learning-logistics/). The book is about 90 pages. There is also an accompanying three-part video series, delivered by Ted and Ellen, which is available on YouTube.
“Machine Learning Logistics: Model management in the real world”. There is a lot packed in that short title – I will explain more as we proceed. I do want to call attention to “Model Management”.
The process of managing a machine learning model once it is in production is known as Model Management. Vartak et al. describe model management as “The problem of tracking, storing and indexing large numbers of machine learning models so they may subsequently be shared, queried and analysed” (Vartak et al., 2016). Their definition covers some of the key points, but it falls short in a few key areas.
Sculley notes that machine learning in production is problematic for many reasons (Sculley, 2015), each of which makes Model Management more difficult:
Time constraints to build – Getting a model from idea into production takes a long time.
Time constraints once deployed – A model needs to respond in x number of milliseconds. This will vary from model to model. A good benchmark is 100ms. If a model is in front of a customer on your website and it takes anything longer than 100ms to respond, then the user will notice the delay.
Data changes very quickly – A model may fit the business at the time it is trained, but the business may then change significantly so that the model no longer reflects reality. This is common in most data-intensive applications and is known as Model Decay. I will write more about this in the coming weeks.
Production machine learning needs to be able to integrate with other applications – Once a model is published there may be any number of ways it needs to interact with data from the business.
Machine learning needs to be robust – It needs to continue to work in production and require little maintenance.
Live machine learning does not always work as expected – A model that works well locally may not deliver the expected uplift once it is published.
If one data science team develops in R and the application it is being deployed to is written in Java, how do you get these two teams to work together? This is an extension of the development vs operations problem.
Valentine and Merchan note that a Model Management / DataOps process should support any data science language (Valentine & Merchan, 2017); however, they neglect to discuss the range of languages. Data Science covers a broad spectrum of languages, tools and environments.
A data scientist can train models in R, Python, Scala, Java, C++, Lua, H2O or many other languages. Dunning and Friedman note that when they asked “what is the best tool for machine learning” they found that there was no one answer (Dunning & Friedman, 2017). In their study, the smallest number of tools used was 5 and the largest was 12 (Dunning & Friedman, 2017). How many languages do you work with? I would love to know, so leave a comment.
Dunning & Friedman note that this is due to the development practices in machine learning: “The tool that is optimal in one situation might not be the best in another” (Dunning & Friedman, 2017). Supporting multiple languages is therefore important, and their interactions should be decoupled to the point that the user is not aware of the language used. As a Data Scientist, I do not want to be told which language I need to use. If I am skilled in Python, don’t tell me I need to use R. A model management system should be language agnostic. I point this out because there are many tools on the market which offer a model management service but are locked to a single language. I will publish a meta-analysis of these later in the series.
Machine Learning development can take place locally or in the cloud across many providers (Azure, Google, AWS, as well as private cloud). Depending on the amount of data required to train the model, a data scientist may need a distributed architecture such as a Data Lake on the Hadoop file system. Spark may also be an option where in-memory processing is required. If deep learning is part of the project, then a separate set of tools and hardware is required (GPUs and the software to support them). Data Science draws on so many different skills and languages that it is difficult to imagine one tool which enables Model Management across all of them.
What should a Model Management tool do?
Based on my research, a model management application should be able to complete the following tasks.
Languages – Support development in multiple machine learning languages.
Publication – Allow the publication of a model.
REST API – Expose the model via an API. A consumer of the API should not be able to tell which language the model was written in.
Load balanced – Handle load balancing of the model, including the ability to scale up and down as required.
Retraining – Allow the model to be retrained automatically, and monitor model decay.
Model variants/multi-model deployments – Support different variants of the same model (decision tree, logistic regression, support vector machine).
Automated testing – Perform automated testing.
Telemetrics – Expose reporting for monitoring of models in production.
Source control – Keep models in source control and versioned, and state which source control applications are supported.
Now that we know what a Model Management tool should do, let us look at the Rendezvous architecture and how it tries to solve this problem.
Machine Learning Logistics.
My previous research focused on tools which aim to address machine learning model management. Each of these implements part of the picture but not the whole. I will publish this research in a separate blog post. Looking at the available tools (Azure Model Management, H2O.ai Steam or ModelDB), it is apparent that no single tool currently provides an end-to-end model management system which meets the conditions set out above. As an alternative, Dunning & Friedman’s book, Machine Learning Logistics, takes a conceptual approach to what a model management system should do. Although tools are suggested, this is not a piece of software you can download; you will need to build it yourself. I have just completed this for a customer and I will be blogging (and hopefully presenting) about it soon.
Dunning & Friedman show an understanding of production machine learning far beyond other similar books. They begin by discussing the “myth of the unitary model”. A lot of machine learning books give the impression that once a model is deployed, the role of a machine learning developer is done. This is far from reality. A unitary model implies that you only need to build a model once and it will remain accurate forever. This misconception could be disastrous. The thinking also falls down because it oversimplifies Machine Learning. One model might work for a subset of predictions; however, Dunning and Friedman note that having many models in production tackling the same business objective (customer churn, for example) is more realistic.
Dunning & Friedman note that a model management system should meet the following conditions:
Have the ability to save raw data
Expose models everywhere
Monitor, compare and evaluate models
Deploy models into production
Stage models in production without impacting production
Seamless replacement of a model in production
Automated fall back when a model is not performing.
Dunning and Friedman add some very valid points which support and go beyond the research by Valentine and Merchan. The “ability to save raw data” is not something that has come up in my ongoing research, but it is incredibly important. A model is only as good as the data it is based on; if a model is redeployed or reworked and the data is different, the result will be a different model. On “Monitor, compare and evaluate models”, Dunning and Friedman note that the unitary model is a false assumption, so we need a way to compare variations of the same model and aggregate their responses. I talk about this problem a lot when applying DevOps to Machine Learning. Having better data will make a better model, sometimes with no code changes at all. This is a problem unique to Machine Learning.
Unlike other books and journals, Dunning & Friedman are less concerned about the process of training a model and more interested in what happens once that model is in production. Dunning and Friedman have expanded my initial list of requirements based on their experience executing machine learning for MapR.
Dunning and Friedman have coined the term “Rendezvous approach” for a solution which aims to meet all the conditions listed above. They note that this implementation is a style of DataOps (Dunning & Friedman, 2017, p. 12). This is one of the few examples of the term DataOps being used in conjunction with machine learning model management. The Rendezvous approach is designed for any level of machine learning, although most importantly it supports Enterprise machine learning.
Conceptually, the design is a disparate series of individual pipelines and models, each doing its own part independently. The Rendezvous approach is proposed to be implemented using microservices. Puppet describes high-performing DevOps teams as ones that implement loose coupling in their development, which is essentially microservices (Puppet, 2017). Dunning & Friedman agree, stating that “Independence between microservices is key” (Dunning & Friedman, 2017, p. 14). Independence suggests very loose coupling between microservices, which is often achieved using message queues. In the Rendezvous approach the message queue is proposed to be implemented using Apache Kafka, as both a message system and a postbox. Streaming messages is fundamental to the Rendezvous design.
Dunning & Friedman advocate the need for a persistent streaming service which connects into a global data fabric (Dunning & Friedman, 2017, p. 17). They note that a data fabric goes beyond a data lake: it is a globally distributed store for all data in an organisation, capable of storing transactional data in tables as well as persisting streaming data. The Rendezvous approach is designed to work in concert with a global data fabric (Dunning & Friedman, 2017, p. 25).
When deploying a model from one environment to another, there will most likely be differences between the environments, and Dunning & Friedman note that this could lead to unexpected results (Dunning & Friedman, 2017, p. 19). One solution to this is to use containers. Containers are a lightweight structure for holding the configuration of a machine. They are similar to virtual machines, except that aspects such as the operating system are not required (Dunning & Friedman, 2017, p. 19). Containers need to be stateless but may hold stateful applications; one way to achieve this is to persist storage external to the container service. Azure Model Management has implemented this as part of its solution. The benefit you get with containers is repeatability, and scalability through a cloud-based tool such as Kubernetes. I will post a blog about Docker, Docker-compose, Kubernetes and Helm in the coming weeks.
The Rendezvous approach implements a series of different model variations. As part of the standard architecture, the machine learning model should always be accompanied by two others: a decoy and a canary (Dunning & Friedman, 2017, p. 21). You can see this in the image above.
The decoy model is responsible for simulating a machine learning model, but rather than running a model, it captures the parameters which were passed to it. This allows for monitoring. With a complete list of all requests and a copy of the data used to train a model, we can begin to monitor model accuracy drift / model decay. If a model was trained on 100,000 customers and the incoming requests show 50,000 new customers, this is a good indication that the model might no longer reflect reality and has decayed. The canary model runs alongside the main machine learning model and provides a baseline for comparison. The decoy and canary give a view of how a machine learning model is being used and why it performs as it does.
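The decoy idea can be sketched in a few lines of Python. This is my own minimal illustration, not code from the book: the names DecoyModel and unseen_customer_ratio, and the customer_id field, are hypothetical, and the "new customers relative to the training set" measure is just the rough indicator described above, scaled down from 100,000 customers to 100.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class DecoyModel:
    """Hypothetical decoy: accepts the same requests as a real model but only records them."""

    seen_requests: list[dict[str, Any]] = field(default_factory=list)

    def predict(self, request: dict[str, Any]) -> None:
        # No scoring happens here; the decoy just archives a copy of the input.
        self.seen_requests.append(dict(request))
        return None


def unseen_customer_ratio(training_ids: set[str], decoy: DecoyModel) -> float:
    """New customers seen in live traffic, as a fraction of the training population."""
    live_ids = {request["customer_id"] for request in decoy.seen_requests}
    new_ids = live_ids - training_ids
    return len(new_ids) / len(training_ids)


# The model was trained on customers c0..c99 (assumed toy data).
training_ids = {f"c{i}" for i in range(100)}

decoy = DecoyModel()
for i in range(50, 150):  # customers c50..c149 hit the live endpoint
    decoy.predict({"customer_id": f"c{i}", "amount": 10.0 * i})

ratio = unseen_customer_ratio(training_ids, decoy)
print(f"unseen customers relative to training set: {ratio:.0%}")
```

In a real deployment the decoy would write to a stream or data store rather than a list in memory, and the decision about when a ratio is high enough to trigger retraining is left open here.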
A simplistic approach to exposing a machine learning model is to wrap a REST API around the model. To evaluate the model, you send a GET/POST request to the model with a series of parameters, and the model replies with the answer. Dunning & Friedman note that this simplicity is both a blessing and a hindrance. With this approach, we are unable to pass the incoming parameters to more than one version of the same model. We could use a load balancer to send the requests to different machines. A load balancer reroutes incoming traffic evenly over many machines, or in this case models. While a load balancer will point the input to a different model, this does not allow for comparisons. A load balancer works much like an A/B test: based on the load on a server, a request goes to either the primary server or the secondary. With this approach, there is no ability to run different models on the same request and aggregate the results.
Dunning & Friedman note that an alternative to a load balancer is to use a data stream. Data is streamed into an input, and from there that input is distributed to multiple outputs. The distribution of the request to the models is the real strength of the Rendezvous architecture. As an example, suppose we are trying to classify whether a transaction is fraudulent (binary classification: true or false). We could use one of many different classification algorithms, or we could use all the ones that work for our prediction and select the best option. We might use a simple decision tree which returns a response in <10ms, and a logistic regression which takes a bit longer. If our architecture pushes the request to multiple models, then our code always needs to know which models are live.
This is problematic. We really do not want to deploy a code change with each new model. Dunning and Friedman propose an alternative design using postboxes. When a request for classification is sent, rather than going directly to the model, or to a service which sends it to multiple models, it goes to a postbox. Each of the models then subscribes to that postbox and they all run in parallel. They then need to pass their results and telemetry data back to a service which will decide what to do with the responses.
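To make the postbox idea concrete, here is a minimal in-memory sketch. It is an assumption-laden stand-in: the Postbox class, the model names and the amount thresholds are all hypothetical, and a real implementation would use a streaming system such as Kafka with the models running in parallel, whereas this toy version calls the subscribers one after another.

```python
from typing import Callable

Request = dict
Handler = Callable[[Request], None]


class Postbox:
    """Hypothetical in-memory stand-in for a message stream topic."""

    def __init__(self) -> None:
        self._subscribers: list[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._subscribers.append(handler)

    def publish(self, message: Request) -> None:
        # Every subscriber sees every message; the sender never needs
        # to know which models are currently listening.
        for handler in self._subscribers:
            handler(message)


requests_box = Postbox()
results_box = Postbox()

# Whatever service decides on the final answer listens on the results postbox.
collected: list[dict] = []
results_box.subscribe(collected.append)


def make_model(name: str, threshold: float) -> Handler:
    """Build a toy 'model' that flags transactions above a threshold as fraud."""

    def handle(request: Request) -> None:
        results_box.publish(
            {
                "request_id": request["request_id"],
                "model": name,
                "is_fraud": request["amount"] > threshold,
            }
        )

    return handle


# Adding a new model is one more subscription; the sending code is untouched.
requests_box.subscribe(make_model("decision_tree", 1000.0))
requests_box.subscribe(make_model("logistic_regression", 5000.0))

requests_box.publish({"request_id": "r1", "amount": 2500.0})
for result in collected:
    print(result)
```

The point of the design is visible in the last few lines: the code that publishes the request contains no list of models, so deploying another model does not require a code change on the sending side.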
The Rendezvous approach implements a rendezvous server, which is responsible for comparing scores and passing an acceptable value back to the requester. The rendezvous server pulls in the values obtained by the models and either picks the best response or aggregates the results. If two of the three models indicate fraud, then fraud would be a reasonable response. Weights can be added to the models to steer the selection based on the machine learning developer’s judgement. The concept of the rendezvous server is a powerful design for model management.
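One possible aggregation rule is a weighted majority vote. The sketch below is my own simplification of what a rendezvous server might do with the collected results: the function name rendezvous, the result fields and the default weight of 1.0 are assumptions, and deadlines and fallbacks are ignored entirely.

```python
from __future__ import annotations

from typing import Optional


def rendezvous(results: list[dict], weights: Optional[dict[str, float]] = None) -> bool:
    """Weighted majority vote over the models' fraud predictions (illustrative only)."""
    weights = weights or {}
    fraud_weight = 0.0
    total_weight = 0.0
    for result in results:
        weight = weights.get(result["model"], 1.0)  # models without a weight count once
        total_weight += weight
        if result["is_fraud"]:
            fraud_weight += weight
    # Report fraud only when more than half of the total weight says so.
    return fraud_weight > total_weight / 2


results = [
    {"model": "decision_tree", "is_fraud": True},
    {"model": "logistic_regression", "is_fraud": True},
    {"model": "svm", "is_fraud": False},
]

equal_vote = rendezvous(results)                         # two of three say fraud
weighted_vote = rendezvous(results, weights={"svm": 3.0})  # the dissenting model is trusted more
print(equal_vote, weighted_vote)
```

With equal weights the two models that indicate fraud win the vote. Giving the third model a weight of 3.0 means it outweighs the other two combined, so the answer flips.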
Monitoring machine learning models in production is key to understanding whether performance is degrading over time. Although important, this is seldom covered in the literature; Dunning & Friedman, however, note that it is critical to monitor what a model is doing in production (Dunning & Friedman, 2017, p. 49). Monitoring the availability of a model is not enough to understand how it is operating. To monitor effectively, a model management system needs to capture operational metrics: what input was offered, what answer was given and what the level of accuracy was. Dunning & Friedman recommend either passing these metrics with the REST API response or persisting that data into a side data store. A model management system should aggregate metrics over time. While this is possible by looking at the message queue, the data would sit much better in its own data store. In my own implementation, I used CosmosDB as the store for telemetrics.
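As a rough idea of what such a telemetry record and its aggregation might look like, here is a small sketch. The TelemetryRecord fields and the summarise function are hypothetical, the records sit in a plain list rather than a side data store, and accuracy is left out because it needs ground-truth labels that usually arrive later.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TelemetryRecord:
    """Hypothetical record of one prediction: what went in, what came out, how long it took."""

    request_id: str
    model: str
    inputs: dict
    prediction: bool
    latency_ms: float


def summarise(records: list[TelemetryRecord]) -> dict[str, dict[str, float]]:
    """Aggregate per-model request counts, mean latency and share of fraud predictions."""
    by_model: dict[str, list[TelemetryRecord]] = defaultdict(list)
    for record in records:
        by_model[record.model].append(record)
    return {
        model: {
            "requests": len(rows),
            "mean_latency_ms": mean(r.latency_ms for r in rows),
            "fraud_rate": mean(1.0 if r.prediction else 0.0 for r in rows),
        }
        for model, rows in by_model.items()
    }


records = [
    TelemetryRecord("r1", "decision_tree", {"amount": 2500.0}, True, 8.0),
    TelemetryRecord("r2", "decision_tree", {"amount": 40.0}, False, 12.0),
    TelemetryRecord("r1", "logistic_regression", {"amount": 2500.0}, False, 40.0),
]

summary = summarise(records)
for model, stats in summary.items():
    print(model, stats)
```

Keeping the raw inputs on every record is what later makes it possible to replay requests against a new model or to check for decay.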
If a model is replaced, we need to analyse the potential business impact (Dunning & Friedman, 2017, p. 50). Dunning & Friedman recommend a staged transition from one model to another. A sample of requests is passed to both the existing and the replacement model until the developer is confident that the replacement is satisfactory. This approach does need to be evaluated on a case-by-case basis. If a production model is designed to shape a user’s experience of a website, and each time they refresh they hit a different model which recommends a different experience, this will be confusing to the user. The logic for handling the selection is supported by the rendezvous server. This is known as a policy.
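One way a policy could avoid the "different model on every refresh" problem is to make the assignment sticky per user. The sketch below is an assumption of mine rather than something the book prescribes: it hashes a user id into one of 100 buckets and sends the lowest buckets to the candidate model, and the names bucket_for, choose_model and rollout_percent are hypothetical.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_model(user_id: str, rollout_percent: int) -> str:
    """Return which model's answer this user should receive during a staged transition."""
    # The same user always lands in the same bucket, so a page refresh
    # never flips them between the current and the candidate model.
    return "candidate" if bucket_for(user_id) < rollout_percent else "current"


users = [f"user-{i}" for i in range(1000)]
share = sum(choose_model(user, 10) == "candidate" for user in users) / len(users)
print(f"share of users on the candidate model: {share:.1%}")
```

Raising rollout_percent over time moves more users to the candidate, and anyone already on the candidate stays there because their bucket does not change.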
Dunning raises an interesting idea which would further extend Rendezvous so that it does not need a policy. Dunning suggests that a multi-armed bandit/reinforcement learning algorithm could be responsible for deciding which model’s response to return to the requester. This is quite advanced and relies on having a feedback loop. If it is possible, then what you have is incredible.
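To give a flavour of the bandit idea, here is a sketch using epsilon-greedy, one of the simplest multi-armed bandit strategies. I have picked it only for illustration; nothing here says which bandit algorithm Dunning has in mind. The class name, the model names and the simulated reward rates are all made up, and the reward stands in for whatever feedback signal the business can provide.

```python
import random
from typing import Optional


class EpsilonGreedySelector:
    """Illustrative epsilon-greedy bandit that picks which model's answer to return."""

    def __init__(self, models: list, epsilon: float = 0.1, seed: Optional[int] = None) -> None:
        self.models = list(models)
        self.epsilon = epsilon
        self.counts = {m: 0 for m in self.models}
        self.mean_reward = {m: 0.0 for m in self.models}
        self._rng = random.Random(seed)

    def select(self) -> str:
        # Explore a random model with probability epsilon ...
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.models)
        # ... otherwise exploit the model with the best average reward so far.
        return max(self.models, key=lambda m: self.mean_reward[m])

    def record(self, model: str, reward: float) -> None:
        # Incremental update of the running mean reward for this model.
        self.counts[model] += 1
        n = self.counts[model]
        self.mean_reward[model] += (reward - self.mean_reward[model]) / n


# Simulated feedback loop: assumed probabilities that each model's answer was "good".
true_success_rate = {"decision_tree": 0.2, "logistic_regression": 0.8}
feedback_rng = random.Random(1)

selector = EpsilonGreedySelector(list(true_success_rate), epsilon=0.1, seed=0)
for _ in range(2000):
    chosen = selector.select()
    reward = 1.0 if feedback_rng.random() < true_success_rate[chosen] else 0.0
    selector.record(chosen, reward)

print(selector.counts)
print(selector.mean_reward)
```

The record call is the feedback loop the paragraph above refers to. Without a reliable reward signal the selector has nothing to learn from, which is why this remains the hard part in practice.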
Making comparisons of models, however, is not as easy as in traditional software development. In traditional software development, there are standard key metrics to monitor, and once these are decided we rarely need to monitor additional ones. In a web application, this might be the time a webpage takes to respond or the time taken for the database to respond (Ligus, 2013, p. 21). Machine learning is not this simple. A classification model will return a different set of metrics to a regression model, which makes a comparison between the two quite difficult. We can, however, work around this.
Dunning and Friedman’s Rendezvous architecture proposes a solution to model management beyond that currently implemented by tools such as ModelDB, Steam or Azure’s Model Management. Rendezvous takes elements of DevOps to achieve this; however, it does not detail the full process of how to implement it. There are many different approaches one could take. The following blogs take elements from Dunning and Friedman’s architecture and relate them back to a DevOps approach.
Beyond the book
There appears to be a shift in the machine learning industry towards a recognition that model management is a problem which needs to be resolved. The literature does not prepare data scientists for the problems associated with models in production. The tools that are on the market cover only some of the key areas required to manage models.
In a recent discussion which took place on Twitter, Caitlin Hudon asked “What’s your team’s approach to tracking the quality of models in production”. This started a lively debate and discussion on the various implementations that people have started building. This conversation was supported by the authors of Machine Learning Logistics, Ted Dunning and Ellen Friedman, as well as many other industry experts.
The outcome was that data scientists are noticing a shift towards a structured approach to model management. I imagine that over the next 2 years there will be an increase in model management tools and books. As the popularity of DataOps increases, it is easy to see many data science teams moving towards a mixed-skill team of developers and model managers. In a recent podcast from the O’Reilly Data Show, Ben Lorica talked about the rise of the term “Machine Learning Engineer”: someone who is responsible for getting models deployed. I think this can be achieved with DevOps (although someone is required to set this up). I also think there will be a rise in another new role, that of the “Model Manager”: someone who keeps models running, monitors for decay and retrains when required. This might be the same role, it might not.
Schutt & O’Neill note that the productionisation of machine learning is hard; however, by using the right tools and the right architecture this problem can become trivial. Through source control, continuous integration, continuous deployment, infrastructure as code and monitoring, the process of deploying a model to production can be fully automated.
Thanks for reading. I hope that you do go and read the book. It is a great read!