Data Science. The process, problems and productionisation.
The principles of DevOps have changed the way that software teams work. Data science and software development share many characteristics, yet the DevOps style of working is seldom seen in data science teams. If we take the principles discussed in previous blogs and apply them to machine learning, we will begin to see results similar to the effect DevOps has had on traditional software development. In this blog I want to discuss the fundamental problem with a lot of books on machine learning – the fear of production. That sounds a little extreme, I agree, but there is a problem and many authors dance around the subject. This whole blog series is about tackling that problem.
Why is the productionisation of machine learning so hard?
I am going to make a few generalisations throughout this blog. Some of you will agree and others won’t. Leave a comment if you want to talk more – here goes…
The productionisation of Machine Learning models is the hardest problem in Data Science.
– Schutt and O’Neill
There are typically two types of data scientist: those who build models and those who deploy them. Model development is typically done by academic data scientists, who have spent many years learning statistics and understand which model works best in any given situation. The other type of data scientist is better described as a data engineer (or a machine learning engineer – I talk more about this in the Machine Learning Logistics book review). The role of a data engineer is typically to build and maintain the platform a data scientist works with. On some occasions both roles are performed by the same person. The former make up the bulk of most data science teams; the engineering part is typically drawn from IT.
As a new data scientist looking to begin understanding machine learning, there are many fantastic books and resources you can start with. Most books will introduce the basics of machine learning and how it is accomplished. They will typically take a new data scientist through their interpretation of the machine learning process; however, there is no one process that everyone conforms to. Many books (Provost & Fawcett, 2013, p. 33) reference CRISP-DM, the Cross Industry Standard Process for Data Mining, as a reference for a good machine learning process, while others have their own view.
CRISP-DM shows a flow from data to deployment. This follows six steps in an iterative approach:
Business understanding – Understanding the problem you’re looking to solve.
Data understanding – Understanding the data and what potential problems it might have. This could include profiling the data.
Data preparation – Cleaning, enriching and feature engineering.
Modelling – Selecting the relevant model and training that model.
Evaluation – Evaluating whether the model is a good fit and whether it has an acceptable level of bias and variance.
Deployment – Deploying the model into production.
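To make the loop concrete, here is a minimal sketch of the modelling and evaluation steps in Python, assuming scikit-learn. The toy dataset, the candidate models and the 0.8 accuracy threshold are all illustrative assumptions, not part of CRISP-DM itself.

```python
# Minimal sketch of the CRISP-DM modelling and evaluation steps.
# The candidate models and the 0.8 accuracy threshold are
# illustrative assumptions, not part of CRISP-DM itself.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A toy dataset stands in for the cleaned, feature-engineered data
# produced by the data understanding and preparation steps.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ACCEPTABLE_ACCURACY = 0.8  # agreed with the business up front
chosen = None
for candidate in (LogisticRegression(max_iter=1000),
                  DecisionTreeClassifier(max_depth=3, random_state=0)):
    candidate.fit(X_train, y_train)           # modelling
    score = candidate.score(X_test, y_test)   # evaluation
    if score >= ACCEPTABLE_ACCURACY:          # acceptable: deploy
        chosen = candidate
        break
```

In full CRISP-DM the iteration would also loop back through data understanding and preparation, which this sketch leaves out.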
CRISP-DM is an iterative approach, as indicated by the arrows on the outside of the diagram. Once a model has been evaluated, the process might start again, repeating until the machine learning developer is satisfied with the accuracy of the model; however, the iterations stop at evaluation. Beyond evaluation is a dead end, and that dead end is deployment. CRISP-DM suggests that once a model has been deployed the machine learning process stops. If you have ever deployed a model, you will know that this is far from reality.
Gollapudi defines an alternative process to CRISP-DM, albeit with many similarities. Gollapudi starts with having a good understanding of the problem you’re trying to solve (Gollapudi, S. 2016, p. 32). Without a good understanding of the problem, it is difficult to know when the model you’re building is finished. Gollapudi recommends choosing up front a level of accuracy you will accept – difficult to do, but having visibility of the model’s accuracy is important. Gollapudi states that a typical process flow runs from data analysis to data preparation to modelling to evaluation and finally deployment. As with CRISP-DM, Gollapudi neglects what happens post-deployment.
Machine learning model development might initially stop when a model is deployed; however, there are additional activities which then need to be completed and monitored. Gollapudi is not alone in neglecting a model once deployed: Ramasubramanian and Singh (2017) make no mention of deployment in their book Machine Learning Using R, and Witten et al. also stop their process once a model has been created (Witten et al., 2011). Books on machine learning are fundamentally missing how to get a model deployed, which only adds to Schutt and O’Neill’s argument that this is the hardest problem in data science.
If stopping at deployment is wrong, then what happens once a model has been deployed? You could extend CRISP-DM to include a separate cycle, as depicted in the image below. We have touched on a few of the problems of production machine learning in previous posts. One of those problems is model decay. Retraining is required when a model has significantly decayed. How do we know there has been decay? We need to monitor for it.
Monitor – What has changed in your problem domain? More customers?
Evaluate – What is the impact? Is it enough to demand re-training?
Re-train – Retrain your model.
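The monitor, evaluate, re-train cycle above can be sketched in a few lines of Python. The weekly accuracy figures and the tolerated five-point drop are hypothetical numbers chosen for illustration; pick a threshold that suits your own problem domain.

```python
# Sketch of the monitor -> evaluate -> re-train cycle. The weekly
# accuracy figures and the tolerated drop are hypothetical.
BASELINE_ACCURACY = 0.91   # accuracy measured at deployment time
TOLERATED_DROP = 0.05      # decay we will accept before re-training

def needs_retraining(recent_accuracy):
    """Evaluate: has the model decayed enough to demand re-training?"""
    return (BASELINE_ACCURACY - recent_accuracy) > TOLERATED_DROP

# Monitor: accuracy measured on labelled production data each week.
weekly_accuracy = [0.91, 0.90, 0.88, 0.82]

for week, accuracy in enumerate(weekly_accuracy, start=1):
    if needs_retraining(accuracy):
        # Re-train: in practice this would kick off the training
        # pipeline with fresh data and re-deploy the model.
        print(f"week {week}: decay detected, re-train")
```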
Fellow MVP and Data Scientist Jen Stirrup has shared her thoughts on CRISP-DM before: https://jenstirrup.com/2017/07/01/whats-wrong-with-crisp-dm-and-is-there-an-alternative/
Let’s talk more about deployment. The deployment of a machine learning model is typically not performed by a data scientist. A data scientist may produce a model in R, Python or any language they choose, while the operational deployment team may only deploy code in a standard format such as Java. You might think that sounds crazy, but many companies will not deploy a model unless it is in a “production” language. The model created will need to be refactored, or even redeveloped, into the accepted business language. Provost and Fawcett indicate the risks involved with this operationalisation of a model (Provost & Fawcett, 2013, p. 33). The process of translation is not going to be immediate: the engineering team needs to understand what the model is doing and how it is performing, work out the best format for the incoming data, translate the model to the company standard, test the model and finally deploy that model to production. I don’t think most data scientists do this, let alone an operational deployment team!
Provost and Fawcett do not elaborate on the risks, however it is apparent that this translation will not happen quickly, and when it does, the model will not be the same model created by the data scientist. If a model takes a long time to get into production, there is a chance its accuracy will already have slipped, as reality drifts away from the training data – again, another instance of model decay (Haung, 2017).
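One way to reduce the translation risk is to hand over the learned parameters rather than the code. The sketch below, which assumes a scikit-learn logistic regression, exports the coefficients as plain JSON alongside a reference scoring function that an engineering team could re-implement line-for-line in Java or any other “production” language. Other model types would need their own export format.

```python
# Hypothetical hand-over format: export the learned parameters so the
# engineering team can re-implement scoring in the company's standard
# language. Assumes a scikit-learn logistic regression.
import json
import math

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the data scientist's trained model.
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
model = LogisticRegression().fit(X, y)

# Everything the engineers need to score a row, as plain JSON.
spec = json.dumps({
    "coefficients": model.coef_[0].tolist(),
    "intercept": model.intercept_[0],
})

def score(row, spec_json):
    """Reference scoring function: probability of the positive class.
    This is the logic to translate into the production language."""
    s = json.loads(spec_json)
    z = s["intercept"] + sum(c * v for c, v in zip(s["coefficients"], row))
    return 1.0 / (1.0 + math.exp(-z))
```

Because the hand-over is a data file plus a few lines of arithmetic, the engineering team can test their translation against the reference function instead of reverse-engineering the training code.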
The mix of roles will often determine how successful a data science team is at getting a model into production. If the team is top-heavy with data scientists, the model will often be refactored by operations or never get deployed. If the team is top-heavy with engineers, the model will be overfitted and not fit for purpose. Only where there is a balance between data science and data engineering can there be a production deployment.
A final point on CRISP-DM is that it reinforces the concept of the unitary model: the idea that one model will serve a business objective (churn, for example). This is rarely the case. In reality we might have multiple models running together to answer a single business objective. Each of these models has its own development cycle and deployment cycle. You can begin to see why this is such a problem.
So we need a way to resolve this. We need to ensure that, regardless of engineering ability, a data scientist can deploy a model. You might have guessed that the answer is DevOps. In the next blog we will look at how DevOps is being applied to data science in the industry.