Advancing Analytics
Data Science | AI | DataOps | Engineering


Docker for Data Science: Docker Fundamentals

Why do we need Docker?

I have worked in software development for a while, and in that time I have worked in a variety of organisations. I have seen software deployed in just about every way possible. Let's go back in time and talk a bit about how software was deployed on a single box.

Bad Old Days

You're a developer and you have created an application for the business you work for. It is fantastic and does everything the business needs it to. You want to get it deployed. You go and talk to the operations team and say "Hey, I have finished my application, let's put it live" and they say "Great, what spec server does it need to run on?". As a developer you don't really care what it runs on, just so long as it runs well, so you give the operations team an idea and they figure out the rest. To ensure that the application will run without issue, they go and buy a big server with lots of RAM and CPUs. It runs fine. Your work is done.

Now let's look at what went into the deployment of the application. The operations team had to go and create a new server. This might have been in-house, or it might have been in a data centre that they pay for. It doesn't matter, there is now a server which needs to be maintained. That is a full-time job for someone: monitoring the server, securing it, replacing failed disks. There is a lot of work there. The operations team then also had to install an operating system on the server. With that comes the associated operational cost and maintenance overhead. Let's say it is Windows Server, well then we need an appropriate licence. When a patch is released we need to install it and ensure that the server is secure. So we have a dedicated server with a dedicated OS for a single application, which was built over and above the spec required. That is a lot of waste.

Well, it gets worse. You build another application and you want that deployed. Guess what happens? The process starts again. Another over-powered, dedicated server with a dedicated OS too. All the costs are now doubled and someone has two servers to maintain and two operating systems which need to be monitored and patched. You build two more, four times the problem. You get my point. For each new application we needed a lot of money and effort to get it deployed.

This just didn't work; we needed some way to make this process easier. Storing those servers in someone else's data centre meant that we did not need to manage the servers ourselves, but we still had to pay someone else to do it.

Bad old days of deployment

Virtualisation

Along came the hypervisor and virtual machines. Now we have a new application and we want to get it deployed. We ask operations to deploy it and they say it will be deployed into a virtual environment. As long as it does not impact performance you do not mind. This time all our applications are able to share a single server. Running on that server is some kind of hypervisor. This may be Hyper-V, VMware, or another vendor. What the hypervisor now allows us to do is create virtual servers. These are virtual servers sharing the same bare-metal server, the same disks, the same CPU and RAM, each with just a smaller proportion of it.

We no longer need lots of servers. This is good. Less management overhead and less cost. However, when we create a virtual machine, we still need an operating system, and that still needs to be patched, monitored, paid for, etc. That is still a lot of work for someone. The other thing with operating systems is that they are quite large to install, so a lot of the disk on that server is taken up by operating systems.

The hypervisor was able to help a lot. We only had to maintain a single server, but we still had to maintain all the operating systems and carry their associated costs. This works and continues to work for lots of companies. But what if we do not want to be responsible for the server? What if we just want to deploy our application and have something else take care of the rest? Well, that is where Docker comes into play.

Hypervisor setup

Containers

In our first example, the smallest unit of deployment was the server, with everything on it. For the hypervisor, it was the VM. For Docker it is the container. The smallest element of a deployment in Docker is the application, which is encapsulated in a container. A container can be as complicated or as simple as you need it to be.

In Docker, we have a few core concepts. We have a Dockerfile, which contains all the metadata for how the application should run. It also states which image this container should inherit from (keep reading to understand what I mean). We then need the application code that the container is going to run and anything else associated with this application. We do not need to provide an operating system or a server for each application.

Docker runs on a server, typically a Linux server. You can install a version which uses a Windows machine, but I do not recommend doing that today as it is not the best experience. That server is running the Docker Engine (the Docker daemon). Our application is packaged as a container and then that container is set to run on our server. That is it, it is deployed.

So what does that mean for the amount of storage on the server? As we do not need all those operating systems, we have a load more space for more applications.
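To put some rough numbers on that, here is a back-of-the-envelope sketch in Python. The sizes are made-up assumptions for illustration only (they are not measurements), and the sketch ignores the space taken by base images and by the hypervisor itself.

```python
def disk_used_gb(apps, os_size_gb, app_size_gb, os_copies):
    """Total disk used by `os_copies` operating systems plus `apps` applications."""
    return os_copies * os_size_gb + apps * app_size_gb


APPS = 4          # number of applications we want to deploy
OS_SIZE_GB = 20   # assumed size of one operating system install
APP_SIZE_GB = 1   # assumed size of one application

# Virtual machines: every application brings its own guest operating system.
vm_total = disk_used_gb(APPS, OS_SIZE_GB, APP_SIZE_GB, os_copies=APPS)

# Containers: the applications share the single operating system on the host.
container_total = disk_used_gb(APPS, OS_SIZE_GB, APP_SIZE_GB, os_copies=1)

print(f"VMs: {vm_total} GB, containers: {container_total} GB")
```

The exact figures will differ in practice, but the shape of the saving is the point: the operating system cost is paid once rather than once per application.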

Docker Fundamentals

We have mentioned images, containers, Dockerfiles and more. Let's have a better look at each of these.

An image is the built (compiled) version of your application. It is the Dockerfile which tells the Docker daemon how to build and run your application. In this file you will specify how the application will run. Firstly you will say which image it should inherit from. This image is known as the Base Image. This image will be hosted in a container registry somewhere, most likely hub.docker.com. This could be Ubuntu, CentOS or a flavour of Linux with a series of base languages installed and configured.

The image will also contain all the code and dependencies required to build the application.
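As a rough illustration of what such a file might contain, the sketch below uses a small Python helper to render a minimal Dockerfile. The helper name `render_dockerfile`, the base image `python:3.11-slim`, the `/app` working directory and the `app.py` script are all assumptions made for this example, not something prescribed by Docker or by this series.

```python
import json


def render_dockerfile(base_image, workdir, command):
    """Render a minimal Dockerfile as text (hypothetical helper for illustration)."""
    lines = [
        f"FROM {base_image}",          # the Base Image to inherit from
        f"WORKDIR {workdir}",          # directory inside the image to work in
        "COPY . .",                    # copy the application code into the image
        f"CMD {json.dumps(command)}",  # command the container runs on start
    ]
    return "\n".join(lines) + "\n"


# Assumed values: a Python base image and an app.py entry point.
dockerfile_text = render_dockerfile(
    base_image="python:3.11-slim",
    workdir="/app",
    command=["python", "app.py"],
)
print(dockerfile_text)
```

A real Dockerfile would usually also install the application's dependencies; that step is left out here to keep the sketch short.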

Docker: An Image

Images do not really do much on their own. To run the application we need to build the image and get it running as a container. We will explore the commands required to do this in another blog. For now you need to understand that the Dockerfile, app and files are built into an image. That image will reside in your local Docker instance. That image will then be run to get a container. Multiple containers can be created from the same image.
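The relationship between one image and many containers can be sketched with a toy model in Python. This is not the Docker API; the `Image` and `Container` classes and the `run` function are hypothetical names used only to show that a single built image can back several running containers.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Image:
    """A built, read-only artefact (toy model, not the Docker API)."""
    name: str
    base_image: str


@dataclass
class Container:
    """A running instance created from an image (toy model)."""
    container_id: int
    image: Image
    status: str = "running"


_next_id = count(1)  # simple counter to hand out container ids


def run(image):
    """Create a new container from the given image."""
    return Container(container_id=next(_next_id), image=image)


# One image (names are assumptions for illustration)...
image = Image(name="my-app", base_image="python:3.11-slim")

# ...and three separate containers created from it.
containers = [run(image) for _ in range(3)]

for c in containers:
    print(c.container_id, c.image.name, c.status)
```

Each container has its own identity and state, while they all point back to the same unchanged image.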

Docker: Image to container.

Once a container is running you can connect to it much like any application running on a server. This container is, however, running locally on the machine it was run on. What if you want to run that somewhere else? Possibly on Kubernetes in Azure. For that we need to push the created image to a repository. We mentioned that the Base Image is in a public repository (hub.docker.com), but we can have our own repositories too. In Azure this is done with Azure Container Registry (ACR). Kubernetes needs an image to run. To get an image into ACR we first need to tag it and then push it to ACR.

Docker: Tagging

Once an image is tagged it can be pushed and made available to Kubernetes or another user/system.
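As a minimal sketch of the tag-and-push step, the Python below builds the image reference and the two Docker CLI commands involved, without running them. The registry name `myregistry`, the repository `my-app` and the tag `v1` are placeholder assumptions, and the helper functions are hypothetical names for this example. It assumes the usual ACR login server form of `<registry-name>.azurecr.io`.

```python
def acr_image_reference(registry_name, repository, tag="latest"):
    """Build a full image reference for an Azure Container Registry."""
    return f"{registry_name}.azurecr.io/{repository}:{tag}"


def tag_and_push_commands(local_image, target_reference):
    """Return the docker commands to tag a local image and push it."""
    return [
        ["docker", "tag", local_image, target_reference],  # add the registry name to the image
        ["docker", "push", target_reference],              # upload it to the registry
    ]


# Placeholder names, assumed for illustration.
target = acr_image_reference("myregistry", "my-app", "v1")
commands = tag_and_push_commands("my-app:latest", target)

for cmd in commands:
    print(" ".join(cmd))
```

You would also need to authenticate against the registry before the push will succeed. The commands are only printed here, so the sketch is safe to run on a machine without Docker installed.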

In the next blog we will look at getting Docker installed. From there we will build a machine learning model and deploy to Docker, then on to Kubernetes.

See you next time.