DevOps for Databricks: Using Rest API & Python in YAML CI/CD Pipelines

Welcome to the third instalment in my DevOps for Databricks blog series. Here we will take the Python scripts from the last blog (here) and incorporate them into CI/CD (Continuous Integration/Continuous Deployment) pipelines. All examples are available in full in the GitHub repo here.

We will use Azure as our Cloud Provider and Azure DevOps as our DevOps platform. As we are using the Databricks REST API and Python, everything demonstrated can be transferred to other platforms. The DevOps pipelines are written in YAML, which is the language of choice for Azure DevOps pipelines, as it is for various other DevOps platforms, including GitHub Actions and CircleCI.

So, our DevOps flow is as follows:


Getting Started

The first thing we need is a YAML file. Create this in your solution alongside your pipeline scripts; it is best practice to keep these in separate folders, as you will find in the GitHub repo mentioned above.

Now that we have our YAML file, there are a few core bits of YAML we need:

  • Specify Triggers: Which branches will trigger the pipeline

  • Specify DevOps Agent: This can be an “Out of the box” Microsoft Agent or one you create yourself (DevOps Agents are the compute that runs your pipelines)

  • A connection to your Azure Key Vault in which you store sensitive information, such as your Service Principal Secret; in Azure DevOps this is typically a variable group linked to the Key Vault, referenced under "variables" below

All are illustrated in the following YAML:

trigger:
  branches:
    include:
      - main
  paths:
    include:
      - notebooks/*

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: devops-for-dbx-vg


DevOps Stages, Jobs and Tasks

It is important we take a moment to understand the different building blocks of an Azure DevOps pipeline:

Stages

Stages can be, for example, Build and Release:

1. Build: Compile/check code, run tests

2. Release: Push everything into your environment

 

Jobs

Jobs run on a single DevOps Agent and can be used to group related Steps/Tasks together.

Steps

A Step is a specific, granular “Script” or “Task”.

The relationship between Stages, Jobs and Steps is illustrated with a diagram in the Microsoft documentation; further details can be found here.

All examples in this blog will show the Jobs & Tasks only; full examples including the Stages can be found in the GitHub repo here.

 


Authenticating

We need to authenticate with Databricks and store our Bearer and Management tokens as detailed in the previous blog (here). To do this, in the authentication Python script, we output the tokens from our dbrks_bearer_token and dbrks_management_token methods to Environment Variables:

import os

# dbrks_bearer_token() and dbrks_management_token() are defined earlier in the
# authentication script (see the previous blog and the GitHub repo)
DBRKS_BEARER_TOKEN = dbrks_bearer_token()
DBRKS_MANAGEMENT_TOKEN = dbrks_management_token()

# Store the tokens as Environment Variables on the DevOps Agent
os.environ['DBRKS_BEARER_TOKEN'] = DBRKS_BEARER_TOKEN
os.environ['DBRKS_MANAGEMENT_TOKEN'] = DBRKS_MANAGEMENT_TOKEN

print("DBRKS_BEARER_TOKEN", os.environ['DBRKS_BEARER_TOKEN'])
print("DBRKS_MANAGEMENT_TOKEN", os.environ['DBRKS_MANAGEMENT_TOKEN'])

# Expose the tokens as Azure DevOps output variables so later jobs can consume them
print("##vso[task.setvariable variable=DBRKS_BEARER_TOKEN;isOutput=true;]{b}".format(b=DBRKS_BEARER_TOKEN))
print("##vso[task.setvariable variable=DBRKS_MANAGEMENT_TOKEN;isOutput=true;]{b}".format(b=DBRKS_MANAGEMENT_TOKEN))

An Environment Variable is a variable stored outside of the Python script; in our case it is stored on the DevOps Agent running the pipeline, and is therefore accessible to other scripts/programs running on that Agent. We will not cover DevOps Agents in detail in this blog; the simplest description is that they are the compute that runs your pipeline, normally a VM (Virtual Machine) or a Docker container.
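
To make the hand-off concrete, here is a minimal sketch of how a later pipeline script might read those Environment Variables back and build the request headers that the Azure Databricks REST API expects when called with a Service Principal. This is an illustration rather than the exact code in the repo: the helper name, and the use of the DBRKS_SUBSCRIPTION_ID, DBRKS_RESOURCE_GROUP and DBRKS_WORKSPACE_NAME variables (set further down in this blog), are assumptions.

import os

def dbrks_request_headers():
    # Hypothetical helper (for illustration only): builds the headers the
    # Azure Databricks REST API expects when authenticating with a
    # Service Principal, using the Environment Variables set by authenticate.py.
    workspace_resource_id = (
        "/subscriptions/" + os.environ["DBRKS_SUBSCRIPTION_ID"]
        + "/resourceGroups/" + os.environ["DBRKS_RESOURCE_GROUP"]
        + "/providers/Microsoft.Databricks/workspaces/" + os.environ["DBRKS_WORKSPACE_NAME"]
    )
    return {
        "Authorization": "Bearer " + os.environ["DBRKS_BEARER_TOKEN"],
        "X-Databricks-Azure-SP-Management-Token": os.environ["DBRKS_MANAGEMENT_TOKEN"],
        "X-Databricks-Azure-Workspace-Resource-Id": workspace_resource_id,
    }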

The below YAML creates an Azure DevOps Job that executes our authentication script:

- job: set_up_databricks_auth
  steps:
    - task: PythonScript@0
      displayName: "Get SVC Auth Tokens"
      name: "auth_tokens"
      inputs:
        scriptSource: 'filePath'
        scriptPath: pipelineScripts/authenticate.py
      env:
        SVCApplicationID: '$(SVCApplicationID)'
        SVCSecretKey: '$(SVCSecretKey)'
        SVCDirectoryID: '$(SVCDirectoryID)'
Note the “env:” element: this injects Environment Variables from the secrets obtained from our Key Vault. These variables are then used in our authentication script (see the GitHub repo here).
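
For context, the sketch below shows roughly how those injected SVC* Environment Variables can be exchanged for the two tokens using the Azure AD client credentials flow. Treat it as an assumption-laden outline rather than the exact contents of authenticate.py; the real script is in the GitHub repo.

import os
import requests

# Rough outline only - see the repo for the real authenticate.py
TOKEN_URL = "https://login.microsoftonline.com/{}/oauth2/token".format(
    os.environ["SVCDirectoryID"]
)

def get_aad_token(resource):
    # Client credentials flow: swap the Service Principal ID/secret for a token
    payload = {
        "grant_type": "client_credentials",
        "client_id": os.environ["SVCApplicationID"],
        "client_secret": os.environ["SVCSecretKey"],
        "resource": resource,
    }
    response = requests.post(TOKEN_URL, data=payload)
    response.raise_for_status()
    return response.json()["access_token"]

# AzureDatabricks first-party application ID, then the Azure management resource
DBRKS_BEARER_TOKEN = get_aad_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d")
DBRKS_MANAGEMENT_TOKEN = get_aad_token("https://management.core.windows.net/")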


Create and Manage a Cluster

Now that we have the foundation YAML, things get easier: we know what we want to do and have already created similar tasks. To create and manage the Cluster we simply need to create more tasks, like those in the authentication Job illustrated above.

 

The below YAML runs our script to create a cluster:

- job: create_cluster
  dependsOn:
    - set_up_databricks_auth
  variables:
    DBRKS_MANAGEMENT_TOKEN: $[dependencies.set_up_databricks_auth.outputs['auth_tokens.DBRKS_MANAGEMENT_TOKEN']]
    DBRKS_BEARER_TOKEN: $[dependencies.set_up_databricks_auth.outputs['auth_tokens.DBRKS_BEARER_TOKEN']]
  steps:
    - task: PythonScript@0
      displayName: "create cluster"
      name: "create_cluster"
      inputs:
        scriptSource: 'filePath'
        scriptPath: pipelineScripts/create_cluster.py
      env:
        DBRKS_BEARER_TOKEN: $(DBRKS_BEARER_TOKEN)
        DBRKS_MANAGEMENT_TOKEN: $(DBRKS_MANAGEMENT_TOKEN)
        DBRKS_SUBSCRIPTION_ID: '$(SubscriptionID)'
        DBRKS_INSTANCE: '$(DBXInstance)'
        DBRKS_RESOURCE_GROUP: '$(ResourceGroup)'
        DBRKS_WORKSPACE_NAME: '$(WorkspaceName)'
        DefaultWorkingDirectory: $(System.DefaultWorkingDirectory)

Note the “dependsOn” element, which makes sure the previous job has completed before this one runs; without it, we would not have our authentication tokens. We pass the tokens to the script using the “variables” element.

As part of the script triggered above, we monitor the cluster to make sure it is created successfully, as detailed in the previous blog here. If the cluster creation fails, an error is thrown, which in turn stops the DevOps pipeline.
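
The monitoring logic looks something like the following sketch, which is a simplified stand-in for the repo's create_cluster.py rather than a copy of it; in particular, the assumption that DBRKS_INSTANCE holds just the workspace hostname, and the exact states checked, may differ from the real script.

import os
import time
import requests

CLUSTER_GET_URL = "https://{}/api/2.0/clusters/get".format(os.environ["DBRKS_INSTANCE"])

def wait_for_cluster(cluster_id, headers, timeout_seconds=1200):
    # Poll the Clusters API until the new cluster is RUNNING; raise on failure
    # so that the DevOps pipeline stops
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        response = requests.get(CLUSTER_GET_URL, headers=headers,
                                params={"cluster_id": cluster_id})
        response.raise_for_status()
        state = response.json()["state"]
        if state == "RUNNING":
            return
        if state in ("ERROR", "TERMINATED", "UNKNOWN"):
            raise Exception("Cluster creation failed, state: " + state)
        time.sleep(30)
    raise Exception("Timed out waiting for the cluster to start")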


Add Notebooks

OK, so things are getting even easier: we’ve added authentication and created a cluster using our YAML and Python scripts, and, yes, you’ve guessed it, running the script to add Notebooks is the exact same process of YAML Job(s) and Task(s). The YAML below illustrates what is needed:

- job: upload_notebooks
  dependsOn:
    - set_up_databricks_auth
  variables:
    DBRKS_MANAGEMENT_TOKEN: $[dependencies.set_up_databricks_auth.outputs['auth_tokens.DBRKS_MANAGEMENT_TOKEN']]
    DBRKS_BEARER_TOKEN: $[dependencies.set_up_databricks_auth.outputs['auth_tokens.DBRKS_BEARER_TOKEN']]
  steps:
    - task: PythonScript@0
      displayName: "upload notebooks to DBX"
      inputs:
        scriptSource: 'filePath'
        scriptPath: pipelineScripts/upload_notebooks_to_dbx.py
      env:
        DBRKS_BEARER_TOKEN: $(DBRKS_BEARER_TOKEN)
        DBRKS_MANAGEMENT_TOKEN: $(DBRKS_MANAGEMENT_TOKEN)
        DBRKS_SUBSCRIPTION_ID: '$(SubscriptionID)'
        DBRKS_INSTANCE: '$(DBXInstance)'
        DBRKS_RESOURCE_GROUP: '$(ResourceGroup)'
        DBRKS_WORKSPACE_NAME: '$(WorkspaceName)'
        DefaultWorkingDirectory: $(System.DefaultWorkingDirectory)
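
Under the hood, the upload script pushes each notebook through the Workspace API's import endpoint. The sketch below is a simplified, hedged version of that idea; the helper name, file paths and the SOURCE/PYTHON format are illustrative assumptions, not the exact code in upload_notebooks_to_dbx.py.

import base64
import os
import requests

WORKSPACE_IMPORT_URL = "https://{}/api/2.0/workspace/import".format(os.environ["DBRKS_INSTANCE"])

def upload_notebook(local_path, workspace_path, headers):
    # Read the notebook source, base64-encode it and import it into the
    # Databricks workspace, overwriting any existing copy
    with open(local_path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "path": workspace_path,
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    }
    response = requests.post(WORKSPACE_IMPORT_URL, headers=headers, json=payload)
    response.raise_for_status()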

Summary

So we’ve looked at the basics of setting up an Azure DevOps pipeline to run our Python scripts. Hopefully you can see how easy this is to glue together once you have the basic YAML set up.

In the GitHub repo here you will find examples of working with Azure DevOps Stages and Artifacts; for this blog we kept it simple.

Anna Wykes