Advancing Analytics
Data Science | AI | DataOps | Engineering



Azure Databricks. The Blog of 60 questions. Part 3


Co-written by Terry McCann & Simon Whiteley.

A few weeks ago we delivered a condensed version of our Azure Databricks course to a sold-out crowd at the UK's largest data platform conference, SQLBits. The course was a condensed version of our 3-day Applied Azure Databricks programme. During the course we were asked a lot of incredible questions. This blog collects all of those questions along with a set of detailed answers. If you are looking to accelerate your journey to Databricks, take a look at our Databricks services.

There were over 60 questions. Some are a little duplicated, and some require a lot more detail than others. 60 is too many to tackle in one blog, so this is the third of 6 blogs going into detail on the questions. They are posted in the order they were asked. I have altered the questions to give them more context. Thank you to all those who asked questions.

Part one. Questions 1 to 10
Part two. Questions 11 to 20
Part three. Questions 21 to 30
Part four. Questions 31 to 40
Part five. Questions 41 to 50
Part six. Questions 51 to 63

Q21: Can I manage the Azure Databricks resource, when setting up, to point to a dedicated resource group? Right now it just creates a concatenated random string (e.g. databricks-rg-databricks-demo-workspace-blah). And what if I only have 1 resource group at my disposal (from a user rights point of view)?

A: Multiple parts get deployed when you create a Databricks workspace. When you create a workspace in a resource group, a secondary resource group is also created. This is where the cluster will be created, along with a blob storage account for the Databricks File System (DBFS). Through the portal you cannot specify which resource group is created; this is intentional. You can, however, specify it if you deploy using an ARM template. The resource group containing DBFS and the cluster is managed for a reason. I saw a customer recently create 2 workspaces, both pointing at the same managed resource group. All worked fine until they deleted one of the workspaces, and it deleted the shared managed resource group. Needless to say, they lost everything on the other workspace too. In summary, you can with an ARM template, but you should not! This has been raised and is on a backlog to be corrected.

Q22: Is it possible to use Spark for streaming data?

A: Yes, streaming is a major component of Spark. It supports multiple streaming patterns: you can read from a stream and write to a file, read from a stream and write to another stream, read a file into a stream, or stream multiple deltas. Streaming is built into Spark. It is worth calling out that the Databricks recommendation around streaming is to have one cluster per stream, so that if a cluster goes down you only impact one stream. The trade-off is that it will cost you more money to run multiple clusters.
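As a rough illustration of the "read from a stream and write to a file" pattern, here is a minimal PySpark sketch. It assumes an existing SparkSession is passed in (in a Databricks notebook that would be the built-in `spark`). The function name, the paths and the use of the built-in `rate` test source are all our own choices for the example, not something from the course.

```python
def start_rate_to_parquet_stream(spark, output_path, checkpoint_path):
    """Read a synthetic stream and continuously append it to Parquet files.

    Hypothetical helper for illustration; `spark` is an existing SparkSession.
    """
    # The built-in "rate" source emits (timestamp, value) rows, handy for demos.
    events = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", 5)
        .load()
    )

    # File sinks need a checkpoint location so the query can recover its
    # progress after a restart.
    return (
        events.writeStream
        .format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )
```

Calling `start_rate_to_parquet_stream(spark, "/tmp/demo/out", "/tmp/demo/chk")` returns a streaming query handle; call `.stop()` on it when you are done. Swapping the `rate` source for a real one (Event Hubs, Kafka, a folder of files) changes the read side only.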

Q23: Which is the most reliable Internet Browser for using with Databricks (e.g. Chrome, Firefox, MS Edge, etc.)

A: Taken from the documentation, the supported browsers are:

  • Google Chrome (current version)

  • Firefox (current version)

  • Safari (current version)

  • Microsoft Edge* (current version)

  • Internet Explorer 11* on Windows 7, 8, or 10 (with latest Windows updates applied)

Q24: Could you potentially automate the deployment through ADF, as that supports calling REST APIs? E.g. grab latest branch, deploy, then run?

A: Automated deployment is a bit of a sticking point at the moment. To use automation you need a bearer token, and at this point it can only be created manually. That means you need a two-step process: deploy the workspace and managed resource group and get the bearer token, then use the bearer token and the API to do all the automation you require. It is only a small annoyance. Hopefully this will change very soon.
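To make the second step concrete, here is a small sketch of how a bearer token is attached to a Databricks REST API call, using only the Python standard library. It targets the Jobs API `run-now` endpoint. The helper name, workspace URL, token and job id are placeholders we have made up for the example, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_run_now_request(workspace_url, token, job_id):
    """Build (but do not send) a Jobs API 'run-now' request.

    Hypothetical helper: workspace_url, token and job_id are placeholders.
    """
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/jobs/run-now",
        data=body,
        method="POST",
        headers={
            # The manually created token goes in the Authorization header.
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
    )


# Usage (placeholders; sending requires a real workspace and token):
# req = build_run_now_request("https://<your-workspace>.azuredatabricks.net", "<token>", 42)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

An ADF Web activity would make the same call: a POST to the endpoint with the `Authorization: Bearer ...` header and a JSON body.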

Q25: Would you recommend taking the Databricks certification?

A: Yes!

Q26: Where in Databricks do you set the # of partitions?

A: Use spark.conf.set("spark.sql.shuffle.partitions", 10). That is set for the session and not the cluster, so you need to run it before you run any other code. Then, if you're writing data out of Databricks, you can specify how the output is partitioned.
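A short sketch of both halves of that answer follows: setting the shuffle partitions for the session, then choosing how the data is partitioned on write. It assumes an existing SparkSession and DataFrame are passed in. The function name, the output path and the partition column are hypothetical.

```python
def write_partitioned(spark, df, output_path, partition_column, shuffle_partitions=10):
    """Set the session's shuffle partitions, then write df partitioned by a column.

    Hypothetical helper: `spark` is an existing SparkSession, `df` a DataFrame.
    """
    # Session-level setting: applies to shuffles (joins, aggregations) run
    # after this line in the current session, not to the whole cluster.
    spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))

    # On write, partitionBy controls the folder layout of the output,
    # producing one sub-folder per distinct value of partition_column.
    (
        df.write
        .mode("overwrite")
        .partitionBy(partition_column)
        .parquet(output_path)
    )
```

Note that the two settings are independent: `spark.sql.shuffle.partitions` affects how work is split during processing, while `partitionBy` affects how the files land on disk.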

Q27: Can you dynamically increase partitions as workers scale with DB?

A: Ish. You could do it, but ideally you would set the partitions to match the upper size of the cluster. If you do autoscale, set the partitions for the maximum size, not the minimum. Changing them dynamically is not really recommended.
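One way to size for the maximum is to derive the number from the autoscale ceiling. The sketch below is a rule of thumb, not a Databricks recommendation: the function name is ours, and the multiplier of 2 tasks per core is an assumption you should tune for your workload.

```python
def shuffle_partitions_for_cluster(max_workers, cores_per_worker, tasks_per_core=2):
    """Estimate a shuffle partition count for the cluster's maximum size.

    Assumption: tasks_per_core=2 is an illustrative rule of thumb only.
    """
    if max_workers < 1 or cores_per_worker < 1 or tasks_per_core < 1:
        raise ValueError("all arguments must be positive integers")
    # Total cores at the autoscale ceiling, times a few tasks per core so
    # every core stays busy when the cluster is fully scaled out.
    return max_workers * cores_per_worker * tasks_per_core


# Example: autoscaling up to 8 workers with 4 cores each.
partitions = shuffle_partitions_for_cluster(max_workers=8, cores_per_worker=4)
# In a notebook you would then apply it to the session:
# spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```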

Q28: What's the least privilege permission to grant to a service principal to access a Data Lake - Storage Contributor?

A: Not a Databricks question, but a good one. We would normally try to avoid RBAC (role-based access control), as it is not fine-grained enough. We would like to give the service principal permissions for only what it needs to do its job. Separation of responsibilities is important, i.e. it can read data but cannot overwrite the data it has read.

Q29: Does text processing support all languages? Can my data be in any language? What are the implications of using multiple languages? (EN/FR, not Scala/Python)

A: This is very much dependent on the package you're using. In Python we have NLTK and spaCy. Both support multiple languages. How well each individual language is supported I am not too sure, but they are both very popular. For Spark DataFrames there is MLlib and the John Snow Labs Spark NLP library.

Q30: How does it magically get into the lake? Surely there is some work there?

A: Yeah, there is a lot of work there. We often find this is one of the areas where customers most need support. Get in touch if you want more help here. Alternatively, we have a 3-5 day training course on this topic.