Advancing Analytics
Data Science | AI | DataOps | Engineering


Azure Databricks. The Blog of 60 questions. Part 2


Co-written by Terry McCann & Simon Whiteley.

A few weeks ago we delivered a condensed version of our Azure Databricks course to a sold-out crowd at the UK's largest data platform conference, SQLBits. The course was a condensed version of our 3-day Applied Azure Databricks programme. During the course we were asked a lot of incredible questions. This blog collects all of those questions along with a set of detailed answers. If you are looking to accelerate your journey to Databricks, then take a look at our Databricks services.

There were over 60 questions. Some are a little duplicated, and some require a lot more detail than others. 60 is too many to tackle in one blog, so this is the second of 6 blogs going into detail on the questions. They are posted in the order they were asked. I have altered the questions to give them more context. Thank you to all those who asked questions.

Part one. Questions 1 to 10
Part two. Questions 11 to 20
Part three. Questions 21 to 30
Part four. Questions 31 to 40
Part five. Questions 41 to 50
Part six. Questions 51 to 63

Q11: Is there a way to develop in something like VS Code so you get intellisense and all the niceness?

A: Yep - you can get bits & pieces of intellisense in there, and you can also write Python/Scala in VS Code, but you'd be writing it as a standalone script rather than a notebook. As mentioned in one of the other answers, there is now Databricks Connect. It is ok. For Scala I would recommend DBConnect and building a Scala project. That way you can do all the things we should be doing, such as unit tests.

Q12: Which version of SQL does Databricks use?

A: From Spark 2.0, Spark implements ANSI 2003 syntax for SQL. https://spark.apache.org/releases/spark-release-2-0-0.html

Q13: Why does "A = transformationDFGroupBy.explain()" return results when you are just assigning a variable (thinking the Lazy nature of Spark)?

A: Sometimes Spark needs to see the data and the stats about that data to make an estimate of what will work best. That job is reading some metadata.
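
To make the lazy behaviour concrete, here is a toy model in plain Python. It is not Spark, and the `LazyFrame` class and its methods are made-up names for illustration. It does mirror one detail that holds in PySpark: `explain()` prints the plan as a side effect and returns `None`, so `A = df.explain()` shows output even though it looks like a plain assignment.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not a Spark API)."""

    def __init__(self, rows, plan=None):
        self._rows = rows
        self._plan = plan or []  # recorded transformations, not yet executed

    def filter(self, predicate, label):
        # A transformation only extends the plan; no rows are touched here.
        return LazyFrame(self._rows, self._plan + [("Filter " + label, predicate)])

    def explain(self):
        # Prints the plan as a side effect and returns None.
        for step, _ in reversed(self._plan):
            print(step)
        print("Scan rows")

    def collect(self):
        # An action: only now are the recorded steps applied to the data.
        rows = self._rows
        for _, predicate in self._plan:
            rows = [r for r in rows if predicate(r)]
        return rows


df = LazyFrame([1, 2, 3, 4]).filter(lambda x: x % 2 == 0, "x is even")

A = df.explain()       # looks like an assignment, but explain() prints the plan
result = df.collect()  # the action that actually processes the rows
```

After this runs, `A` holds `None` rather than a plan object, and the filter is only applied when `collect()` is called.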

Q14: Are there significant limits when mixing languages in a single notebook? E.g. if I have a Python notebook and create a DataFrame in a %scala magic cell, is that DataFrame visible for the subsequent steps?

A: Yes, you can create a Scala DataFrame and then reference it in Python, typically by registering it as a temporary view, since variables themselves are not shared between languages. There is a lot of stuff which will break this. Ideally do everything in Scala or everything in Python, but there are times when you will mix. We have stuff in production which uses a mix. One caveat: if you create a notebook which uses multiple languages, think about the next person that has to try to debug what you have written!

Q15: Can you use Azure Key Vault in place of Secret Scopes?

A: Yes, there is an option to do that, although there is a bit of work to set it up. This should be the preferred approach. Create a secret scope which is backed by Azure Key Vault. If you need to change the value of a secret, you change it in Key Vault and there is no need to update the secret scope. There are loads of benefits to doing this, the main one being that you will most likely have multiple workspaces, and managing secrets on each in isolation is painful. One Key Vault or many? Stick with the fewest possible. https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html#akv-ss
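
Once the Key Vault-backed scope exists, notebook code reads a secret with `dbutils.secrets.get`, the same call used for Databricks-backed scopes. A minimal sketch follows. The scope name `kv-scope`, the key `sql-password` and the helper `get_sql_password` are made up for illustration, and the fake `dbutils` object is only there so the snippet runs outside a notebook.

```python
from types import SimpleNamespace


def get_sql_password(dbutils, scope="kv-scope", key="sql-password"):
    # In a Databricks notebook, `dbutils` is provided for you and this call
    # returns the current value of the secret held in the scope.
    return dbutils.secrets.get(scope=scope, key=key)


# Local stand-in for dbutils (an assumption, for running outside Databricks).
_fake_store = {("kv-scope", "sql-password"): "not-a-real-password"}
fake_dbutils = SimpleNamespace(
    secrets=SimpleNamespace(get=lambda scope, key: _fake_store[(scope, key)])
)

password = get_sql_password(fake_dbutils)
```

Because the value is looked up by name at run time, changing the secret in Key Vault needs no change to the notebook.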

Q16: How do you manage the code in Databricks when working on it with a team in TFS or Git?

A: To start, there is no support for TFS. Git and Git-based decentralised version control systems are your only option. The ideal way of working would be to link Databricks with your Git folder of notebooks and see all the notebooks in there; however, Databricks works like another fork of your code. You create a notebook, commit it to version control and then commit updates. There is no way to map a connection and see all notebooks without importing them and linking them. Hopefully this will improve over time. As mentioned in another question, you may want to consider working in a Scala project which you can compile as a JAR. That process can all be automated with SBT and Azure DevOps if you wish.

Q17: Is Databricks available to run in a private cloud, or do you need to use a public cloud like AWS or Azure? And if so, how does it compare to the PaaS offering in, say, Azure that we are using today?

A: A simple one to answer. No, it is not. AWS and Azure are your only options at this point. But Databricks is running open-source Spark. You could set up your own Spark cluster and run it on-premises or in a private cloud, but you will not have all the rich functionality and management which Databricks offers.

Q18: Can you manage Databricks with PowerShell?

A: Officially, no. But there is a great PowerShell module created by fellow Data Platform MVP Gerhard Brueckl. What is even better is that this works on both AWS and Azure. It is free, and Gerhard is on Twitter if you want to send any questions to him. https://twitter.com/gbrueckl

https://blog.gbrueckl.at/2018/11/powershell-module-databricks-azure-aws/

Q19: If scoped secrets are REDACTED, are they also redacted in logs etc? 

A: Presumably. However, secrets are easily un-redacted.

https://docs.azuredatabricks.net/user-guide/secrets/redaction.html
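
A toy illustration of why redaction is easy to defeat, assuming redaction works by matching the literal secret value in output. The `redact` function below is a hypothetical model written for this example, not Databricks' implementation, and the secret value is made up.

```python
def redact(output, secret_values):
    # Hypothetical model: replace each literal secret value in the output.
    for value in secret_values:
        output = output.replace(value, "[REDACTED]")
    return output


secret = "hunter2"  # made-up value for illustration

# Printing the secret directly is caught by the literal match.
direct = redact("password=" + secret, [secret])

# Any transformation of the value (here, a space between characters)
# no longer matches the literal, so it passes through unredacted.
spaced = redact("password=" + " ".join(secret), [secret])

print(direct)
print(spaced)
```

Under that assumption, redaction is best treated as a guard against accidentally displaying a secret, not as a way of stopping someone who has access to the scope from reading it.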

Q20: How can you use Postman for accessing the API?

A: For those that love Postman and want to try the Databricks API: https://github.com/GaryStrange/AzureDatabricksPostmanCollection
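
If you would rather script the same calls than click through Postman, a minimal Python sketch of an authenticated request to the REST API looks something like this. The workspace URL and token are placeholders, `build_request` is a made-up helper name, and `/api/2.0/clusters/list` is used purely as an example endpoint. The request is only built here, not sent.

```python
import urllib.request

# Placeholders: substitute your own workspace URL and personal access token.
WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"


def build_request(path, token=TOKEN, base_url=WORKSPACE_URL):
    # The API authenticates with a bearer token in the Authorization header.
    return urllib.request.Request(
        url=base_url + path,
        headers={"Authorization": "Bearer " + token},
        method="GET",
    )


request = build_request("/api/2.0/clusters/list")

# To actually send it (requires a real workspace and token):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The same URL and Authorization header are what you would enter in Postman for this call.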