Advancing Analytics
Data Science | AI | DataOps | Engineering

Blog

Data Science & Data Engineering blogs

Azure Databricks. The Blog of 63 questions. Part 6


Co-written by Terry McCann & Simon Whiteley.

A few weeks ago we delivered a condensed version of our Azure Databricks course to a sold-out crowd at the UK's largest data platform conference, SQLBits. The course was a condensed version of our 3-day Applied Azure Databricks programme. During the course we were asked a lot of incredible questions. This blog collects all of those questions together with a set of detailed answers. If you are looking to accelerate your journey to Databricks, then take a look at our Databricks services.

There were over 60 questions. Some are a little duplicated, and some require a lot more detail than others. 60 is too many to tackle in one blog, so this is the last of six blogs going into detail on the questions. They are posted in the order they were asked. I have altered the questions to give them more context. Thank you to all those who asked questions.

Part one. Questions 1 to 10

Part two. Questions 11 to 20

Part three. Questions 21 to 30

Part four. Questions 31 to 40

Part five. Questions 41 to 50

Part six. Questions 51 to 63

Q51: What's the cost of running transformations in ADF vs running them on your own Databricks cluster?

A: We don't have many details on it yet. It'll be built into the ADF runtime costs eventually, the same way that "Data Movement hours" are actually using servers behind the scenes but we don't see it. Part of me suspects it'll be more expensive, because they're "doing some things for you" - but then you're not using the Databricks workspace, so maybe you won't need to pay the DBU overhead.

Q52: Do Databricks and Data Lake open any new opportunities for parallel processing on datasets? For example, is it possible to use these technologies to create a large number of new (calculated) columns on a dataset in parallel, rather than creating each new column sequentially (as you'd need to do on a database table)?

A: Yep, absolutely - you would simply line up all the transformations, call an action to write the result out to the database, and the Catalyst engine will work out the most efficient way to manage the data and perform those transformations. The engine will perform transformations together where possible - if they are all narrow transformations that share the same partitioning characteristics, it'll try to perform them in as few steps as possible.
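As a rough illustration, here's a minimal PySpark sketch (the source path and column names are made up): the withColumn calls are all lazy transformations, and nothing executes until the single write at the end, at which point Catalyst collapses them into one optimised plan.

```python
from pyspark.sql import functions as F

# Hypothetical source - in a Databricks notebook `spark` is already available.
df = spark.read.parquet("/mnt/lake/raw/sales")

# Line up several calculated columns - these are lazy transformations,
# so nothing runs yet.
df = (df
      .withColumn("net_amount", F.col("gross_amount") - F.col("tax_amount"))
      .withColumn("unit_price", F.col("gross_amount") / F.col("quantity"))
      .withColumn("order_year", F.year("order_date")))

# The single action below triggers the whole optimised plan in one pass.
df.write.mode("overwrite").parquet("/mnt/lake/curated/sales")
```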

Q53: Notebooks feel sequential but I'm guessing they're not?  Say I wanted to load a warehouse, do 20+ dimensions first and then (and only then) populate the fact, how would you do that?

A: They're sequential as far as the actions go - once you call an action on a DataFrame, Spark figures out the best way to perform the transformations you've lined up. I tend to isolate each data entity into a separate notebook - i.e. one per dimension - then use an external tool to kick them all off in parallel. For example, you can build a Data Factory pipeline that looks up a list of notebooks, then runs all of the notebooks in parallel. I definitely prefer pushing the management of orchestration and parallelism out to an external tool, rather than building "parent notebooks" that run all the other logic - it's much more visible and configurable.
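If you do want to stay inside Databricks rather than using Data Factory, a rough sketch of the dimensions-then-fact pattern might look like the following - the notebook paths are hypothetical, and as above we'd still lean towards an external orchestrator for anything non-trivial.

```python
from concurrent.futures import ThreadPoolExecutor

dimension_notebooks = [
    "/Warehouse/Dim_Customer",
    "/Warehouse/Dim_Product",
    "/Warehouse/Dim_Date",
]

def run_notebook(path):
    # dbutils.notebook.run(path, timeout_seconds, arguments)
    return dbutils.notebook.run(path, 3600, {})

# Run all of the dimension loads concurrently...
with ThreadPoolExecutor(max_workers=len(dimension_notebooks)) as pool:
    results = list(pool.map(run_notebook, dimension_notebooks))

# ...and only populate the fact once every dimension has finished.
dbutils.notebook.run("/Warehouse/Fact_Sales", 3600, {})
```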

Q54: Is there any way to stop Databricks having access to the internet?

A: You should be able to peer it with your own VNet and decide the inbound/outbound rules based on the parent VNet you're connecting to. However, be aware that there are two different levels of connectivity - the workspace itself is always connected to the internet; it's the clusters that you can fine-tune control of. Just like the Azure portal itself, I don't think there's a way of forcing communication with the Databricks portal via ExpressRoute. But yes, you can definitely firewall the running code and control what the individual clusters can see. If you do this (referred to as VNet injection) you can then also firewall the storage accounts and data lakes and have them be accessible only inside your VNet using service endpoints - it works really nicely.

Q55: If you are light on data for one classification label are there any techniques to artificially generate data to help train the model?

A: SMOTE - the Synthetic Minority Over-sampling Technique.

https://www.youtube.com/watch?v=FheTDyCwRdE
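For a feel of how that works in code, here's a minimal sketch using the imbalanced-learn library (an assumption - the answer only names the technique, not a particular implementation):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy dataset with roughly a 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# SMOTE synthesises new minority-class rows by interpolating between
# existing minority samples and their nearest neighbours.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_resampled))  # classes are now balanced
```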

Q56: What's the difference between the Spark DataFrame and pandas? i.e. why would you use one or the other? Is it that certain libraries only work with one or the other?

A: I would recommend a read of this article: https://databricks.com/blog/2015/08/12/from-pandas-to-apache-sparks-dataframe.html. Essentially, pandas is an alternate DataFrame type - it has different functions etc. hanging off it. The Spark DataFrame has brought in a lot of the things that were good about pandas. There's a little danger with a couple of the more weird & wonderful pandas operations, where they won't distribute properly or end up outside of the JVM.
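Moving between the two is straightforward, but worth doing with your eyes open - a minimal sketch (assuming a Databricks notebook where `spark` already exists):

```python
import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# pandas -> Spark: gives you a distributed DataFrame with the Spark API.
sdf = spark.createDataFrame(pdf)
sdf = sdf.filter(sdf.value > 15)

# Spark -> pandas: collects everything back onto the driver, so only do
# this once the data is small enough to fit on a single node.
result_pdf = sdf.toPandas()
```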

Q57: Is there a way to get a visual of the decision tree?

A: For anyone reading back through the comments after the session: Terry answered that yes, you can view the decision tree if you're building the model in Scala, but the Python libraries don't expose that functionality out of the box. There are a couple of specialist libraries you can use to try and replicate these features.

Further to the above, we can pull a debug string from the decision tree model, then visualise that with matplotlib / seaborn / d3: https://github.com/tristaneljed/Decision-Tree-Visualization-Spark
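As a rough sketch of pulling that debug string out of a PySpark model (the training DataFrame and column names here are illustrative):

```python
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler

# training_df is a hypothetical DataFrame with feature columns and a label.
assembler = VectorAssembler(inputCols=["feature1", "feature2"],
                            outputCol="features")
train = assembler.transform(training_df)

model = DecisionTreeClassifier(labelCol="label",
                               featuresCol="features").fit(train)

# toDebugString returns the whole tree as text (splits, thresholds,
# predictions), which you can then parse and plot with matplotlib/seaborn/d3.
print(model.toDebugString)
```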

Q58: When doing machine learning, how do you know when good is good enough in the real world? Do you wait till you run out of budget?

A: Yep, that's one of the harder parts of running data science projects. There's the temptation to keep going and going, getting diminishing returns for your efforts. Normally, we recommend doing a time-boxed study: get as close as you can within the time and budget allocated, then make a call as to whether the results are accurate enough, and estimate how much additional work would be needed to get close enough to what's "acceptable".

Q59: Does Spark have support for feather? It's usually a solid way of writing DataFrames for a number of reasons (e.g. if you don't want the compression overhead of parquet when dealing with streaming data).

A: Not that I've come across. Had a quick dig and it sounds like you can jimmy it in, by pushing your DataFrame over to pandas and saving out via the feather libraries. I suspect that's going to be awful for distribution performance though.
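For completeness, that workaround would look something like this minimal sketch (the paths are made up, and it needs pyarrow available for pandas' feather writer) - note it's a single-node write, so you lose the distributed write performance:

```python
sdf = spark.read.parquet("/mnt/lake/curated/sales")  # hypothetical source

pdf = sdf.toPandas()                        # pulls all rows onto the driver
pdf.to_feather("/dbfs/tmp/sales.feather")   # pandas feather writer via pyarrow
```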

Q60: If Databricks/Spark structured streaming is overkill for small amounts of real-time data, what's a better solution for getting data from an Event Hub and pushing it to a real-time Power BI dataset (not a streaming one, the other kind)?

A: If you've only got one or two streams, then yeah, it can be overkill. If you want to use Stream Analytics but need fine-tuned control over what's being inserted (i.e. you can't use the out-of-the-box connector for Stream Analytics), you can always push the events to a Logic App or an Azure Function and manually call the Power BI API.
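The push itself is just a REST call to the Power BI push dataset API - a minimal sketch, where the dataset ID, table name and token acquisition are all placeholders:

```python
import requests

ACCESS_TOKEN = "<aad-access-token>"   # obtained from Azure AD, not shown here
DATASET_ID = "<dataset-id>"
TABLE_NAME = "Events"

url = (f"https://api.powerbi.com/v1.0/myorg/datasets/"
       f"{DATASET_ID}/tables/{TABLE_NAME}/rows")

rows = {"rows": [{"DeviceId": "sensor-01", "Temperature": 21.4}]}

response = requests.post(url, json=rows,
                         headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
response.raise_for_status()
```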

Q61: What is the data storage format of Delta? What can be a fitting use case of using delta?

A: Delta is Parquet++ - a more optimised Parquet, with a transaction log layered on top of the Parquet files. That log is what gives you ACID transactions, updates/deletes and the ability to query older versions of the data, so it's a good fit whenever you need to merge or update data in the lake rather than just append to it.
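From the notebook side, using it is just a different format string - a minimal sketch with an illustrative path:

```python
df = spark.read.parquet("/mnt/lake/raw/sales")   # hypothetical source

# Write the DataFrame out as a Delta table (Parquet files + transaction log).
df.write.format("delta").mode("overwrite").save("/mnt/lake/delta/sales")

# Read it back just like any other format.
delta_df = spark.read.format("delta").load("/mnt/lake/delta/sales")
```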

Q62: You mentioned earlier that Databricks have contributed a lot to Apache Spark.  Have they contributed Delta?

A: We have no idea about their long-term plans, but they're currently heavily branding it as a "Databricks" feature, not part of the core Spark API. "I suspect it'll stay as a proprietary Databricks feature" is what I would have said - but nope, they have open sourced it! Delta Lake: https://databricks.com/blog/2019/04/24/open-sourcing-delta-lake.html

Q63: If I need to run a "normal" Python script - e.g. get some data from a REST API and put it into the data lake - and I want that orchestrated as part of a Data Factory pipeline, is Databricks still a decent option even though it's not going to spread the work across the cluster? Is there a better option?

A: You've got Python Azure Functions these days - that'll be a cheaper, faster way of hosting it if it's just running pure Python and doesn't need a distributed workload. That said, if you do distribute the requests across a cluster of 10 machines, they'll run in batches of 10, so it should be a lot quicker.
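If you did want to fan the REST calls out across the cluster rather than looping on the driver, a rough sketch might look like this (the API URL and list of ids are hypothetical):

```python
import requests

ids = list(range(100))   # e.g. 100 resources to pull from the API

def fetch(resource_id):
    resp = requests.get(f"https://api.example.com/items/{resource_id}")
    resp.raise_for_status()
    return resp.text

# Each partition is handled by a different executor, so with 10 workers
# roughly 10 requests run at any one time.
results = (spark.sparkContext
           .parallelize(ids, numSlices=10)
           .map(fetch)
           .collect())

# `results` could then be written out to the data lake as usual.
```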


Terry McCann

Director of Artificial Intelligence

Data Platform Microsoft MVP & Voice of Data Science in Production
You can follow Terry on Twitter @SQLShark where he is frequently discussing Data Science in Production.