Advancing Analytics
Data Science | AI | DataOps | Engineering
backgroundGrey.png

Blog

Data Science & Data Engineering blogs

 

Thanks for reading. Here you will find a huge range of information in text, audio and video on topics such as Data Science, Data Engineering, Machine Learning Engineering, DataOps and much more. The show notes for “Data Science in Production” are also collated here.


Azure Databricks. The Blog of 60 questions. Part 4

databricks4.png

Co-written by Terry McCann & Simon Whiteley.

A few weeks ago we delivered a condensed version of our Azure Databricks course to a sold out crowd at the UK's largest data platform conference, SQLBits. The course was a condensed version of our 3-day Azure Databricks Applied Azure Databricks programme. During the course we were ask a lot of incredible questions. This blog all of those questions and a set of detailed answers. If you are looking for Accelerating your journey to Databricks, then take a look at our Databricks services.

There were over 60 questions. Some are a little duplicated, some require a lot more detail than others. 60 is too many to tackle in one blog. This will be the first of 6 blogs going in to detail on the questions. They are posted in the order they were asked. I have altered the questions to give them more context. Thank you to all those who asked questions.  

Part one. Questions 1 to 10
Part two. Questions 11 to 20
Part three. Questions 21 to 30
Part four. Questions 31 to 40
Part five. Questions 41 to 50
Part six. Questions 51 to 63

Q31: Can Power BI read Parquet yet? Or would you need to output to something like csv? Or have the cluster running and run a Spark query?

A: At this point I do not think Parquet is supported by Power BI. To use Power BI you would need to connect to it using the inbuilt Spark connector. The cluster would need to be running and it would be best to have your data partitioned and cached.

https://docs.azuredatabricks.net/user-guide/bi/power-bi.html

Q32: if you are starting with "schema first" (relational) data, surely putting it into a "schema last" (Schema-on-read) store is just adding a complexity that doesn't add value. Why not keep it in relational form as it flows into your data warehouse. Or are you anticipating the schema from the source systems is evolving over time?

A: This is a really interesting question and one that suggest the volumes of data do not need Databricks. Where we have data volumes either too complex or too large for a relational database to process in an normal time frame we need to look at a scaled out approach. If we are exporting the data from a relational system to store in a Data Lake for analysis or data  science, then we should store the schema as a json object along side the data. That way we do not need to infer from the data. If you can process all your data in a database, then do that. No point in adding additional complexity.

Q33: How about master data management? How do you do this in such a pattern? (In a data lake)

A: MDM in Azure is a bit of a sore spot. Right now I recommend that you look at open-source offerings and run one of those on an IAAS VM.

Q34: What about the part of moving data from our sources to Data Lake ? what is the best practice?

A: Use Data Factory.

Q35: Is there a reason why you don't / wouldn't add a library to handle all the mounting work for you?

A: No, other than it is another thing to manage.

Q36: Can you get on premises data loaded into ADLS using Databricks from ADF for example?

A: ADF is a good way to get data into a lake, but if it's on-premises you will also need a "self-hosted integration runtime" for ADF to access the data on-premises, ADF could use an SFTP server as a source and the lake as a sink for example, I would use ADF to grab the data, then possibly use something like Event Grid to watch for files to trigger a pipeline in ADF to clean the data

Q37: I always wondered why companies decide for "data vault" modelling in the relational DB if they can do rapid prototyping based on their data lake. Does building rapid prototypes based on the data lake replace a concept like "data vault" ? What is your opinion ?

A: I think it does, but I have never liked Data Vault. With big data we do not want join. Data Vault is all about joins. Now I know of all the comments I make in all 60 questions, this will be the one which gets the biggest kick-back!

Q38: When reading JSON files, can you pull that JSON schema out of a file?

A: Yes. Add the options "InferSchema" = "True"

Q39: Can you redirect failed rows to a different dataset? How about text values that are effectively nulls, eg. "n/a" etc.

A: Yes there are additional options you can set when importing data. You have the option yo choose from PERMISSIVE, DROPMALFORMED and FAILFAST.

https://docs.databricks.com/spark/latest/data-sources/read-csv.html#verify-correctness-of-the-data

Q40: Can you pass parameters into a Databricks workbook from Azure Data Factory?

A: Yes, here is a good example of how you do this in general, then it would as simple as setting up a parameter in Azure Data Factory.

https://forums.databricks.com/questions/176/how-do-i-pass-argumentsvariables-to-notebooks.html