Azure Databricks. The Blog of 60 questions. Part 5
Co-written by Terry McCann & Simon Whiteley.
A few weeks ago we delivered a condensed version of our Azure Databricks course to a sold-out crowd at the UK's largest data platform conference, SQLBits. The course was a condensed version of our 3-day Applied Azure Databricks programme. During the course we were asked a lot of incredible questions. This blog contains all of those questions and a set of detailed answers. If you are looking to accelerate your journey to Databricks, then take a look at our Databricks services.
There were over 60 questions. Some are a little duplicated, some require a lot more detail than others. 60 is too many to tackle in one blog, so this is one of six blogs going into detail on the questions. They are posted in the order they were asked. I have altered the questions to give them more context. Thank you to all those who asked questions.
Q41: Will .option("mode", "PERMISSIVE") work for Scala workloads?
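A: Yes. Read modes (PERMISSIVE, DROPMALFORMED, FAILFAST) are options on Spark's DataFrameReader, so they behave the same from Scala as from Python. PERMISSIVE is in fact the default: malformed records are kept and routed to the corrupt-record column rather than failing the job. A minimal Scala sketch; the path, schema and column names are assumptions for illustration:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._

    // Supplying a schema with a _corrupt_record column lets PERMISSIVE
    // mode park malformed rows there instead of failing the read.
    val schema = new StructType()
      .add("id", IntegerType)
      .add("name", StringType)
      .add("_corrupt_record", StringType)

    val df = spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .json("/mnt/raw/events.json")

    // Cache before inspecting the corrupt column; Spark disallows queries
    // that reference only the internal corrupt-record column on a raw read.
    df.cache()
    df.filter(col("_corrupt_record").isNotNull).show()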
Q42: Is there a standard design / dev pattern on how to work with metadata (extracting, updating, reusing)?
A: I would love to say yes and here it is, but no. There is no accepted framework. It is worth checking ISO for the various metadata standards that apply to your industry.
Q43: Does a JSON file with an array on the root level qualify as the 'massive data file' which cannot be handled multithreaded?
A: This relates to how Spark can process and chunk up a large JSON file. If your JSON file is line-delimited, with a carriage return and line feed between documents, then Spark can split it and process the chunks in parallel. If it is one huge document, such as a single array at the root, it cannot be split and will be read on a single thread. It is worth a try to see what you can do to optimise it.
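A short Scala sketch of the two cases; the paths are assumptions for illustration:

    // One JSON document per line: the file is splittable, so Spark
    // can divide the read across many tasks.
    val lineDelimited = spark.read.json("/mnt/raw/events.jsonl")

    // A single document with an array at the root needs multiLine mode,
    // and the whole file is then parsed by a single task.
    val singleDoc = spark.read
      .option("multiLine", "true")
      .json("/mnt/raw/big_array.json")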
Q44: Is there a magic figure of rows for Parquet to compress into a compressed row group (like in SQL Server, where it is 1M)?
A: This question compares Parquet's row group behaviour to columnstore compression in SQL Server. There is no magic row count: Parquet sizes its row groups by bytes rather than rows, with a default block size of 128 MB, so the number of rows per row group depends on the width and compressibility of your data.
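If you want to experiment with row group sizing, it can be set in bytes on the Hadoop configuration before writing. A sketch, assuming df is an existing DataFrame and the output path is illustrative:

    // parquet.block.size controls the target row group size in bytes
    // (128 MB here, which is also the Parquet default).
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", 128 * 1024 * 1024)

    df.write.parquet("/mnt/curated/sales")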
Q45: Is it reasonable to be swapping back and forth many times between languages in a notebook script?
A: Covered in another question. Yes, it works, since every cell runs against the same Spark session, but try to keep the switching to a minimum.
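In a Databricks notebook, the usual way to hand data between languages is a temporary view, since each language switch is a separate cell. A small sketch, with the view name assumed for illustration:

    %scala
    // Register the DataFrame so other languages in the notebook can see it.
    val df = spark.range(5).toDF("id")
    df.createOrReplaceTempView("numbers")

    %sql
    -- A later cell can switch language and query the same session-scoped view.
    SELECT * FROM numbers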
Q46: All of these languages, why not C#, and don't say it is not academic or that it is slow...
A: It is going to be supported. .NET for Apache Spark is in development: https://github.com/dotnet/spark
Q47: Is there a risk that Databricks allows cost savings by reducing expensive/niche data science resource that spend 70/80% of their time data wrangling, only to replace it with expensive/niche data engineers that need to be proficient/efficient in several languages to be able to maintain the Databricks/pipeline estate?
A: This is a valid argument, but moving that wrangling work from a scarce, expensive resource to an automated process allows the Data Science team to work more effectively and creates a better return on investment.
Q48: Not sure if I missed something. SQL Data Warehouse is Azure only right?
A: Sort of. Yes, it is Azure-only, but it is based on PDW (Parallel Data Warehouse), which is available on-premises.
Q49: Azure DWH vs. Databricks - when would you choose Databricks over Azure DWH (PolyBase / in-memory / language support etc.)?
A: Too big a question to answer here. When I have more time I will come back to this with a full answer. For now:
Languages: if the team only knows SQL, then use ASDW.
If you want to do Machine Learning, then Databricks.
Finer-grained cost management? Databricks.
Q50: Don't ADF Data Flows go against the whole Extract-Load-Transform (ELT) pattern that everything else in the MS Azure ecosystem is built around?
A: No. Although you are building what looks like an SSIS data flow, this compiles down to Spark code and runs as a distributed job where the data lives. This is still ELT.
Data Platform Microsoft MVP & Voice of Data Science in Production. You can follow Terry on Twitter @SQLShark, where he frequently discusses Data Science in Production.