Spark 3.0 Questions and answers from the Data AI Summit

At the Data + AI Summit, Simon delivered a session on “Achieving Lakehouse Models with Spark 3.0”. During the session there were a load of great questions - here are the questions and answers from Simon. Drop a comment if you were in the session and have any follow-up questions!

Question 1

“While all the Hadoop providers promoted the data lake paradigm back then, how are the industry and the other data lake providers shifting to / considering the lakehouse paradigm?”

It's a direction that most providers are heading in, albeit under the "unified analytics" or "modern warehouse" name rather than "lakehouse". Most big relational engines are moving to bring in Spark/big data capabilities, while other lake providers are looking to expand their SQL coverage. It's a bit of a race to see who gets to the "can do both sides as well as a specialist tool" point first. Will we see other tools championing it as a "lakehouse", or is that term now tied too closely to Databricks as a vendor-specific one? We'll see...

Question 2

“Instead of SCD2 we could just use dimension snapshots on the data lake. The storage is cheap, the logic is simple and it is easy to use in analytics.”

Absolutely - there are a load of different tools and approaches we can use that are more native to lakes, but they're not going to be front of mind for people coming from a SQL/warehousing background, and those people won't be able to lift & shift their existing processes over. And it's all about accessibility for those additional personas.
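
To make the snapshot idea concrete, here's a minimal PySpark sketch of that pattern - appending a full copy of the dimension on each load, partitioned by snapshot date, instead of maintaining SCD2 validity ranges. The table and column names (dim_customer, snapshot_date etc.) are purely illustrative, not from the session.

# Minimal sketch of the snapshot approach - all names are illustrative
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Today's full dimension extract
dim_customer = spark.read.table("staging.customer")

(dim_customer
    .withColumn("snapshot_date", F.current_date())
    .write
    .format("delta")
    .mode("append")
    .partitionBy("snapshot_date")
    .saveAsTable("gold.dim_customer_snapshots"))

# A point-in-time lookup is then a simple filter, rather than a
# BETWEEN valid_from AND valid_to join against an SCD2 dimension
as_of = (spark.table("gold.dim_customer_snapshots")
              .where(F.col("snapshot_date") == "2020-11-01"))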

Question 3

“What about MPP databases such as Azure Synapse Analytics (the part that used to be Azure SQL DW). Is there no need for it when you have a Lake House?”

When the lakehouse is fully matured, that's the plan - a single source of data, whether it's for engineering, data science, ad-hoc analytics or traditional BI. We're getting there, but there are still some edge cases where you would need a mature relational engine, be it for some of the security features, tooling integration etc. As the lakehouse matures, though, fewer and fewer of these edge cases remain. Fundamentally, they're serving the same purpose, so we're heading towards the point where you have just one.

Question 4

“Are you using HIVE for your metastore, or where do you keep the catalogue of all delta-tables etc.?”

Yep, we use the Hive metastore in most cases, especially when it's a Databricks-based architecture. We augment this with our own config db - just a lightweight metadata store that we use to hold the different processing paths, transformation logic etc. An element of this includes comments/descriptions that are used to augment the Hive tables with additional info.
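
As a rough illustration of that last point (not the actual implementation from the session), the snippet below registers a Delta table in the Hive metastore and then applies a description pulled from a hypothetical config table - the config.table_metadata name and its columns are assumptions.

# Illustrative only - config.table_metadata and its columns are assumed names
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# saveAsTable registers the Delta table in the Hive metastore
(spark.read.format("delta").load("/mnt/lake/silver/sales")
      .write.format("delta").mode("overwrite").saveAsTable("silver.sales"))

# Pull the description held in the lightweight config store and push it
# onto the Hive table as a table property
meta = (spark.table("config.table_metadata")
             .where("table_name = 'silver.sales'")
             .first())

if meta:
    spark.sql(
        "ALTER TABLE silver.sales "
        f"SET TBLPROPERTIES ('comment' = '{meta.description}')"
    )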

Question 5

“How do you connect the dots to do all the glue code? Are you using an orchestrator (Airflow, Dagster, etc.), and if so which one?”

We stick with fully Azure-native tooling, so we use super basic Azure Data Factory pipelines - which in themselves aren't as dynamic as Airflow, Dagster etc. But we keep it super simple: ADF checks our metadata database, gets a list of things to kick off, and fires those tasks - we deliberately stay away from any more hardcoded pipelines. We had to strike a balance between choosing something our clients already have some knowledge of and picking a tool with more functionality but a harder learning curve.
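
For illustration only, here's what that metadata-driven pattern looks like if you sketch it as a Databricks driver notebook rather than an ADF pipeline - in ADF itself it's a Lookup activity feeding a ForEach. The config.pipeline_tasks table and its columns are assumptions, not the actual metadata model.

# Illustrative sketch of metadata-driven orchestration - table/column names are assumed
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Check the metadata database for the list of things to kick off
tasks = (spark.table("config.pipeline_tasks")
              .where("enabled = true")
              .orderBy("run_order")
              .collect())

# 2. Fire each task, passing its parameters through rather than hardcoding them
for task in tasks:
    # dbutils is available inside Databricks notebooks without an import
    dbutils.notebook.run(
        task["notebook_path"],
        3600,  # timeout in seconds
        {"source": task["source_path"], "target": task["target_table"]},
    )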

Click the link to view the session
