
The Power of VertiParquet & DirectLake in Fabric

Compression and query performance are two pain points at the core of data engineering work. We must ensure that large datasets are compressed efficiently to save storage space and reduce costs, while also maintaining strong query performance. This can often be a hard balance to achieve and is a constant fight between storage and compute. We’ve all been there! This is where Microsoft are positioning some of the many benefits of Fabric and OneLake, by utilising the VertiPaq engine and Delta:

  • Impressive compression rates

  • Quick query performance

  • Queries run directly on compressed data in the lake

  • Ordered data by default

Let’s see how this all works under the hood. And you’ve been warned - we’re going to go deep.

So what is VertiPaq?

The VertiPaq engine is a powerful and efficient in-memory columnar compression engine developed by Microsoft. It is primarily used under the hood in Power BI. The engine is designed to optimise storage and query processing for analytical workloads by leveraging advanced compression techniques and columnar storage. VertiPaq excels in achieving high data compression ratios, reducing memory footprint, and improving query performance. It efficiently stores and compresses data in memory, allowing for fast query execution by selectively decompressing only the necessary columns and portions of data required to satisfy the query.

The VertiPaq engine cleverly uses columnar storage for efficient querying and processing. It employs multiple compression techniques, including Run-Length Encoding (RLE) and Dictionary Encoding, to minimise storage space. Through finding optimal sort orders and value encoding, it achieves maximum space efficiency and performance. VertiPaq also utilises 'In-Memory Column Store' for fast query performance, 'Predicate Pushdown' to eliminate unnecessary data at query time, and 'Block Decompression' to only decompress relevant data blocks, making it a powerhouse for data management and retrieval.
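To make the idea concrete, here is a toy Python sketch (not the VertiPaq implementation itself) of why finding a good sort order matters so much for Run-Length Encoding: the same column collapses into far fewer runs once identical values sit next to each other.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# An unsorted column produces one run per value...
column = ["UK", "US", "UK", "US", "UK", "US", "UK", "US"]
print(len(run_length_encode(column)))          # 8 runs - no saving

# ...but sorting first lets RLE store just two (value, count) pairs.
print(len(run_length_encode(sorted(column))))  # 2 runs - big saving
```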

Now, because of these ingenious tricks, we wave goodbye to traditional file formats like JSON or CSV. Instead, all data stored within the managed area of Fabric and OneLake uses either Parquet or Delta. It's time to embrace these efficient, high-performing formats that bring the best out of VertiPaq's compressive power. Let's explore these further in the next section.

Parquet

Parquet is an efficient data storage format that uses 'Columnar Storage' and 'Predicate Pushdown' for optimal queries and reduced storage size. Its 'Schema Evolution' allows for backward-compatible updates, while detailed metadata and statistics enhance query performance. Parquet's diverse 'Compression' algorithms and 'Row Group and Page Organisation' allow individualised, efficient data management. Finally, it saves significant space through 'Run-Length and Dictionary Encoding'. In short, Parquet's smart features make it a prime choice for high-performing data storage. Find out more specifics here.
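As a quick way to see these features for yourself, the sketch below uses the pyarrow library (assumed available) to inspect a parquet file's row groups, per-column statistics and compression codec. The file name is just a placeholder for any parquet file you have to hand.

```python
import pyarrow.parquet as pq

# Placeholder path - point this at any parquet file you have locally.
pf = pq.ParquetFile("music_info.snappy.parquet")

# Row group and page organisation
print("Row groups:", pf.metadata.num_row_groups)

# Per-column metadata from the first row group
col = pf.metadata.row_group(0).column(0)
print("Compression codec:", col.compression)                   # e.g. SNAPPY
print("Min/max statistics:", col.statistics.min, col.statistics.max)
# These min/max stats are what enable predicate pushdown: whole row
# groups can be skipped when a filter falls outside their range.
```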

You can see an incredible overlap between the underlying Parquet file format and what the VertiPaq engine is doing. For the eagle-eyed, you may notice when pulling your data into OneLake that the file types are .snappy.parquet. Snappy is the parquet compression codec used within OneLake to offer the fastest query times. You may also notice, however, that in the UI the data type shows as Delta.

Delta is built on top of parquet, therefore all the benefits described above still apply. Delta introduces additional metadata in the form of the transaction log. All compression standards are the same. The difference is the metadata within Delta that provides ACID guarantees, which in turn offers versioning and time travel capabilities.
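The sketch below shows what that transaction log gives you in practice. It assumes a Spark session with Delta support (as you get in a Fabric or Databricks notebook) and uses a placeholder table path.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table_path = "/lake/tests/music_info"  # placeholder Delta table location

# The data files are still ordinary snappy-compressed parquet; the
# _delta_log folder alongside them holds the JSON commit history.
spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`").show(truncate=False)

# Time travel: read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
```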

Compression Comparison

Three tests were run. The first used a relatively narrow Spotify dataset. I chose this dataset to give a completely fair test between the tools: it is relatively new, so neither engine is likely to have seen it before. For both tools I created a shortcut (Fabric) or a mount (Databricks) to the Gen2 account, read the files directly into a Spark DataFrame, and wrote them back down as a Delta table.
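Something like the following was run on both platforms (paths are placeholders, not the exact notebooks used): read the raw CSV via the shortcut or mount, then write it down as a Delta table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shortcut (Fabric) or mount (Databricks) onto the ADLS Gen2 account.
source_path = "/mnt/raw/spotify/User Listening History.csv"
target_path = "/lake/tests/user_listening_history"

df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(source_path))

(df.write
   .format("delta")
   .mode("overwrite")
   .save(target_path))
```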

Dataset: Million Song Dataset + Spotify + Last.fm found here on Kaggle: https://www.kaggle.com/datasets/undefinenull/million-song-dataset-spotify-lastfm?resource=download

Two files:
  • User Listening History: 602.57MB, 3 columns, 9,711,301 records
  • Music Info: 14.99MB, 21 columns, 50,683 records

Results:

User Listening History.csv
Fabric: 71.3MB (16 files)
Databricks: 80.7MB (5 files)

Music Info.csv
Fabric: 7.8MB (4 files)
Databricks: 6.8MB (4 files)

Second test: COVID-19 by-age dataset https://health.google.com/covid-19/open-data/raw-data
By-age.csv: 1.15GB, 152 columns, 3,822,577 records

This second test used a wider dataset, which may more accurately reflect most of your organisation's data. I used the open data for COVID-19 provided by Google. The data here is wide, contains lots of NULL values and, from my initial data exploration, has a less obvious sort order than the Spotify dataset, as there appear to be lots of repeating column values.

Results:

Fabric: 183.3MB (16 files)
Databricks: 161.8MB (9 files)

Removing the age-bin columns at the end of the file and converting the types to integers:
Fabric: 195.4MB (16 files)
Databricks: 157.4MB (9 files)

Finally, to reduce the data granularity, grouping by year and month (a sketch of this aggregation follows the results):
Fabric: 24.1MB (24 files)
Databricks: 24.9MB (4 files)
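For reference, the monthly roll-up looked roughly like this. Column names are assumed from the Google COVID-19 open-data schema and may differ from the exact notebook used; covid_df is the DataFrame read from By-age.csv earlier in the test.

```python
from pyspark.sql import functions as F

# Reduce granularity: aggregate the daily rows up to year and month.
monthly = (covid_df
           .withColumn("year", F.year("date"))
           .withColumn("month", F.month("date"))
           .groupBy("location_key", "year", "month")
           .sum())

monthly.write.format("delta").mode("overwrite").save("/lake/tests/covid_by_age_monthly")
```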

Final test: Expedia Flight Prices dataset https://www.kaggle.com/datasets/dilwong/flightprices
itineraries.csv: 31.09GB, 27 columns, 5,999,739 records

Results:

Fabric: 7.03GB (232 files)
Databricks: 7.15GB (232 files)

After discussing these findings with Microsoft, they expect that the larger the dataset, the clearer the benefits of VertiParquet will become. This was observed in the final test, which had the largest dataset: Fabric compressed the files slightly more efficiently than Databricks.

Across these tests there is no clear winner between the two tools; each had its strengths. In pure data compression, we may note that Databricks, when dealing with smaller datasets, compresses data into fewer files, which as production datasets grow may produce more performant datasets for read and write operations. For the larger datasets, on the other hand, Fabric may be a good choice from the outset.

So how does it perform when reporting on these datasets? Running two tests on the compressed datasets from above, I connected to Databricks (DBX) using a SQL warehouse and to Fabric using DirectLake. See below the results of these:

PowerBI test: DBX (Spotify dataset)

PowerBI test: Fabric (Spotify dataset)

PowerBI test: DBX (COVID dataset)

PowerBI test: Fabric (COVID dataset)

When connecting to Power BI and using these tables for reporting, we observed that within the Fabric platform queries were very performant: around 95% quicker in the two tests! Such a clear difference in loading times, even on these relatively small test datasets, suggests that a real difference may be observed in a production environment.

Conclusion

In my performance tests, it was observed that Fabric, while not always matching Databricks' compression on a one-to-one basis, showcased its true strength in integration and ecosystem optimisation, with a slight edge in the largest dataset test. With Databricks, file sizes were slightly smaller than Fabric's on the smaller datasets, but potential bottlenecks emerge when attempting to report outside of the platform. With Fabric, however, once the data resides in OneLake, extra benefits become clear. The tight integration and ease of use within the ecosystem facilitate streamlined data analysis and reporting using DirectLake, even if the raw file compression benchmarks might not be at the same level. Queries run directly on the compressed data in the lake, as shown in the Power BI query result times.

While immediate compression standards may slightly favour Databricks, the true value of Fabric lies in its holistic approach, emphasising ease of use, seamless integration, and simplified licensing. As the product evolves and refines its performance characteristics, it holds the promise of not only bridging the compression gap but also delivering a comprehensive solution that brings data analysis and reporting to new heights for many users. Are you struggling with data compression issues or reporting times that appear to be less performant than expected? Contact us today to see how we can help.