Exploring Machine Learning in Microsoft Fabric: Should Data Scientists care?

Introduction

Fabric is Microsoft’s recently announced SaaS all-in-one analytics platform. It brings together Azure Data Factory, Azure Synapse Analytics and Power BI into a single cohesive platform, without the overhead of setting up, configuring and maintaining separate resources. Fabric wouldn’t be an end-to-end data analytics platform without data science, so in this blog we explore the data science and machine learning capabilities of Microsoft Fabric and assess where the platform fits in the competitive data science landscape.

To learn more about Fabric, check out our previous blog: What is Microsoft Fabric? An Introduction (Advancing Analytics)

Data Science in Fabric

In Fabric, there are three artifacts of interest to Data Scientists: Notebooks, Experiments and Models.

  • Notebooks can be written in PySpark, SparkSQL and SparkR and cover a wide range of essential ML tasks. From the start of the ML lifecycle (EDA, feature selection and hyperparameter tuning) to the end (batch inferencing and drift monitoring), notebooks underpin the entire workflow, much as they do in comparable tools such as Databricks and Azure Machine Learning.

  • Experiments are the primary unit of organisation and control for related machine learning runs. They make it simple to compare evaluation metrics across runs and let you register the best-performing model with the click of a button.

  • Models can be registered in Fabric and then appear as artifacts in the workspace, which makes evaluation and versioning for batch inferencing straightforward.

One underrated feature of Fabric is the pre-made model training notebooks. Data Scientists can explore Microsoft’s library of tutorials for popular use cases such as Fraud Detection, Customer Segmentation and Recommendations.

Comparison to Synapse Data Science

We can’t talk about Fabric without mentioning Azure Synapse, Microsoft’s previous all-in-one lakehouse platform. Synapse’s presence is still felt in the data science capabilities of Fabric through SynapseML, previously known as MMLSpark. For data science, Fabric would feel very similar to Synapse were it not for one great improvement…

MLflow!!

The key differentiator that sets Fabric apart from Synapse is the integration of MLflow. MLflow is a leading open-source platform for experiment tracking and model registry, used by data scientists and ML engineers to manage experiments and models, package code, and deploy models reproducibly. It’s hard to imagine a data science workflow without it, or an equivalent. The fact that Fabric has managed MLflow built into the platform makes it far more compelling for data scientists.

MLflow serves as the foundation for both experiments and models within Fabric, playing a significant role in their usability.
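
To make the tracking workflow concrete, here is a minimal sketch of how a notebook might record runs with the standard MLflow tracking API and then pick the best one. It assumes the `mlflow` package is available in the notebook environment. The experiment name, parameter names and the `best_run` helper are hypothetical, chosen purely for illustration, and any Fabric-specific setup is not shown.

```python
from typing import Dict, List, Optional


def log_training_run(experiment_name: str,
                     params: Dict[str, float],
                     metrics: Dict[str, float]) -> str:
    """Record one run's parameters and metrics; return the MLflow run id."""
    # Imported inside the function so the helper below works without MLflow.
    import mlflow

    # set_experiment creates the experiment if it does not exist yet.
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        return run.info.run_id


def best_run(runs: List[Dict[str, float]],
             metric: str,
             higher_is_better: bool = True) -> Optional[Dict[str, float]]:
    """Hypothetical helper: pick the run with the best value of `metric`.

    Runs that did not record the metric are ignored. Returns None when
    no run has it.
    """
    scored = [r for r in runs if metric in r]
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda r: r[metric])
```

In a notebook you might call `log_training_run("fraud-detection-demo", {"max_depth": 5}, {"auc": 0.86})` once per candidate configuration, then compare the results. The comparison done by `best_run` is the same one the Experiments view lets you do by eye before registering a model.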

Shortfalls

While Fabric aims to be the only data platform you will need, the workspace is clearly aimed at simplifying data engineering and analytics workloads, and it has no folder structure for organising resources. This can be troublesome for data science workloads, which are by their nature very experimental. Whether conducting exploratory data analysis, feature selection or hyperparameter tuning, data scientists create a lot of code and complex workflows. The ability to organise notebooks, experiments and models is therefore essential, especially for collaborative teams. We recommend a standardised naming convention to help organise workspaces better: What's in a name? Naming your Fabric artifacts (Advancing Analytics)
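
As an illustration of what enforcing a naming convention could look like, the sketch below builds and validates artifact names in Python. The convention itself (`<prefix>_<project>_<description>` in lower snake case, with the prefixes `nb`, `exp` and `mdl`) is an assumption made up for this example. It is not a Fabric requirement, and it is not necessarily the scheme described in the linked post.

```python
import re

# Hypothetical convention: <prefix>_<project>_<description>, lower snake case.
# These prefixes are illustrative assumptions, not Fabric requirements.
PREFIXES = {"notebook": "nb", "experiment": "exp", "model": "mdl"}
NAME_PATTERN = re.compile(r"^(nb|exp|mdl)_[a-z0-9]+_[a-z0-9]+(?:_[a-z0-9]+)*$")


def _slug(text: str) -> str:
    """Lower-case the text and collapse anything non-alphanumeric into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def make_name(kind: str, project: str, description: str) -> str:
    """Build an artifact name that follows the hypothetical convention."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown artifact kind: {kind!r}")
    project_slug, description_slug = _slug(project), _slug(description)
    if not project_slug or not description_slug:
        raise ValueError("project and description must contain letters or digits")
    return f"{PREFIXES[kind]}_{project_slug}_{description_slug}"


def is_valid_name(name: str) -> bool:
    """Check whether an existing artifact name follows the convention."""
    return NAME_PATTERN.fullmatch(name) is not None
```

For example, `make_name("notebook", "Churn", "Feature Selection")` yields `nb_churn_feature_selection`. A check like `is_valid_name` could be run over a list of workspace artifact names to flag those that drift from the agreed scheme.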

Additionally, Fabric is missing one major feature that is steadily becoming a must-have for ML platforms: real-time model serving. Serving real-time scoring models is a major part of the MLOps lifecycle, and many use cases require low-latency inferencing. Currently Fabric has no mechanism for this, which forces users to leave the platform and host their models with other technologies such as Kubernetes or Azure Functions. That adds complexity and goes against the idea of Fabric being a one-stop shop.
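
To give a sense of the extra work involved in hosting a model elsewhere, here is a rough, hosting-agnostic sketch of the scoring logic you would need to write and maintain yourself. Everything in it is an assumption for illustration: the `{"instances": [...]}` request shape, the `score_request` function and the stand-in `ThresholdModel`. In practice the model would be one trained and exported from Fabric, and this function would sit behind whatever HTTP layer your chosen host provides.

```python
import json
from typing import Any, Protocol, Sequence


class Predictor(Protocol):
    """Anything with a predict method that maps rows to predictions."""

    def predict(self, rows: Sequence[Sequence[float]]) -> Sequence[Any]: ...


def score_request(body: str, model: Predictor, n_features: int) -> str:
    """Validate a JSON request body and return predictions as a JSON string."""
    try:
        rows = json.loads(body)["instances"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with an 'instances' list"})
    if not isinstance(rows, list) or any(
        not isinstance(row, list) or len(row) != n_features for row in rows
    ):
        return json.dumps(
            {"error": f"each instance must be a list of {n_features} values"}
        )
    return json.dumps({"predictions": list(model.predict(rows))})


class ThresholdModel:
    """Stand-in model: flags rows whose first feature exceeds a limit."""

    def __init__(self, limit: float) -> None:
        self.limit = limit

    def predict(self, rows: Sequence[Sequence[float]]) -> Sequence[int]:
        return [int(row[0] > self.limit) for row in rows]
```

Even this small sketch has to handle input validation and error responses. A production service would also need authentication, logging, monitoring and model versioning, all of which sit outside Fabric today.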

Conclusion

It’s easy to identify the data science capabilities that are missing in Fabric, which is to be expected of a platform geared so heavily towards data pipelines and building a cohesive engineering platform. That doesn’t mean it has no place in the machine learning landscape.

For simple use cases, and for organisations early in their data science journey, Fabric is a compelling platform for Data Analysts and Data Scientists. Without spinning up any new resources, they can create, evaluate and experiment with machine learning models, with MLflow being the game changer for the platform. As Fabric nears its public release, we are excited to see what new ML updates the platform has in store for us.

If you’re looking to get started with your Fabric journey, check out our POC offering: Download Microsoft Fabric Proof of Concept Flyer (advancinganalytics.co.uk)