Advancing Analytics
Data Science | AI | DataOps | Engineering
backgroundGrey.png

Blog

Data Science & Data Engineering blogs

 

Thanks for reading. Here you will find a huge range of information in text, audio and video on topics such as Data Science, Data Engineering, Machine Learning Engineering, DataOps and much more. The show notes for “Data Science in Production” are also collated here.


Getting in to Business Intelligence in 2019

A good friend of mine asked me a really interesting question. "How do you get started in business intelligence". In January you might be asking a similar question. How you transition in to a new role, where should you look for good resources? I want to use this blog to answer that question with a wider stance, looking at the skills you need for BI in 2019. I hope you do not mind me sharing this!

Business Intelligence today is vastly different to what it was when I started working in BI 11 years ago. In 2008, if you were a Microsoft shop, then the tech stack choice was easy. You were looking at Microsoft SQL Server 2008 as a storage engine, SQL Server Integration Services for traditional ETL (Extract-Transform-Load), SQL Server Analysis Services (multi-dimensional - No tabular then) for a semantic layer with SQL Server Reporting Services as the presentation layer. At the time these 4 components were incredibly daunting, how could anyone master any of these skills let alone them all. On top of SQL, SSIS, SSRS and SSAS you may have used a few other tools, Excel, other databases, Business Objects etc. In 2008, the data I was working with looked huge, but in all honesty it was so small compared with the amount of data we process today. Big Data processing was still quite new at this point, the Google MapReduce paper was published 4 years prior and Hadoop was still in version 1 with Doug Cutting working at Yahoo.

The reason I wanted to talk about BI then and BI now, is that BI has changed a lot in just over 10 years. There are many reasons for this, here is a list of the key ones I have seen:

  • Business becoming more data aware - If you look back at all the companies who are no longer trading, most admitted it was in part due to not staying ahead with their business. Gartner has been saying for years, Businesses who do not embrace data analytics will not exist in 10 years - they said that almost 10 years ago, and some businesses have not listened. We live in a data-driven economy.

  • Democratisation of Big Data - In 2008 if you wanted to work with Hadoop and MapReduce you had to write MapReduce jobs in Java by hard - Hive and Pig were in their infancies. Now if you want to work with Big Data it is as simple as starting a Databricks cluster or running an Azure Data Lake Analytics job. You do not need to know how to write complex MapReduce Jobs, or how to write code so it is cluster aware. Big Data processing for a large part has been democratised and made accessible by the masses.

  • "Data Science is the sexiest job of the 21 century" - Business intelligence is morphing in to Data science. A lot of people say that most data science questions and problems could be solved with a data warehouse, and for some teams that is true. I will write about this at some point. A lot of Data Science is prepping data for exploration and exploitation. Machine Learning is changing how we view the world. A lot of business intelligence developers are switching to Data Science.

  • The Cloud - This is a big one! The Cloud has changed most of IT. In 2008 and I was running a data warehouse on a 3GB Server with 2 cores. It was underpowered then. Moving to the cloud changed this massively. At first it was a lift and shift to Infrastructure-As-A-Service IAAS, running the server in the cloud, then Platform-As-A-Service became available - Azure SQL Database for example, then Software-As-A-Service - Azure Databricks. The Cloud made it simpler, cheaper and less risky to build data intensive processes.

In summary, businesses being more aware of the value of their data, easier to use Big Data tooling, data science and the cloud have had a radical impact on what is Business Intelligence in 2019. It is harder to break in to Business Intelligence than ever before. But do no be deteried! While it is harder, it is not impossible. There is a huge demand for people who can look at data and make a business decision.

What is the tech stack for a Microsoft BI developer in 2019?

I am making an assumption that you're working for a company who are looking to or are already in Azure. The cloud is not just someone else's server. Working with BI in the cloud is very different to a single box. The traditional ETL approach does not scale. ELT is more appropriate. Extract-Load-Transform. We take data from our source systems either in batch or consumed as a stream, and we load it somewhere. We have a lot of data, too much to do all the transformations on a single machine in-memory. We need to land the data and push the computation to the data where it lives. This is MapReduce. Map a function across your data, execute that function then reduce the results.

I Mentioned that when I started there was only really SQL Server, well there is a lot more you need to know about in 2019.

  • Azure Data Factory - When you want to move data or orchestrate data, this is the tool you’re looking for. With Data Factory Flow (in preview at time of writing) you gain the ability to do more. Flow gives you data flow transformations. What makes this even better is that it build scala in the background is executed on Azure Databricks. So these transformations are at big data scale.

  • Azure Stream Analytics - If you have data collecting in an event hub and you want to push that data to multiple sources, then Stream Analytics is a good option. Stream Analytics sort of works a little like Apache Kafka. A stream of data comes in. and then that stream can be combined with static data and pushed off to a variety of sources. This is a major component in the Real-Time Power BI.

  • Azure Databricks - Heard of Apache Spark? MapReduce? Big Data? Well this is the option for Big Data Processing in Azure. Spark is open-source, so you could if you wanted, grab a load of machines and run your own big data cluster in your basement. The problem with that, it it is hard! You need to manage all the machines, patch them, manage the disks, pay for them, pay for the cooling etc. Azure Databricks is a software-as-a-service offering of Apache Spark, made by the creators of Apache Spark. If you have big data, then you want to be looking at this.

  • Azure Data Lake Store - We need somewhere to store the data we are collecting. Blob storage is an option so is Data Lake Store. Blob is great for some use cases, Data Lake is great for others. The biggest one is security. Blog can manage security through keys and SAS tokens. It struggles to offer fine grain security which is linked with Azure Active directory. Data Lake Store does this. Does it do it well? Not always. The security object (ACL) is very small and assigning permissions to a large set of files is done on a file by file basis. This is painful, so setting up the right security need to be thought of at the start.

  • Azure Data Lake Analytics - Data Lake Analytics is an interesting one. Data Lake Analytics has fallen out of favour in the last 6 months. Data Lake Analytics enabled an easy way to write MapReduce jobs across large amounts of data in Data Lake Store or Blob storage. The language USQL is a combination of SQL and C#. This made developing extensions for the language quite nice and easy. My major gripes with it was the cost and how it felt a lot like traditional disk based MapReduce. What I was wanting was the in-memory power of Apache Spark. When Databricks arrived, Data Lake Analytics ceased to be an option for me. But, it may be an option for you. For now, it is not a recommendation I make to my customers.

  • Azure Blob Storage - Do you need to land a load of data in the cheapest way possible, then Blob is right for you. If you need fine grain permissions the Data Lake Store is a better option. How do you decide which option is best? It is all about the non-functional requirements - More on that in another blog.

  • Azure Cosmos DB - I love Cosmos DB. This thing is versatile! I often talk about Polyglot Persistence a term coined by Martin Fowler. Basically, we want to use the right data store for the problem we are facing. Do not just put everything in a relational database, choose the right store. Do you have a lot of JSON documents, then use a document DB, lots of Key-Value pairs, then a Key-Value store, Are you looking for the relationships between objects, then a Graph database would be a good option. Azure Cosmos DB has it all. It typically forms a major component in most of our architectures. I am in love with Gremlin and find new and interesting ways to use it. If you need to store JSON documents or look at highly connected data, then Cosmos DB is for you. It is also Platform-as-a-service, so you do not need to worry about setting and installing NOSQL tools.

  • Event Hubs / Service Bus - When you need an event driven architecture in Azure, you will most likely need to have one of the following. An Event Hub or a Service Bus. Sometimes you might want an IoT hub too. These tools allow you to send a message to a queue and do a series of actions based on that message. These tools can be as simple or as complicated as you like. I use Service bus topics a lot to pass messages around data intensive applications. It works a charm. If you need a message driven architecture then Event Hubs or Service bus is what you need.

  • Logic Apps - Knowing how these tools all work is one thing, getting them all to work together is very different. A simple option for this is with a Logic App. Logic Apps are a simple service to orchestrate a lot of components together. This could be triggered in a number of different ways, on a wall clock, ad-hoc or message driven. Hooked in with Service Bus and Event Hubs, this is a great way to create very generic data processing pipelines.

  • Azure PAAS Database / Managed Instance - We work with data, you will need a database. In Azure there are so many different options for running SQL Server. I have included a few of these below. One of those options is the Platform-as-a-service version of SQL Server. The obvious question that comes with this option is can i use it for a data warehouse? Yes you could, but you should not. Why not? Well it is not designed for that kind of workload. PAAS SQL DB is designed for Online-Transactional-Processing (OLTP). OLTP workloads are typically singleton updates and inserts. Online-analytical-processing OLAP is quite different. OLAP typically looks at large volumes of data which is aggregated and subsetted. If you have a small data warehouse this might be an option. If not, then you may want to look at something with a bit more kick.

  • Azure SQL Data Warehouse - You might not remember Parallel Data Warehouse or the Analytics Platform System (PDW & APS). PDW was a large component tool from Microsoft for parallel data warehouse processing. It was expensive, very expensive. As a result, the uptake was not huge. What PDW did, it did pretty well. Azure sQL Data warehouse is PDW in the cloud. Under the hood, Azure SQL Data Warehouse is 60 instance of SQL Server working together to process your data in parallel. If you have big data and you want to query huge volumes of data in a SQL Database, that Azure SQL Data Warehouse is for you.

  • Azure Analysis Services - In most data warehouse scenarios, you need a semantic layer. This is what Azure Analysis services offers you. Azure SQL Data Warehouse is great at crunching through large volumes of data, can you point Power BI at it and do interactive queries? You can, but you should not. Azure sQL Data Warehouse has issue with concurrency, I say issues, these are design decisions. It is not intended for lots of small queries. For that type of workload you need to select the best tool for the job. Azure Analysis Services is the tabular SQL Server Analysis Services engine, but made available as a Platform-as-a-service offering. It is great when you have lots of data and you want to enable interactive reporting with Power BI.

  • Power BI - I will keep this one short, as I am sure everyone knows what Power BI is. Power BI is an interactive data visualisation tool It is great. You need to learn how to use it. Knowing how to use a tool like Power BI is only part of the solution to making compelling data visualisations. To really get your point across you need to understand how we consume data. Read the books by Stephen Few, Edward Tufte, Alberto Cairo and others. Learn about cognitive processing and data-ink ratios. It will make your dashboard more compelling.

  • HDInsight - HDInsight is another tool which has fallen a little out of favour in the last 6 months. HDInsight was aligned to the HortonWorks distribution. Everytime that was updated, HDInsight would also get an update. HDI supports Spark, Storm, Hive, Hadoop, Kafka and more.

  • Azure Machine Learning Studio - This is for getting started in Data Science and Machine Learning. I do not recommend it for production. Use it to learn more about machine learning then move to Python. It works and what is does it does well. My problem is that it is too limited.

  • Azure Machine Learning Services - If Azure Machine learning Studio is too limited, what is the alternative?

That is a lot to get your head around. In the cloud there is no one tool that does it all. We need to start thinking in a Polyglot way - select the right tool for the problem we are facing. Each one of these tools forms a part of an architecture. The toll selection is based on the functional and non-functional requirements (cost, latency, IO, security etc). Getting this right is hard.

So where should you start? Take the skills you have already and map them to the list above. Start with the problem you have. If that problem is getting data, then look in to Data Factory, once you have data you need to store it somewhere, so look in to Blob, Data Lake, ComosDB, PAAS SQL DB. Start small and build up to something much bigger. 

How much Machine Learning do you need to know?
This depends on how much additional value you want to bring to the business you work for, and to an extent how much more money you want to make. Data Science skills are hugely in demand, and that demand is only increasing. There are a lot of similarities between Data Science and Business Intelligence, but I like to think that BI analyses what has happened and data science looks for what will happen.

How much Big Data processing do you need to know?
If you do not have big data needs today, I am confident you will in the future. Start upskilling here too. Look at Azure Databricks. Spark is a platform for working with huge volumes of data in a parallel way. You will want to do this at some point in the future. Why else should you look at Apache Spark? I have mentioned this in videos and in other blogs, knowing Spark will make you more money. It is that simple.

How much Cloud processing do you need to know?
Loads of it. Simple answer to this one. Look at that list again (above). The cloud makes BI faster to get started and harder to achieve. Knowing what each tool is and how it works is one thing, knowing how they interact together is something very different. This is where you need to be aiming for.

Is Kimball Data Warehousing still relevant?
Yes. There are alternatives, but if you're building a decision support system, the old methods are still good. For real time processing and real-time reporting you may want to look at Lambda or Kappa architectures.

If you have a question, please add a comment and I will try to answer it and add it to this blog.