Are you suffering from poor data quality? Find out how we are helping an insurance company on their data quality journey using DQX, Databricks' new, free-to-use Data Quality Framework! It is open-source, written in Python, and comes with a host of powerful features, including automatic PII detection, data profiling, and the ability to run various data quality checks against different datasets.
Most of the examples in the current DQX demo notebooks use rule configurations stored in YAML or JSON format. In this blog, I will focus on the tabular storage approach, drawing on a real insurance case study that is being implemented as part of a data quality migration project.
I will also share practical recommendations for using the DQX framework effectively, including key considerations, design choices, and best practices to keep in mind.
If you are planning or starting a data quality project, I suggest you check out the blog from my colleague Craig Porteous, which highlights key considerations when embarking on your data quality journey.
The insurance company we worked with wanted to migrate their existing data quality platform to Databricks. This, of course, required a good understanding of their current as-is platform. Once we had that understanding, we felt that DQX was a perfect fit: it offered native integration with Databricks without much work, gave us all the features we were looking for as part of the migration, and met the future capabilities the business was looking for. Additionally, in their as-is world, running all the data quality rules could take hours to days! DQX leverages Apache Spark, so we were confident that we could run the rules much faster in the to-be world.
Throughout the project, some key decisions were made that would shape how the new to-be world behaves. Below are some key recommendations you may want to consider when using the DQX framework with a tabular approach to storing the data quality rules, so that you can run different rules against different datasets through a common parameterised notebook.
Before diving deeper, I recommend reading the blog that my colleague Aaron and I wrote earlier this year. It should give you a good idea of two key features:
If you want to get started with DQX, I recommend looking at the official GitHub documentation along with the demo notebooks. Personally, my favourite was the quick start demo notebook, as it gives a quick, comprehensive introduction to executing various data quality rules against a sample dataset (in YAML configuration). If you need a thorough end-to-end example, including profiling and generating more complex custom rules, I recommend having a look at the manufacturing demo.
If you currently have an existing data quality framework or platform, one key aspect to consider is how you will migrate the existing rules to a DQX-compatible format. DQX offers various built-in check functions that can be utilised. On top of that, DQX supports SQL expression quality checks, e.g. 'FirstName is not null', which can be powerful if your existing rules are already in a SQL-like syntax. Additionally, you can create your own custom rules using Python functions.
If your current rules are expressed in a SQL-like syntax, you could utilise DQX's `sql_expression` function to migrate your existing rules with minimal effort. The caveat is that the SQL syntax must be Databricks compatible. Note that sql_expression checks are currently evaluated at runtime only. At the time of implementation, this meant that if a rule was expressed in invalid SQL syntax, e.g. containing a stray comma:
column1 >, 100
this failed the entire DQX evaluation process. I raised a feature request with the DQX team on GitHub to flag this as part of the DQX output rather than causing a complete failure; I have been given confirmation that this feature is now implemented as part of the latest version, v0.10.0.
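For reference, here is a minimal sketch of how a valid SQL-style rule looks once expressed as DQX check metadata (the column, threshold and message below are illustrative, not taken from the customer's rules):

```python
# Sketch of a sql_expression check in DQX's metadata format.
# The expression and message values are illustrative only.
valid_check = {
    "name": "column1_above_threshold",
    "criticality": "error",
    "check": {
        "function": "sql_expression",
        "arguments": {
            "expression": "column1 > 100",  # must be valid Databricks SQL
            "msg": "column1 must be greater than 100",
        },
    },
}
```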
If your current rules are not easily transferable, you may need to assess whether the built-in functions will meet your needs. If not, you have the option to create your own custom rules. I recommend having a look at the manufacturing example to see how custom functions can be utilised for your scenario.
A key consideration is whether you want to run DQX as part of your existing ETL process or as a completely standalone process. DQX is suitable for either. For the insurance customer, the decision was to develop a standalone process outside of the existing ETL pipelines. An example of a standalone process:
The diagram above shows an example end-to-end process for running DQX across various business domains.
Let's walk through a scenario. Assume we have two business domains, Customers and Finance, and that all the rules have already been added to the rules table. To run the process, we add two records to the orchestration table.
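For example, the two records could look something like this (the dataset locations are illustrative, and the metadata columns are omitted for brevity):

business_domain = 'Customers', dataset_location = 'workspace.silver.customers', _is_active = true
business_domain = 'Finance', dataset_location = 'workspace.silver.finance_transactions', _is_active = true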
Note that we also have metadata columns in the orchestration table, including an _is_active column that gives us the ability to "switch off" business domains where the DQX process does not need to run. During the common notebook run, the business_domain and dataset_location values are passed as parameters to the notebook.
In the common parameterised notebook, to find the rules related to a single business domain only, we can construct a Python string that concatenates the business_domain and the literal string 'true' (e.g. 'Customers-true'). This can be used as the run_config_name. When loading the checks from the rules view, we can then filter down to the applicable rules, for example:
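Here is a minimal sketch, assuming the rules view defined later in this post and that the notebook parameters arrive as widgets (the widget names are illustrative). It builds the run_config_name and converts each view row into the metadata dictionary format DQX expects:

```python
from pyspark.sql import functions as F

# Meant to run inside a Databricks notebook, where dbutils and spark are available.
# Notebook parameters passed in from the orchestration table (widget names are illustrative).
business_domain = dbutils.widgets.get("business_domain")
dataset_location = dbutils.widgets.get("dataset_location")

# Only active rules for this domain, e.g. 'Customers-true'
run_config_name = f"{business_domain}-true"

rules_df = (
    spark.table("workspace.config.vw_rules")
         .filter(F.col("run_config_name") == run_config_name)
)

# Convert each row into the metadata dictionary format expected by DQX
checks = []
for row in rules_df.collect():
    check = row["check"].asDict(recursive=True)
    if not check.get("for_each_column"):
        check.pop("for_each_column", None)  # drop the key when the array is empty
    checks.append(
        {
            "name": row["name"],
            "criticality": row["criticality"],
            "check": check,
            # the filter and user_metadata columns could be carried across in the same way
        }
    )
```

DQX also provides its own methods for storing and loading checks, so check the documentation for the version you are using before hand-rolling this.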
Next, the DQX engine can be run. As the input dataset is passed to the notebook as a parameter (the dataset_location column from the orchestration table), you can kick off the DQX engine against it.
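Continuing the sketch above (the DQEngine usage follows the patterns in the DQX documentation, but double-check the API of the version you are on):

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# Read the input dataset referenced by the orchestration table
input_df = spark.read.table(dataset_location)

# Apply the checks loaded from the rules view; issues are flagged in
# result columns rather than rows being dropped
checked_df = dq_engine.apply_checks_by_metadata(input_df, checks)
```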
There are various ways you can utilise the output, e.g. quarantine bad records, or simply flag the issues in the _errors and _warnings columns and save the result as a table. I recommend persisting the results so you can post-process them later. If the input dataset is a Databricks view with complex underlying logic, saving the DQX results before any post-processing is strongly recommended.
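Both options can be sketched as follows (the target table names are purely illustrative):

```python
# Option 1: keep everything in one table, with issues flagged in the
# _errors / _warnings result columns (illustrative table name)
checked_df.write.mode("overwrite").saveAsTable("workspace.dq_results.customers_checked")

# Option 2: split the data and quarantine records that fail checks
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
valid_df.write.mode("overwrite").saveAsTable("workspace.silver.customers_valid")
quarantine_df.write.mode("overwrite").saveAsTable("workspace.dq_results.customers_quarantine")
```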
Also worth noting: during the implementation, if a rule referenced a column that did not exist in the input dataset, the DQX process failed entirely. This should now be fixed in v0.10.0.
DQX provides various ways to store the data quality rules, including JSON, YAML, programmatic, and table storage options. When choosing how to store the rules, some of the questions you should ask are:
The answers to some of these questions may lead you to choose a particular storage option. For the insurance customer implementation, a tabular storage approach was used. An introduction to the tabular approach can be found in the official GitHub documentation (you may need to expand the section 'Saving quality rules in a Delta table without using DQX methods').
When using the table storage approach, I preferred to have a table, and then a view referencing the table, which is effectively the input to the DQX engine:
The code snippet below shows an example of how the rules table could be created. The metadata columns are good to have for audit purposes. I recommend having an '_is_active' (or equivalent) column so that rules can be switched off if they are not needed as part of the DQX run.
In my table, rather than having a single 'check' column for the DQX engine, I have split it into three columns (function, for_each_column and arguments) for readability. This also helps when inserting or updating rules: for example, if only a rule's syntax needs to change, you can update the arguments column alone, as sketched below.
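A minimal sketch of such an update, run from Python via Spark SQL (the rule name and new limits are illustrative):

```python
# Adjust only the arguments of an existing rule, leaving everything else untouched
# (rule name and limits are illustrative).
spark.sql("""
    UPDATE workspace.config.rules
    SET arguments = map('column', 'age', 'min_limit', '21', 'max_limit', '120'),
        _modified_timestamp = current_timestamp(),
        _modified_by = 'example'
    WHERE name = 'CustomerAgeThreshold'
""")
```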
The view effectively creates a struct from these three columns and filters the table to rules marked as active (_is_active = true). Additionally, in the view, the run_config_name concatenates the business domain and the _is_active flag so it can later be used to filter only the rules applicable to a particular business domain (e.g. Customers).
Table Creation:
```sql
CREATE TABLE workspace.config.rules (
  id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
  business_domain STRING,
  name STRING,
  criticality STRING,
  `function` STRING,
  for_each_column ARRAY<STRING>,
  arguments MAP<STRING, STRING>,
  filter STRING,
  user_metadata MAP<STRING, STRING>,
  -- Metadata columns:
  _created_timestamp TIMESTAMP,
  _modified_timestamp TIMESTAMP,
  _created_by STRING,
  _modified_by STRING,
  _is_active BOOLEAN
) USING DELTA;
```
View Creation:
```sql
CREATE VIEW workspace.config.vw_rules AS (
  SELECT
    id,
    name,
    criticality,
    named_struct(
      'function', function,
      'for_each_column', for_each_column,
      'arguments', arguments
    ) AS check,
    filter,
    concat_ws('-', business_domain, `_is_active`) AS run_config_name,
    user_metadata
  FROM workspace.config.rules
  WHERE `_is_active` = true
);
```

Below is an example of an INSERT statement that adds a rule to the rules table. Depending on how rules should be inserted, you could automate this functionality, for instance if new rules are to be added as part of a process or via a user interface (e.g. Databricks Apps).
```sql
INSERT INTO workspace.config.rules (
  business_domain,
  name,
  criticality,
  function,
  for_each_column,
  arguments,
  filter,
  user_metadata,
  -- Metadata columns:
  _created_timestamp,
  _modified_timestamp,
  _created_by,
  _modified_by,
  _is_active
)
VALUES (
  'Customers',
  'CustomerAgeThreshold',
  'warn',
  'is_in_range',
  array(),
  map('column', 'age', 'min_limit', '18', 'max_limit', '120'),
  null,
  map('check_owner', 'dummy@email.com'),
  -- Metadata columns:
  current_timestamp(),
  current_timestamp(),
  'example',
  'example',
  true
);
```

When querying the view, the following output is visible:
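Based on the insert above, the row returned by the view looks roughly like this:

name: CustomerAgeThreshold
criticality: warn
check: {function: is_in_range, for_each_column: [], arguments: {column: age, min_limit: 18, max_limit: 120}}
filter: null
run_config_name: Customers-true
user_metadata: {check_owner: dummy@email.com}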
With the above approach, each row represents a rule. You can mix and match rule types, including built-in functions (e.g. is_in_range), sql_expression and even custom rules; the function column in the rules table just needs to store the correct function name.
A few recommendations when storing rules in a table:
For example, for a sql_expression rule, the arguments column can store both the expression and a message:

```sql
map(
  'expression', 'Country in ("Germany", "France", "Italy")',
  'msg', 'Country is not in defined list'
) -- arguments column
```

For the insurance customer, the next steps are to build a data model for Power BI (their reporting tool). However, if you are currently looking to explore data quality dashboards, I would recommend having a look at the reports that DQX already comes with (Databricks dashboards).
If your plan is to quarantine data (i.e. not allow it to flow downstream, e.g. from silver to gold), you may need to establish a process for how data can be rectified (e.g. at source or at the current layer) and then allowed to flow downstream. In certain situations, quarantining data may not be suitable. For example, with financial transactions, missing or delayed values such as amounts, balances, or timestamps can lead to inaccurate reporting that compromises the integrity of financial statements.
This blog highlights a case study of how Databricks' Data Quality Framework (DQX) is being used at an insurance company. Using the DQX pattern highlighted in this blog, you will be able to build a scalable data quality solution. DQX is a great way to save time, reduce costs, and open the door to future use cases. It is open-source, free, and integrates easily with your existing Databricks ecosystem.
Even if you have an existing data quality platform, migrating your rules to DQX can simplify operations and improve scalability. Personally, I find storing rules in a tabular format makes them easier to manage and update. Adding an audit trail to track rule changes, e.g. capturing who made the change, when it happened, and why, can significantly improve transparency and governance. Using Databricks' Unity Catalog to manage this audit process brings structure and simplicity, making it a clean and scalable solution.
If you're looking to level up your data quality processes, DQX is definitely worth exploring. Our team at Advancing Analytics can help with your data quality journey. Get in touch today to find out more!