Introduction
Are you suffering from poor data quality? Find out how we are helping an insurance company on their data quality journey using Databricks' new, free-to-use Data Quality Framework, DQX! It is open-source, available as a Python library, and comes with a host of features, including automatic PII detection, data profiling and the ability to run various data quality checks against different datasets.
In the current DQX demo notebooks, most of the examples use rule configurations stored in YAML or JSON format. In this blog, I will focus on the tabular storage approach, drawing on a real insurance case study that is being implemented as part of a data quality migration project.
I will also share practical recommendations for using the DQX framework effectively, including key considerations, design choices, and best practices to keep in mind.
If you are planning or starting a data quality project, I suggest checking out the blog from my colleague Craig Porteous, which highlights key considerations when embarking on your data quality journey.
Customer Challenge
The insurance company we worked with wanted to migrate their existing data quality platform to Databricks. This, of course, required a good understanding of their current as-is platform. Once we had that understanding, we felt that Databricks' Data Quality Framework DQX was a perfect fit: it offered native integration with Databricks with minimal effort, gave us all the features we were looking for as part of the migration, and met the future capabilities the business was looking for. Additionally, in their as-is world, running all the data quality rules could take hours to days! DQX leverages the Apache Spark architecture, so we were confident that we could run the rules far faster in the to-be world.

Throughout the project, some key decisions were made that would affect how the new to-be world would behave. Below I will give some key recommendations that you may want to consider when using the DQX framework with a tabular approach to store the data quality rules. You can run different rules against different datasets through a common parameterised notebook.
Prerequisites
Before diving deeper, I recommend reading the blog that my colleague Aaron and I wrote earlier this year. It should give you a good idea of two key features:
- Data Profiling: Can profile your input data and generate summary statistics, which can then be used as part of the data quality checks (see the sketch after this list).
- Running Data Quality Checks: Run various rules that you have defined against an input dataset. For each row, you can see which rules were violated.
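To give a flavour of the profiling feature, here is a minimal sketch based on the DQX documentation; the table name is illustrative and the exact API may differ between DQX versions. It assumes a Databricks notebook context (where spark is available):

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator

ws = WorkspaceClient()
input_df = spark.table("workspace.bronze.customers")  # illustrative table name

# Profile the input data to produce summary statistics and column profiles.
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(input_df)

# Generate candidate data quality rules from the profiles; these can be
# reviewed and then stored, e.g. in the rules table described later in this blog.
generator = DQGenerator(ws)
candidate_checks = generator.generate_dq_rules(profiles)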
If you want to get started with DQX, I recommend looking at the official GitHub documentation along with the demo notebooks. Personally, my favourite is the quick start demo notebook, as it gives a quick, comprehensive introduction to executing various data quality rules against a sample dataset (using YAML configuration). If you need a thorough end-to-end example, including profiling and generating a more complex custom rule, I recommend having a look at the manufacturing demo.
Existing Rules
If you currently have an existing data quality framework or platform, one key aspect to consider is how you will migrate the existing rules to a DQX-compatible format. DQX offers various built-in functions that can be utilised. On top of that, DQX supports SQL expression quality checks, e.g. 'FirstName is not null', which can be powerful if your existing rules are already in a SQL-like syntax. Additionally, you can create your own custom rules using Python functions.
If your current rules are expressed in a SQL-like syntax, you could utilise DQX's `sql_expression` function to migrate your existing rules with relative ease. The caveat is that the SQL syntax must be Databricks compatible. Note that `sql_expression` checks are currently evaluated at runtime only. At the time of implementation, this meant that if a rule was expressed in invalid SQL syntax, e.g. it contained a stray comma:
column1 >, 100
this failed the entire DQX evaluation process. I raised a feature request with the DQX team on GitHub to flag such failures as part of the DQX output rather than failing the whole run; I have been given confirmation that this feature is now implemented as part of the latest version, v0.10.0.
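As an illustration of what a migrated SQL-style rule could look like in DQX's metadata format, here is a minimal, hedged sketch (the rule name and values are made up). DQX also exposes a validation helper which, at the time of writing, can be used to catch malformed checks before the run:

from databricks.labs.dqx.engine import DQEngine

# A SQL-style rule expressed as a DQX metadata check (illustrative values).
checks = [
    {
        "name": "first_name_is_not_null",
        "criticality": "error",
        "check": {
            "function": "sql_expression",
            "arguments": {
                "expression": "FirstName is not null",
                "msg": "FirstName must not be null",
            },
        },
    },
]

# Validate the checks up front rather than failing the whole run later.
status = DQEngine.validate_checks(checks)
assert not status.has_errors, status.errors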
If your current rules are not easily transferable, you may need to assess whether the built-in functions will meet your needs. If not, you have the option to create your own custom rules. I recommend having a look at the manufacturing example to see how custom functions can be utilised for your scenario.
Process to run the DQX Flow
A key consideration is whether you want to run DQX as part of your existing ETL process or as a completely standalone process; DQX is suitable for both. For the insurance customer, the decision was to develop a standalone process outside of the existing ETL pipelines. An example of a standalone process:

The diagram above shows an example end-to-end process for running DQX across various business domains.
- Orchestration Table: Each record references one business domain, e.g. Customers. This table acts as a configuration table when a job/pipeline is run: for each record, an independent run of the common parameterised notebook is initiated, with parameters supplied from that record (row).
- Business Domains: Each business domain exists as a table or a view in Databricks’ Unity Catalog.
- Rules View (see section 'Storing Data Quality Rules'): The view will be read and filtered to only the rules which are applicable to the business domain, e.g. Customers, and where rules are active.
- The common parameterised notebook will run for each business domain and save the DQX results in a separate table.
Let’s walk through a scenario. Assume we have two business domains, Customers and Finance, and that all the rules have already been added to the rules table. To run the process, we add two records to the orchestration table. Example:

Note that we also have metadata columns in the orchestration table. The _is_active column gives us the ability to “switch off” business domains where the DQX process does not need to run. During the common notebook run, the business_domain and dataset_location values are passed as parameters to the notebook.
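As an example of how this orchestration could be wired up, below is a minimal sketch of a driver notebook that loops over the active orchestration records and triggers the common parameterised notebook for each one. The table name and notebook path are illustrative; in practice you may prefer a Databricks Workflows for-each task over a driver loop.

# Assumes a Databricks notebook context (spark and dbutils are available).
orchestration_df = (
    spark.table("workspace.config.orchestration")  # illustrative table name
         .filter("_is_active = true")
)

for row in orchestration_df.collect():
    # Run the common parameterised notebook once per business domain.
    dbutils.notebook.run(
        "/Workspace/Shared/dqx_common_notebook",  # illustrative notebook path
        3600,                                     # timeout in seconds
        {
            "business_domain": row["business_domain"],
            "dataset_location": row["dataset_location"],
        },
    )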
In the common parameterised notebook, to find the rules related to a single business domain only, we can construct a Python string variable that concatenates the business_domain and the literal string 'true' (e.g. ‘Customers-true’). This can then be used as the run_config_name when loading the checks from the rules view to get only the applicable rules. Example:

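As an illustration, here is a minimal sketch of how the common notebook could build the run_config_name and turn the filtered view rows into the metadata (dictionary) format that DQX accepts. Newer DQX versions also offer built-in helpers for loading checks straight from a Delta table, so treat this as one possible approach rather than the only one:

# Assumes a Databricks notebook context (spark and dbutils are available).
business_domain = dbutils.widgets.get("business_domain")   # e.g. 'Customers'
run_config_name = f"{business_domain}-true"                # e.g. 'Customers-true'

rules_df = (
    spark.table("workspace.config.vw_rules")
         .filter(f"run_config_name = '{run_config_name}'")
)

# Convert each row into the dictionary format expected by the DQX engine.
checks = []
for row in rules_df.collect():
    check = row["check"].asDict(recursive=True)
    if not check.get("for_each_column"):
        check.pop("for_each_column", None)  # drop empty arrays for single-column rules
    rule = {"name": row["name"], "criticality": row["criticality"], "check": check}
    if row["filter"]:
        rule["filter"] = row["filter"]
    checks.append(rule)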
Next, the DQX engine can be run. As the input dataset location is passed in as a parameter to the notebook (the dataset_location column from the orchestration table), you can kick off the DQX engine against that dataset.
There are various ways you can utilise the output, e.g. quarantine bad records, or simply flag the issues in the _errors and _warnings columns and save the result as a table. I recommend saving the results so you can post-process them later. If the input dataset is a Databricks view with complex underlying logic, saving the DQX results before proceeding with any post-processing is strongly recommended.
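Putting it together, a hedged sketch of the engine run could look like the following, assuming the checks list built above; the results table name is illustrative:

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

dataset_location = dbutils.widgets.get("dataset_location")  # e.g. 'workspace.silver.customers'
input_df = spark.table(dataset_location)

dq_engine = DQEngine(WorkspaceClient())

# Flag issues in the _errors/_warnings result columns rather than quarantining.
result_df = dq_engine.apply_checks_by_metadata(input_df, checks)

# Persist the annotated output for later post-processing (illustrative table name).
result_df.write.mode("overwrite").saveAsTable(
    f"workspace.dq_results.{business_domain.lower()}_results"
)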
During the implementation, if a rule referenced a column that did not exist in the input dataset, the DQX process failed entirely. This should now be fixed in v0.10.0.
Storing Data Quality Rules
DQX provides various ways to store the data quality rules, including JSON, YAML, programmatic and table format options. When choosing how to store the rules, some of the questions you should ask are:
- Who will be responsible for maintaining the rules?
- How will rule changes/additions be reviewed and approved?
- Do we need auditability or version control for rule changes?
- How many rules do we expect to manage over time?
- Will rules need to support multiple data sources or domains?
- How easy is it for stakeholders to understand and modify rules?
- Do we need a user interface for rule management?
- Should rules be documented alongside their logic?
The answers to some of the questions could lead you to choose a particular rule storage option. For the insurance customer implementation, a tabular storage approach was used. An introduction to using the tabular approach can be found in the official GitHub documentation (you may need to expand the section 'Saving quality rules in a Delta table without using DQX methods').
When using the table storage approach, I preferred to have a table and then a view that references it; the view is effectively the input to the DQX engine:

The code snippet below shows an example of how the rules table could be created. The metadata columns are good to have for audit purposes. I recommend having an '_is_active' (or equivalent) column so you can switch off rules that are not needed as part of the DQX run.
In my table, rather than having a single 'check' column for the DQX engine, I have split it into three columns (arguments, function and for_each_column) for readability. This also helps when inserting/updating rules; for example, if only the rule's syntax needs to change, you can update the arguments column only.
The view effectively creates a struct from these three columns and filters the table to rules marked as active (_is_active = true). Additionally, in the view, run_config_name concatenates the business domain and the is_active flag so it can later be used to filter only the rules applicable to a particular business domain (e.g. Customers).
Table Creation:
CREATE TABLE workspace.config.rules (
  id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
  business_domain STRING,
  name STRING,
  criticality STRING,
  `function` STRING,
  for_each_column ARRAY<STRING>,
  arguments MAP<STRING, STRING>,
  filter STRING,
  user_metadata MAP<STRING, STRING>,
  -- Metadata columns:
  _created_timestamp TIMESTAMP,
  _modified_timestamp TIMESTAMP,
  _created_by STRING,
  _modified_by STRING,
  _is_active BOOLEAN
) USING DELTA;
View Creation:
CREATE VIEW workspace.config.vw_rules AS (
  SELECT
    id,
    name,
    criticality,
    named_struct(
      'function', function,
      'for_each_column', for_each_column,
      'arguments', arguments
    ) AS check,
    filter,
    concat_ws('-', business_domain, `_is_active`) AS run_config_name,
    user_metadata
  FROM workspace.config.rules
  WHERE `_is_active` = true
);

Below is an example of an INSERT INTO statement that adds a rule to the rules table. Depending on how rules should be inserted, you could automate this functionality, for example if new rules are added as part of a process or via a user interface (e.g. Databricks Apps).
INSERT INTO workspace.config.rules (
  business_domain,
  name,
  criticality,
  function,
  for_each_column,
  arguments,
  filter,
  user_metadata,
  -- Metadata columns:
  _created_timestamp,
  _modified_timestamp,
  _created_by,
  _modified_by,
  _is_active
)
VALUES (
  'Customers',
  'CustomerAgeThreshold',
  'warn',
  'is_in_range',
  array(),
  map('column', 'age', 'min_limit', '18', 'max_limit', '120'),
  null,
  map('check_owner', 'dummy@email.com'),
  -- Metadata columns:
  current_timestamp(),
  current_timestamp(),
  'example',
  'example',
  true
);

When querying the view, the below output is visible:

With the above approach, each row represents a rule. You can mix and match rule types, including built-in functions (e.g. is_in_range), sql_expression checks and even custom rules; the function column in the rules table just needs to store the correct function type/name.
A few recommendations when storing rules in a table:
- If your rule expression contains the double quote character ("), Databricks will escape that character with a backslash (\) when the rule is stored in the rules table. For example, if you have a sql_expression type rule and you insert it with arguments in the format:
map('expression', 'Country in ("Germany", "France", "Italy")', 'msg', "Country is not in defined list") -- arguments column
then when you view the rules table or rules view, you will see the double quotes escaped with backslashes. DQX will still work with the escaped characters when reading the arguments. However, for readability and debugging purposes this is not advisable (e.g. a data engineer may want to copy and paste the expression to query a table with a WHERE clause). Instead of double quotes, use single quotes (').
- Ensure you store the arguments in string format. For example, even numeric arguments should be stored as strings (in single or double quotes). Example:

Next Steps
For the insurance customer, the next steps are to build a data model for Power BI (the reporting tool). However, if you are currently looking to explore data quality dashboard reports, I recommend having a look at the reports that DQX already comes with (Databricks dashboards).
If your plan is to quarantine the data (i.e. not allow it to flow downstream, e.g. from silver to gold), you may need to establish a process for how data can be rectified (e.g. at source or at the current layer) and then allowed to flow downstream. In certain situations, quarantining data may not be suitable. For example, with financial transactions, missing or delayed numerical values such as amounts, balances, or timestamps can lead to inaccurate reporting that compromises the integrity of financial statements.
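If you do go down the quarantine route, DQX can split the input into valid and quarantined records in a single call. A minimal sketch, reusing the engine, input DataFrame and checks from the common notebook sketches above (table names are illustrative):

# Split the input into rows that pass the checks and rows to quarantine.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)

# Only valid rows continue downstream; quarantined rows are kept for rectification.
valid_df.write.mode("append").saveAsTable("workspace.silver.customers_clean")
quarantine_df.write.mode("append").saveAsTable("workspace.dq_results.customers_quarantine")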
Conclusion
This blog has highlighted a case study of how Databricks' Data Quality Framework (DQX) is being used at an insurance company. Using the DQX pattern highlighted in this blog, you will be able to build a scalable data quality solution. DQX is a great way to save time, reduce costs and open the door to future use cases. It is open-source, free, and integrates easily with your existing Databricks ecosystem.
Even if you have an existing data quality platform, migrating your rules to DQX can simplify operations and improve scalability. Personally, I find storing rules in a tabular format makes them easier to manage and update. Adding an audit trail to track rule changes, e.g. capturing who made the change, when it happened, and why, can significantly improve transparency and governance. Using Databricks' Unity Catalog to manage this audit process brings structure and simplicity, making it a clean and scalable solution.
If you're looking to level up your data quality processes, DQX is definitely worth exploring. Our team at Advancing Analytics can help with your data quality journey. Get in touch today to find out more!
Author
Sargithan Senthilselvan