AI-assisted Entity Resolution on Databricks for a global media group

Industry

Media & Entertainment

Challenge

An existing deterministic matching process linked company records to a third-party entity provider but only achieved 60% matching accuracy, required manual remediation, and struggled with operational naming (brand and subsidiary variations).

Results

A Databricks-based, AI-assisted enrichment and operational naming approach designed to improve entity matching beyond the existing baseline, reduce manual remediation, and handle incomplete or variant data at scale using a hybrid rules + ML + GenAI pattern.

Solution Type

AI, MLflow, Databricks, Governance

Overview

A global advertising and media group needed a faster, more accurate way to connect company records across internal systems, without relying on manual cleanup to compensate for missing fields, naming inconsistencies, and brand variations. Their existing matching workflow was effective for straightforward cases, but it capped out at around 60% accuracy and created a growing manual burden.
 
Advancing Analytics delivered an AI-assisted approach on Databricks to enrich incomplete records and improve operational naming, using an agentic pattern that can scale across thousands of records of varying quality.

The Challenge

The organisation’s entity resolution relied on deterministic matching via a third-party entity provider, which meant it struggled when records were incomplete, ambiguous, or inconsistent (for example, variations in how businesses and brands are named across systems). This increased manual remediation and slowed efforts to create a reliable, centralised view of company entities.


A key pain point was “operational naming”: automatically grouping related names and brands so the organisation could recognise when multiple references actually point to the same parent entity.
 

The Solution

 

1) Discovery, profiling, and prioritisation of gaps

We ran discovery to confirm enrichment use cases and define how missing fields should be inferred or enriched, then profiled approximately 40k records to identify common gaps such as missing location fields.
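For illustration, a minimal profiling sketch in PySpark that counts missing or blank values per column is shown below; the table name entity_resolution.company_records is a hypothetical placeholder, and the snippet assumes a Databricks notebook where spark is already available.

```python
from pyspark.sql import functions as F

# Hypothetical source table of company records (name is illustrative).
records = spark.table("entity_resolution.company_records")

# Count null or blank values per column to surface the most common gaps,
# such as missing location fields, across the profiled records.
gap_profile = records.select([
    F.sum(
        F.when(
            F.col(c).isNull() | (F.trim(F.col(c).cast("string")) == ""), 1
        ).otherwise(0)
    ).alias(c)
    for c in records.columns
])

gap_profile.show(truncate=False)
```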

2) A hybrid rules + ML + GenAI approach on Databricks

Rather than replacing existing matching logic, we enhanced it with a hybrid pattern combining rules, machine learning, and GenAI to better handle variant data, ambiguity, and incomplete records.
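As a rough sketch of this pattern (not the delivered logic), the example below applies deterministic normalisation rules first, falls back to a fuzzy similarity score, and escalates only the ambiguous middle band to a GenAI judgement; the thresholds and suffix list are illustrative assumptions.

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Rule layer: lower-case and strip common legal suffixes (illustrative list)."""
    name = name.lower().strip()
    for suffix in (" ltd", " limited", " inc", " plc", " gmbh"):
        name = name.removesuffix(suffix)
    return name

def hybrid_match(candidate: str, reference: str) -> str:
    """Illustrative hybrid flow: rules -> fuzzy/ML score -> GenAI fallback."""
    # 1) Deterministic rule: exact match after normalisation.
    if normalise(candidate) == normalise(reference):
        return "match (rule)"

    # 2) Fuzzy/ML layer: similarity score on the normalised names.
    score = SequenceMatcher(None, normalise(candidate), normalise(reference)).ratio()
    if score >= 0.92:
        return "match (fuzzy)"
    if score < 0.60:
        return "no match"

    # 3) Ambiguous middle band: defer to a GenAI judgement
    #    (e.g. a call to a model serving endpoint) for contextual reasoning.
    return "escalate to GenAI"

print(hybrid_match("ACME Media Ltd", "Acme Media"))  # -> match (rule)
```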

3) Agentic enrichment using Databricks AI capabilities

We designed an agentic framework where agents follow a structured procedure for estimating attributes in “bronze” category records, leveraging Databricks AI capabilities to add contextual reasoning on top of fuzzy matching. The approach supports categorising enriched records into tiers (for example, bronze for inferred data and silver for higher-confidence enrichment).
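A simplified sketch of the tiering idea is shown below; the confidence thresholds, tier names, and EnrichmentResult structure are illustrative assumptions rather than the production design.

```python
from dataclasses import dataclass

@dataclass
class EnrichmentResult:
    field: str
    value: str
    confidence: float  # 0.0-1.0, as reported by the enrichment step
    source: str        # e.g. "inferred", "fuzzy_match", "llm"

def tier(result: EnrichmentResult) -> str:
    """Assign a quality tier to an enriched attribute.

    Thresholds and tier names are illustrative: 'bronze' for inferred,
    lower-confidence values and 'silver' for higher-confidence enrichment.
    """
    if result.confidence >= 0.85 and result.source != "inferred":
        return "silver"
    return "bronze"

# Example: an agent infers a missing country field with moderate confidence.
print(tier(EnrichmentResult("country", "United Kingdom", 0.7, "inferred")))  # bronze
```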

4) Governed, trackable agent workflow with MLflow and ai_query()

The agent workflow was designed to adhere to governance rules, with responses tracked and evaluated using MLflow, and platform-native functions such as ai_query() used to distribute inference at scale.
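As an example of the tracking side, the sketch below logs one batch of agent responses and evaluation metrics to MLflow; the run name, endpoint name, parameters, and metric values are all illustrative assumptions.

```python
import mlflow

# Illustrative: track one batch of agent enrichment responses so that
# outcomes can be evaluated and governed over time.
with mlflow.start_run(run_name="agentic_enrichment_batch"):
    # Assumed serving endpoint name and batch size.
    mlflow.log_param("model_endpoint", "databricks-meta-llama-3-1-70b-instruct")
    mlflow.log_param("batch_size", 500)

    # Hypothetical evaluation outputs from the enrichment run.
    mlflow.log_metric("records_enriched", 436)
    mlflow.log_metric("silver_share", 0.62)
    mlflow.log_metric("escalated_to_review", 18)

    # Keep a sample of raw responses as an artifact for auditability.
    mlflow.log_dict(
        {"record_id": "abc-123", "field": "country", "response": "United Kingdom"},
        "sample_response.json",
    )
```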
 

5) Operational naming as a first-class capability

We targeted operational naming so related brands and naming variants can be grouped consistently, reducing fragmentation across internal datasets and improving the usefulness of downstream views.
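The sketch below illustrates the general idea with a greedy normalise-and-compare grouping; the variant names, normalisation rules, and similarity threshold are illustrative and not the delivered operational naming logic.

```python
from difflib import SequenceMatcher

def canonical_key(name: str) -> str:
    """Normalise a brand/operational name for grouping (illustrative rules)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def group_variants(names: list[str], threshold: float = 0.8) -> dict[str, list[str]]:
    """Greedy grouping of naming variants under a representative key."""
    groups: dict[str, list[str]] = {}
    for name in names:
        key = canonical_key(name)
        for rep in groups:
            if SequenceMatcher(None, key, rep).ratio() >= threshold:
                groups[rep].append(name)
                break
        else:
            groups[key] = [name]
    return groups

# Example naming variants that should resolve to the same parent entity.
print(group_variants(["Acme Media Ltd", "ACME Media", "Acme Media Limited"]))
```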
 
 

Tools & Technologies

Built on Databricks using an agentic enrichment approach, hybrid rules + ML + GenAI patterns, and MLflow-based tracking to support governed evaluation at scale.
 
    • Databricks for orchestration of the enrichment workflow and scalable processing.
    • MLflow for tracking and evaluation of LLM responses and enrichment outcomes. 
    • ai_query() and tiering (bronze/silver) to distribute and manage inferences at scale.
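As a sketch of how ai_query() can distribute inference from a notebook, the example below infers a missing country field for records that lack one; the serving endpoint name, table name, columns, and prompt are assumptions, and the snippet assumes a Databricks environment where spark and ai_query() are available.

```python
# Illustrative: distribute LLM inference over a table of records using the
# Databricks SQL ai_query() function; names below are hypothetical.
enriched = spark.sql("""
    SELECT
      record_id,
      company_name,
      ai_query(
        'databricks-meta-llama-3-1-70b-instruct',  -- assumed serving endpoint
        CONCAT('Return the most likely headquarters country for: ', company_name)
      ) AS inferred_country
    FROM entity_resolution.company_records
    WHERE country IS NULL
""")

enriched.show(truncate=False)
```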

The Results

The delivered approach created a scalable foundation for improving entity resolution beyond an existing 60% baseline by enriching missing fields, reducing ambiguity in matching, and systematically handling naming variations. It also reduced reliance on manual remediation by formalising enrichment steps into a repeatable agent workflow.

Critically, the work established a pattern that can be evaluated and governed: enriched records are tiered, inference can be distributed using platform-native capabilities, and responses are tracked through MLflow to support observability and improvement over time.
