AI-assisted Entity Resolution on Databricks for a global media group
Industry
Media & Entertainment
Challenge
An existing deterministic matching process linked company records to a third-party entity provider, but it achieved only 60% matching accuracy, required manual remediation, and struggled with operational naming (brand and subsidiary variations).
Results
We delivered a Databricks-based, AI-assisted enrichment and operational naming approach that uses a hybrid rules + ML + GenAI pattern. It is designed to improve entity matching beyond the existing baseline, reduce manual remediation, and handle incomplete or variant data at scale.
Solution Type
AI, MLflow, Databricks, Governance
Overview
The Challenge
A key pain point was “operational naming”: automatically grouping related names and brands so the organisation can recognise when multiple references point to the same parent entity.
The Solution
1) Discovery, profiling, and prioritisation of gaps
We ran discovery to confirm enrichment use cases and define how missing fields should be inferred or enriched, then profiled approximately 40k records to identify common gaps such as missing location fields.
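As an illustration of the profiling step, the sketch below counts how often each field is missing and ranks fields by gap rate. It is a minimal example using plain Python; the field names and sample records are hypothetical, since the source only states that location fields were a common gap.

```python
from collections import Counter

# Hypothetical field names; only "missing location fields" comes from the project.
FIELDS = ["name", "country", "city", "website"]

def profile_missing(records, fields=FIELDS):
    """Return the share of records in which each field is missing or blank."""
    missing = Counter()
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                missing[field] += 1
    total = len(records)
    return {f: (missing[f] / total if total else 0.0) for f in fields}

# Invented sample records for illustration only.
sample = [
    {"name": "Acme Media Ltd", "country": "GB", "city": "London", "website": "acme.example"},
    {"name": "Acme Studios", "country": "", "city": None, "website": "acme.example"},
    {"name": "Northwind Pictures", "country": "US", "city": "", "website": None},
    {"name": "Northwind", "country": None, "city": None, "website": None},
]

gap_rates = profile_missing(sample)
# Rank fields by how often they are missing, worst first.
priorities = sorted(gap_rates, key=gap_rates.get, reverse=True)
```

At the scale of tens of thousands of records the same counts would more likely be computed with Spark aggregations, but the output (a gap rate per field) is what drives prioritisation.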
2) A hybrid rules + ML + GenAI approach on Databricks
Rather than replacing existing matching logic, we enhanced it with a hybrid pattern combining rules, machine learning, and GenAI to better handle variant data, ambiguity, and incomplete records.
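One way to picture the hybrid pattern is as a cascade: deterministic rules run first, a scoring step handles near matches, and only the remaining ambiguous names are passed to a GenAI step. The sketch below is an assumption about how such a cascade could look, not the project's implementation. String similarity from `difflib` stands in for the ML model, `llm_stub` stands in for a model call, and the reference data and threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical reference data standing in for the third-party entity provider.
REFERENCE = {"E1": "Acme Media Group", "E2": "Northwind Pictures"}

def rule_match(name):
    """Deterministic step: exact match after trimming and lowercasing."""
    key = name.strip().lower()
    for entity_id, ref_name in REFERENCE.items():
        if ref_name.lower() == key:
            return entity_id
    return None

def scored_match(name, threshold=0.85):
    """Stand-in for the ML step: best string similarity at or above a threshold."""
    best_id, best_score = None, 0.0
    for entity_id, ref_name in REFERENCE.items():
        score = SequenceMatcher(None, name.lower(), ref_name.lower()).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None

def resolve(name, llm_fallback):
    """Try rules, then the scorer, then hand ambiguous names to a GenAI callable."""
    entity_id = rule_match(name)
    if entity_id:
        return entity_id, "rules"
    entity_id = scored_match(name)
    if entity_id:
        return entity_id, "ml"
    return llm_fallback(name), "genai"

def llm_stub(name):
    # Placeholder for a model call; a fixed lookup keeps the example deterministic.
    return {"AMG Holdings": "E1"}.get(name)

exact = resolve("acme media group ", llm_stub)
fuzzy = resolve("Acme Media Grp", llm_stub)
ambiguous = resolve("AMG Holdings", llm_stub)
```

Returning the stage alongside the match makes it possible to report how much each layer contributes and keeps the existing deterministic logic as the first, cheapest step.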
3) Agentic enrichment using Databricks AI capabilities
We designed an agentic framework where agents follow a structured procedure for estimating attributes in “bronze” category records, leveraging Databricks AI capabilities to add contextual reasoning on top of fuzzy matching. The approach supports categorising enriched records into tiers (for example, bronze for inferred data and silver for higher-confidence enrichment).
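The structured procedure and tiering can be sketched as follows: each estimator proposes a value with a confidence and a short evidence note, the most confident proposal wins, and the tier follows from the confidence. This is a simplified illustration. The estimators, the 0.8 threshold, and the field names are assumptions, and in the real workflow an agent with model-backed reasoning would take the place of these hand-written functions.

```python
from dataclasses import dataclass, field

SILVER_THRESHOLD = 0.8  # hypothetical cut-off, not taken from the project

@dataclass
class Enrichment:
    record_id: str
    attribute: str
    value: str
    confidence: float
    evidence: list = field(default_factory=list)

    @property
    def tier(self):
        # Higher-confidence enrichment is silver; other inferred values stay bronze.
        return "silver" if self.confidence >= SILVER_THRESHOLD else "bronze"

def enrich_country(record, estimators):
    """Run estimators in order and keep the most confident estimate, if any."""
    best = None
    for estimate in estimators:
        result = estimate(record)  # (value, confidence, note) or None
        if result is None:
            continue
        value, confidence, note = result
        if best is None or confidence > best.confidence:
            best = Enrichment(record["id"], "country", value, confidence, [note])
    return best

def from_domain(record):
    # Weak signal: a country-code top-level domain on the website.
    site = record.get("website") or ""
    return ("GB", 0.6, "website ends in .uk") if site.endswith(".uk") else None

def from_phone(record):
    # Stronger signal: an international dialling prefix on the phone number.
    phone = record.get("phone") or ""
    return ("GB", 0.9, "phone starts with +44") if phone.startswith("+44") else None

steps = [from_domain, from_phone]
weak = enrich_country({"id": "r1", "website": "acme.co.uk"}, steps)
strong = enrich_country({"id": "r2", "website": "acme.co.uk", "phone": "+44 20 0000 0000"}, steps)
nothing = enrich_country({"id": "r3"}, steps)
```

Keeping the evidence note with each inferred value means a reviewer can see why a record landed in a given tier.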
4) Business value with AI/BI and Genie experiences
5) Operational naming as a first-class capability
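A minimal sketch of name grouping, under stated assumptions, is shown below. It normalises names, removes legal suffixes, and greedily groups names whose normalised forms are similar. The suffix list, threshold, and sample names are hypothetical. This covers only spelling and legal-suffix variation; linking a brand or subsidiary to its parent generally needs reference data or model-based reasoning on top.

```python
import re
from difflib import SequenceMatcher

# Illustrative, non-exhaustive list of legal suffixes to ignore.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "llc", "plc", "gmbh"}

def normalise(name):
    """Lowercase, drop punctuation, and remove legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def group_names(names, threshold=0.85):
    """Greedily assign each name to the first group whose key is similar enough."""
    groups = {}  # normalised key -> list of original names
    for name in names:
        key = normalise(name)
        for existing in groups:
            if SequenceMatcher(None, key, existing).ratio() >= threshold:
                groups[existing].append(name)
                break
        else:
            # No similar group found, so this name starts a new one.
            groups[key] = [name]
    return groups

# Invented sample names for illustration only.
names = [
    "Acme Media Ltd",
    "ACME Media, Inc.",
    "Acme Medla Limited",  # deliberate typo
    "Northwind Pictures LLC",
    "Northwind Pictures",
]
groups = group_names(names)
```

Greedy grouping depends on input order and compares each name against every existing group, so at scale a blocking key or a dedicated clustering step would be a more likely choice.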
Tools & Technologies
Built on Databricks using an agentic enrichment approach, hybrid rules + ML + GenAI patterns, and MLflow-based tracking to support governed evaluation at scale.
- Databricks for orchestration of the enrichment workflow and scalable processing.
- MLflow for tracking and evaluation of LLM responses and enrichment outcomes.
- ai_query() and tiering (bronze/silver) to distribute and manage inferences at scale.
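To show how inference might be distributed with `ai_query()`, the sketch below builds a Databricks SQL statement that calls a model serving endpoint once per record and labels the output as bronze. The table name, column names, endpoint name, and prompt are all hypothetical, and this is only one plausible shape for such a query.

```python
def build_enrichment_query(source_table, endpoint, limit=100):
    """Build a Databricks SQL statement that calls ai_query() for each record.

    The arguments are interpolated directly into the SQL text, so they must be
    trusted values, never user input.
    """
    prompt = (
        "Estimate the ISO country code for this company. "
        "Reply with the code only. Company: "
    )
    return f"""
SELECT
  record_id,
  company_name,
  ai_query('{endpoint}', CONCAT('{prompt}', company_name)) AS inferred_country,
  'bronze' AS tier
FROM {source_table}
WHERE country IS NULL
LIMIT {limit}
""".strip()

# Hypothetical table and endpoint names.
query = build_enrichment_query("main.entities.company_bronze", "my-llm-endpoint")
```

In a Databricks notebook the statement could be run with `spark.sql(query)`, and the returned responses logged to MLflow for evaluation.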
The Results
The delivered approach created a scalable foundation for improving entity resolution beyond the existing 60% baseline by enriching missing fields, reducing ambiguity in matching, and systematically handling naming variations. It also reduced reliance on manual remediation by formalising enrichment steps into a repeatable agent workflow.
Critically, the work established a pattern that can be evaluated and governed: enriched records are tiered, inference can be distributed using platform-native capabilities, and responses are tracked through MLflow to support observability and improvement over time.
