Introduction
How we cut false-positive name detections from 13,536 to 823 in a marine insurance enterprise search pipeline by replacing traditional Named Entity Recognition with LLM-based reasoning.
When a client asked us to build an intelligent enterprise search system over their marine insurance document corpus, we hit a problem that looked simple on the surface: find and redact personally identifiable information (PII) before indexing. Names, emails, phone numbers. The usual suspects. What we didn't anticipate was just how spectacularly traditional approaches would fail in a domain where the language itself is designed to confuse.
This is the story of how we benchmarked four PII detection approaches, discovered that conventional NER tools were redacting ship names and insurance companies instead of people, and ultimately deployed a reasoning model that understood the difference.
The Problem: PII in Enterprise Search
Our client needed to make thousands of marine insurance documents (claims reports, loss prevention analyses, legal correspondence) searchable across their organisation. The compliance requirement was clear: no personally identifiable information could persist in the search index. Every person's name, email address, and phone number needed to be detected and redacted before indexing.
On paper, this is a well-trodden path. Tools like Microsoft's Presidio and GLiNER are purpose-built for exactly this kind of entity detection. They're fast, inexpensive, and widely adopted across industries. We started there.
Within days, we realised the marine insurance domain breaks every assumption these tools rely on. We considered cross-referencing global datasets of vessel names to solve the ambiguity, but that path proved unreliable. Even the best available registries suffer from inherent data quality issues and are never 100% accurate. Consequently, we could not rely on external lists to guarantee the strict compliance we needed.
When "Baltic Panther" Becomes a Person
Consider this sentence from a real legal ruling:
"Lord Denning ruled that the vessel Baltic Panther was unseaworthy."
A human reader parses this instantly. "Lord Denning" is a judge. "Baltic Panther" is a ship. "Unseaworthy" is a legal finding. But traditional NER models don't reason. They pattern-match. They've learned that capitalised multi-word phrases near verbs are often person names. And in marine insurance, that heuristic is catastrophic.
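A toy illustration of why that heuristic fails. The regex below is a crude stand-in for the pattern "capitalised multi-word phrase, therefore candidate name" — real NER models use learned features rather than a regex, but they trip over this sentence for an analogous reason:

```python
import re

# Crude stand-in for the NER heuristic: flag any capitalised
# multi-word phrase as a candidate person name. Illustrative only.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]+\s)+[A-Z][a-z]+\b")

sentence = "Lord Denning ruled that the vessel Baltic Panther was unseaworthy."
print(CANDIDATE.findall(sentence))
# Flags both the judge AND the ship: ['Lord Denning', 'Baltic Panther']
```

Nothing in the surface pattern distinguishes the judge from the vessel; only the surrounding context ("the vessel ...") does.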
Our initial testing made the scale of the problem impossible to ignore. Across 4,434 benchmark samples, Presidio generated 13,536 false positive name detections. It flagged pronouns ("I"), vessel names ("ASL Scorpio"), organisations ("Deloitte & Touche"), and even countries ("Argentina", "Singapore") as person names. It also repeated the same name up to 16 times within a single document.
GLiNER was marginally better but still produced 6,108 false positives, frequently classifying the pronoun "I" as a named entity and tagging descriptors like "prominent businessman" and "seasoned data scientist" as person names.
Both tools achieved roughly 90% recall. They found most of the genuine names. But their precision was devastatingly low: 22.7% for Presidio and 39.6% for GLiNER. In practical terms, for every real name Presidio correctly identified, it also flagged roughly three things that weren't people at all.
In a domain saturated with vessel names, P&I club references, classification societies, and legal terminology, this isn't a minor inconvenience. Incorrectly redacting "Lloyd's Register" or "Gard P&I" corrupts the documents and renders the search index unreliable.
The Hypothesis: Reasoning Models Can Disambiguate
The fundamental limitation of NER models is that they classify tokens in isolation. They see "Baltic Panther" and ask "does this look like a name?" rather than "in the context of this sentence about maritime law, what is this entity?"
LLMs with reasoning capabilities do something qualitatively different. They process the full context of a sentence, understand domain-specific patterns, and apply genuine disambiguation. We hypothesised that this contextual intelligence would close the precision gap without sacrificing recall.
We tested two LLM-based approaches (GPT-4.1-nano and GPT-5-nano) against the same benchmarks, alongside Presidio and GLiNER, to find out.
The Benchmark: 4,434 Kaggle PII Samples, Four Approaches
We used a publicly available Kaggle PII dataset with verified ground-truth labels, giving us an objective, external baseline. Every method was measured on the same data under the same conditions.
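For transparency, this is the shape of the scoring logic we applied to each method's output. It is a minimal sketch: the real harness also handled character offsets, overlapping spans, and duplicate detections, which are omitted here.

```python
# Minimal sketch of per-document scoring against ground truth.
# Exact string matching on entity text; the production harness
# also matched on character offsets (assumption: simplified here).
def score(predicted: set, ground_truth: set):
    tp = len(predicted & ground_truth)   # correctly found entities
    fp = len(predicted - ground_truth)   # false alarms
    fn = len(ground_truth - predicted)   # missed entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# One correct hit, one false alarm, one miss -> 0.5 precision, 0.5 recall
p, r = score({"John Smith", "Baltic Panther"}, {"John Smith", "Jane Doe"})
print(p, r)
```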
Overall Performance
| Method | Recall | Precision | F1 Score | Avg Response Time |
|---|---|---|---|---|
| GPT-5-nano | 75.0% | 91.7% | 82.5 | 3,001ms |
| GPT-4.1-nano | 75.9% | 88.7% | 81.8 | 1,250ms |
| GLiNER | 77.7% | 62.5% | 69.3 | 263ms |
| Presidio | 73.0% | 41.6% | 53.0 | 1,555ms |
The story here is precision. When GPT-5-nano identifies something as PII, it is correct 91.7% of the time. That directly translates to fewer false redactions, less manual review, and a search index people can actually trust.
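The F1 column is the harmonic mean of precision and recall, which is why Presidio's decent recall cannot rescue its score — the harmonic mean punishes the weaker of the two numbers. A quick check reproduces the table:

```python
# F1 is the harmonic mean of precision and recall; a weak
# precision drags the score down regardless of recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

for name, recall, precision in [("GPT-5-nano", 0.750, 0.917),
                                ("GPT-4.1-nano", 0.759, 0.887),
                                ("GLiNER", 0.777, 0.625),
                                ("Presidio", 0.730, 0.416)]:
    print(f"{name}: F1 = {100 * f1(precision, recall):.1f}")
```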
Name Detection: Where the Gap Becomes a Chasm
Name detection is where LLMs truly separate themselves, and where it matters most in marine insurance.
| Method | Recall | Precision | F1 | False Positives |
|---|---|---|---|---|
| GPT-5-nano | 90.2% | 82.9% | 86.4 | 823 |
| GPT-4.1-nano | 86.1% | 80.7% | 83.3 | 910 |
| GLiNER | 90.1% | 39.6% | 55.0 | 6,108 |
| Presidio | 89.5% | 22.7% | 36.2 | 13,536 |
All four methods find a comparable proportion of genuine names. The difference lies entirely in what they incorrectly flag. GPT-5-nano produces 823 false positives across the entire benchmark. Presidio produces over 16 times that number.
Real Documents, Real Differences
The Kaggle benchmark told us GPT-5-nano was the strongest option. But benchmarks are benchmarks. We needed to validate against actual marine insurance documents. We ran both LLM models against 500 real claims and loss prevention reports from the client's corpus.
The results were striking. GPT-5-nano detected 3,009 names across the 500 documents compared to GPT-4.1-nano's 1,064, a difference of 183%. Both models agreed on 711 names (shared true positives). The divergence was in what each model found uniquely.
GPT-4.1-nano's 215 unique detections were predominantly false positives: vessel names (ASL Scorpio, Baltic Panther, Globe Explorer), organisations (Fish Health Inspectorate, Deloitte & Touche), partial words, and countries.
GPT-5-nano's 2,045 unique detections were predominantly genuine person names: full names, academic citations (Altinok, I., Bebak, J.), titled individuals (Lord Denning, Sir Nigel Teare), and contextual surnames correctly identified within legal text.
One case study encapsulated the difference perfectly. When processing the UNCTAD Maritime Transport Report, a document full of country statistics, company names, and shipping data, GPT-4.1-nano flagged 5 "names," all of which were false positives (companies and regions). GPT-5-nano correctly identified that the document contained no person names at all.
That's the difference between pattern-matching and reasoning.
The Cost Question
The obvious trade-off is cost. GPT-5-nano costs roughly £8,657 to process 5 million documents versus £4,977 for GPT-4.1-nano, approximately 74% more. But cost doesn't exist in isolation.
| Factor | GPT-4.1-nano | GPT-5-nano |
|---|---|---|
| Cost (5M documents) | £4,977 | £8,657 |
| Processing time (5M docs) | ~34 days | ~25 days |
| Estimated names detected | ~10.6M | ~30.1M |
| Manual review required | High | Minimal |
| Compliance risk | Higher (missed PII) | Lower |
The additional £3,680 buys you 183% more names detected, 9 days faster processing, near-elimination of manual false positive review, and significantly reduced compliance risk. For an enterprise deployment where a single missed name redaction could trigger a regulatory issue, that's not a cost. It's insurance.
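The arithmetic behind that trade-off, using the figures quoted above:

```python
# Sanity-check the quoted figures.
cost_41, cost_5 = 4977, 8657       # £ to process 5M documents
names_41, names_5 = 1064, 3009     # names found in the 500-document sample

premium = cost_5 - cost_41                      # extra spend
uplift = (names_5 - names_41) / names_41        # extra names detected
print(f"premium: £{premium}, extra names detected: {uplift:.0%}")
# premium: £3680, extra names detected: 183%
```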
What's Next: Prompt Engineering for Even Better Results
One of the advantages of an LLM-based approach is that improvements don't require retraining models or changing infrastructure. They require better prompts. We've identified several targeted enhancements on our roadmap.
Phone number detection is currently the weakest area across all methods, with no approach achieving recall above 54%. We're developing a two-pass detection strategy that combines format matching with keyword-proximity scanning (looking for numbers near contextual cues like "Tel:", "Mob:", "Contact:"), along with maritime-specific jurisdiction formats for UK, Singapore, Hong Kong, Norway, and satellite phone numbers. We expect this to push phone recall from roughly 52% towards 65–75%.
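The two-pass idea can be sketched as follows. The patterns here are deliberately simplified placeholders — the jurisdiction-specific formats and cue keywords on our roadmap are more extensive than this:

```python
import re

# Pass 1: a loose phone-number format (placeholder pattern; the
# real strategy uses jurisdiction-specific formats for UK,
# Singapore, Hong Kong, Norway and satellite numbers).
FORMAT_PASS = re.compile(r"\+?\d[\d\s\-()]{7,14}\d")

# Pass 2: contextual cues near the candidate boost confidence.
CUES = re.compile(r"(?:Tel|Mob|Contact|Phone|Fax)\s*[:.]", re.IGNORECASE)

def find_phone_numbers(text: str, window: int = 40):
    hits = []
    for m in FORMAT_PASS.finditer(text):
        # Look back a fixed window for a contact-style keyword.
        nearby = text[max(0, m.start() - window):m.start()]
        hits.append((m.group().strip(), bool(CUES.search(nearby))))
    return hits

sample = "Tel: +44 20 7946 0958. Claim ref 2023-884."
print(find_phone_numbers(sample))
# The claim reference is too short to match; the number near
# "Tel:" matches with its cue flag set.
```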
Domain-specific priming is another lever. Adding explicit context to the system prompt, telling the model it's processing marine insurance documents, should further reduce name false positives by priming the model to expect vessel names, P&I club terminology, and classification society references.
Structured disambiguation guidance for edge cases, where an entity could be either a person or a vessel or organisation, can help the model resolve ambiguity more consistently through lightweight chain-of-thought reasoning.
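Combining domain priming with disambiguation guidance, a system prompt along these lines is what we have in mind. This is illustrative wording only, not the production prompt:

```python
# Hypothetical domain-primed system prompt (illustrative wording,
# not the production prompt).
SYSTEM_PROMPT = """You are a PII detector for marine insurance documents.
Extract only genuine person names, email addresses and phone numbers.
These documents routinely mention vessel names, P&I clubs,
classification societies and company names; never flag those.

For ambiguous capitalised phrases, reason briefly before deciding:
1. Could this be a vessel? (look for cues like 'the vessel', 'MV', 'SS')
2. Could this be an organisation? (club, society, registry, Ltd, LLP)
3. Only if neither fits, consider it a person name."""

print(SYSTEM_PROMPT.splitlines()[0])
```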
The projected impact of these combined improvements would push overall F1 from 82.5 towards 86–89, with name precision climbing from 82.9% towards 88–92%. All through prompt updates alone.
The Takeaway
If you're building PII detection for a specialised domain, the instinct to reach for established NER tooling is understandable. These tools are mature, fast, and well-documented. But our testing showed they lack the one capability that matters most in complex domains: contextual reasoning.
The shift from pattern-matching to reasoning models isn't just an incremental improvement. It's a qualitative change in what automated PII detection can do. A system that understands "Lord Denning" is a judge and "Baltic Panther" is a ship isn't just more accurate. It's solving a fundamentally different problem than one that sees two capitalised words and guesses.
For our client, that meant the difference between a search index they could trust and one that would require endless manual correction. For the broader industry, it signals that domain-specific NLP tasks are increasingly better served by models that can reason about language, not just recognise patterns in it.
Ready to tackle PII detection in your own specialist domain?
We help organisations build intelligent search systems that handle sensitive data with the precision compliance demands. Whether you're working with marine insurance documents, legal corpora, or any domain where traditional NER tools fall short, we can help you find the right approach.
Get in touch to discuss your enterprise search requirements, or book a short call to walk through how reasoning-based PII detection could work for your data.
Author
Toyosi Babayeju