A Late-Fusion Multimodal AI Framework for Privacy-Preserving Deduplication in National Healthcare Data Environments
Mohammed Omer Shakeel Ahmed

TL;DR
This paper introduces a privacy-preserving, multimodal AI framework for deduplication in healthcare data that combines textual, behavioral, and device data using late fusion and clustering, avoiding reliance on sensitive identifiers.
Contribution
It presents a novel, scalable multimodal AI approach that uses late fusion and clustering for privacy-compliant duplicate detection in healthcare data.
Findings
Achieved good F1-score in duplicate detection
Effectively handled data variations and noise
Demonstrated privacy-preserving capabilities
Abstract
Duplicate records pose significant challenges in customer relationship management (CRM) and healthcare, often leading to inaccuracies in analytics, impaired user experiences, and compliance risks. Traditional deduplication methods rely heavily on direct identifiers such as names, emails, or Social Security Numbers (SSNs), making them ineffective under strict privacy regulations like GDPR and HIPAA, where such personally identifiable information (PII) is restricted or masked. In this research, I propose a novel, scalable, multimodal AI framework for detecting duplicates without depending on sensitive information. This system leverages three distinct modalities: semantic embeddings derived from textual fields (names, cities) using pre-trained DistilBERT models, behavioral patterns extracted from user login timestamps, and device metadata encoded through categorical embeddings. These…
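The abstract is cut off before it describes the fusion and clustering steps, so the following is only a minimal sketch of how late fusion over the three modalities might look, not the paper's implementation. It makes several assumptions: character-trigram Jaccard similarity stands in for cosine similarity between DistilBERT embeddings, a 24-bin login-hour histogram stands in for the behavioral features, exact field matching stands in for categorical device embeddings, and threshold-based connected components stand in for the clustering step. The record fields, weights, and threshold are all hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy records: no emails or SSNs, only the three modalities
# named in the abstract (text fields, login behavior, device metadata).
records = [
    {"id": 0, "name": "Jon Smith", "city": "Austin",
     "login_hours": [8, 9, 21], "device": ("iOS", "Safari")},
    {"id": 1, "name": "John Smith", "city": "Austin",
     "login_hours": [8, 9, 22], "device": ("iOS", "Safari")},
    {"id": 2, "name": "Maria Lopez", "city": "Denver",
     "login_hours": [13, 14], "device": ("Android", "Chrome")},
]

def trigrams(text):
    """Set of character trigrams of the lowercased, padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def text_similarity(a, b):
    """Assumed stand-in for DistilBERT embedding similarity: trigram Jaccard."""
    ta = trigrams(f"{a['name']} {a['city']}")
    tb = trigrams(f"{b['name']} {b['city']}")
    return len(ta & tb) / len(ta | tb)

def hour_histogram(hours):
    """24-bin count of logins per hour of day."""
    hist = [0.0] * 24
    for h in hours:
        hist[h % 24] += 1.0
    return hist

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def behavior_similarity(a, b):
    """Cosine similarity of login-hour histograms."""
    return cosine(hour_histogram(a["login_hours"]), hour_histogram(b["login_hours"]))

def device_similarity(a, b):
    """Fraction of device fields that match exactly."""
    matches = sum(x == y for x, y in zip(a["device"], b["device"]))
    return matches / len(a["device"])

def fused_score(a, b, weights=(0.5, 0.25, 0.25)):
    """Late fusion: each modality is scored on its own, then the scores are combined."""
    scores = (text_similarity(a, b), behavior_similarity(a, b), device_similarity(a, b))
    return sum(w * s for w, s in zip(weights, scores))

def cluster_duplicates(recs, weights=(0.5, 0.25, 0.25), threshold=0.6):
    """Group records whose fused score meets the threshold (union-find)."""
    parent = list(range(len(recs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(recs)), 2):
        if fused_score(recs[i], recs[j], weights) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, rec in enumerate(recs):
        groups.setdefault(find(i), []).append(rec["id"])
    return sorted(sorted(g) for g in groups.values())

clusters = cluster_duplicates(records)
print(clusters)
```

With these toy records, the two Smith entries fall into one cluster and the third record stays on its own. Scoring each modality separately before combining means a masked or missing modality can be down-weighted without retraining the others, which is one common argument for late fusion over feeding all raw features into a single model.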
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Machine Learning in Healthcare
