A Late-Fusion Multimodal AI Framework for Privacy-Preserving Deduplication in National Healthcare Data Environments

Mohammed Omer Shakeel Ahmed

arXiv:2603.04595 · cs.LG · March 27, 2026

Open Access

TL;DR

This paper introduces a privacy-preserving, multimodal AI framework for deduplication in healthcare data that combines textual, behavioral, and device data using late fusion and clustering, avoiding reliance on sensitive identifiers.

Contribution

The paper presents a novel, scalable multimodal AI approach that uses late fusion and clustering for privacy-compliant duplicate detection in healthcare data.

Findings

01

Achieved good F1-score in duplicate detection

02

Effectively handled data variations and noise

03

Demonstrated privacy-preserving capabilities

Abstract

Duplicate records pose significant challenges in customer relationship management (CRM) and healthcare, often leading to inaccuracies in analytics, impaired user experiences, and compliance risks. Traditional deduplication methods rely heavily on direct identifiers such as names, emails, or Social Security Numbers (SSNs), making them ineffective under strict privacy regulations like GDPR and HIPAA, where such personally identifiable information (PII) is restricted or masked. In this research, I propose a novel, scalable, multimodal AI framework for detecting duplicates without depending on sensitive information. This system leverages three distinct modalities: semantic embeddings derived from textual fields (names, cities) using pre-trained DistilBERT models, behavioral patterns extracted from user login timestamps, and device metadata encoded through categorical embeddings. These…
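
The abstract is cut off before it describes the fusion and clustering steps, so the following is only a minimal sketch of one common reading of late fusion: each modality is scored separately and the per-modality similarities are then averaged. Everything in it is an assumption made for illustration, not the paper's implementation. Hashed character trigrams stand in for DistilBERT embeddings, an hour-of-day histogram stands in for the behavioral features, a one-hot vector stands in for learned categorical device embeddings, and threshold-based grouping with union-find stands in for whichever clustering method the paper uses. The function names, field names, weights, and the 0.8 threshold are hypothetical.

```python
import math
import zlib
from datetime import datetime


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def text_vector(text, dim=64):
    # ASSUMPTION: hashed character trigrams, a stand-in for DistilBERT embeddings.
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % dim] += 1.0
    return vec


def login_vector(timestamps):
    # ASSUMPTION: a 24-bin hour-of-day histogram as the behavioral feature.
    vec = [0.0] * 24
    for ts in timestamps:
        vec[datetime.fromisoformat(ts).hour] += 1.0
    return vec


def device_vector(device, vocabulary):
    # ASSUMPTION: one-hot encoding, a stand-in for a learned categorical embedding.
    return [1.0 if device == name else 0.0 for name in vocabulary]


def fused_similarity(a, b, vocabulary, weights=(1.0, 1.0, 1.0)):
    """Late fusion: score each modality on its own, then take a weighted average."""
    scores = (
        cosine(text_vector(a["name"] + " " + a["city"]),
               text_vector(b["name"] + " " + b["city"])),
        cosine(login_vector(a["logins"]), login_vector(b["logins"])),
        cosine(device_vector(a["device"], vocabulary),
               device_vector(b["device"], vocabulary)),
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def group_duplicates(records, vocabulary, threshold=0.8):
    """Assign a group label per record; records at or above the threshold are merged."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if fused_similarity(records[i], records[j], vocabulary) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(records))]


# Hypothetical toy records: the first two differ only by a name spelling variant.
records = [
    {"name": "Jon Smith", "city": "Austin", "device": "ios",
     "logins": ["2026-03-01T08:15:00", "2026-03-02T09:05:00"]},
    {"name": "John Smith", "city": "Austin", "device": "ios",
     "logins": ["2026-03-03T08:40:00", "2026-03-04T09:30:00"]},
    {"name": "Maria Garcia", "city": "Denver", "device": "android",
     "logins": ["2026-03-01T22:10:00", "2026-03-02T23:45:00"]},
]
labels = group_duplicates(records, vocabulary=["ios", "android"])
```

In this toy data the first two records share a group label and the third does not, because matching login hours and device type compensate for the spelling difference in the name. The all-pairs loop is quadratic in the number of records, so a sketch like this would need blocking or an approximate nearest-neighbor index before it could approach the scale the abstract has in mind.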

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Machine Learning in Healthcare