Scalable Unit Harmonization in Medical Informatics via Bayesian-Optimized Retrieval and Transformer-Based Re-ranking
Jordi de la Torre

TL;DR
This paper presents a scalable, hybrid system combining lexical, semantic, and transformer-based methods to harmonize inconsistent units in large clinical datasets, significantly improving accuracy and reducing manual effort.
Contribution
The paper introduces a novel multi-stage pipeline that integrates BM25, sentence embeddings, Bayesian optimization, and a transformer re-ranker for unit harmonization in medical data.
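To illustrate how a lexical score and a semantic score can be fused into a single ranking, here is a minimal sketch. It is not the paper's implementation. The BM25 parameters `k1` and `b`, the min-max normalization, the weighted-sum fusion, and the weight `alpha` are all assumptions made for this example, and a character-trigram cosine similarity stands in for real sentence embeddings so that the code runs with the standard library alone. A weight such as `alpha` is the kind of parameter that Bayesian optimization could tune. The function names and sample strings are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; keeps unit strings such as 'mg/dl' whole."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def char_ngrams(text, n=3):
    """Bag of character n-grams; an assumed stand-in for a sentence embedding."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query, docs, alpha=0.5):
    """Rank docs by alpha * lexical + (1 - alpha) * semantic (assumed fusion rule)."""
    lexical = min_max(bm25_scores(query, docs))
    query_vec = char_ngrams(query)
    semantic = min_max([cosine(query_vec, char_ngrams(d)) for d in docs])
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(zip(docs, combined), key=lambda pair: pair[1], reverse=True)


# Hypothetical laboratory entries, for illustration only.
candidates = ["glucose mg/dL serum", "hemoglobin g/dL blood", "sodium mmol/L serum"]
for doc, score in hybrid_rank("glucose mg/dl", candidates):
    print(f"{score:.3f}  {doc}")
```

In a real system the `char_ngrams` and `cosine` pair would be replaced by vectors from a sentence-embedding model, and the top candidates would then be passed to the re-ranker.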
Findings
Hybrid retrieval outperforms lexical-only and embedding-only methods.
The transformer re-ranker raises MRR by 0.10 over hybrid retrieval alone (0.8833), reaching 0.9833.
System achieves over 83% precision at rank 1 and 95% recall at rank 5.
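The metrics named in these findings can be computed as in the sketch below. It uses the standard textbook definitions of Mean Reciprocal Rank, precision at rank 1, and recall at rank k. The exact evaluation protocol is not given in this summary, so the function names, the toy rankings, and the data layout (a ranked candidate list paired with a set of correct answers) are assumptions for illustration only.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def precision_at_1(runs):
    """Fraction of queries whose top-ranked candidate is relevant."""
    return sum(1 for ranked, relevant in runs if ranked and ranked[0] in relevant) / len(runs)


def recall_at_k(runs, k):
    """Average fraction of each query's relevant items found in the top k."""
    total = 0.0
    for ranked, relevant in runs:
        total += len(set(ranked[:k]) & relevant) / len(relevant)
    return total / len(runs)


# Toy data: the correct unit "a" appears at rank 1, 2, and 4 respectively.
toy_runs = [
    (["a", "b", "c", "d"], {"a"}),
    (["b", "a", "c", "d"], {"a"}),
    (["b", "c", "d", "a"], {"a"}),
]
print(f"MRR: {mean_reciprocal_rank(toy_runs):.4f}")
print(f"P@1: {precision_at_1(toy_runs):.4f}")
print(f"R@3: {recall_at_k(toy_runs, 3):.4f}")

# Consistency check on the reported figures: hybrid MRR plus the re-ranker gain.
print(f"0.8833 + 0.10 = {0.8833 + 0.10:.4f}")
```

When each query has exactly one correct answer, as in the toy data, recall at rank k reduces to the fraction of queries whose answer appears in the top k.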
Abstract
Objective: To develop and evaluate a scalable methodology for harmonizing inconsistent units in large-scale clinical datasets, addressing a key barrier to data interoperability. Materials and Methods: We designed a novel unit harmonization system combining BM25, sentence embeddings, Bayesian optimization, and a bidirectional transformer-based binary classifier for retrieving and matching laboratory test entries. The system was evaluated using the Optum Clinformatics Datamart dataset (7.5 billion entries). We implemented a multi-stage pipeline: filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation. Performance was assessed using Mean Reciprocal Rank (MRR) and other standard information retrieval metrics. Results: Our hybrid retrieval approach combining BM25 and sentence embeddings (MRR: 0.8833) significantly outperformed both…
