Scalable Unit Harmonization in Medical Informatics via Bayesian-Optimized Retrieval and Transformer-Based Re-ranking

Jordi de la Torre

arXiv:2505.00810·cs.LG·November 18, 2025

TL;DR

This paper presents a scalable, hybrid system combining lexical, semantic, and transformer-based methods to harmonize inconsistent units in large clinical datasets, significantly improving accuracy and reducing manual effort.

Contribution

The author introduces a novel multi-stage pipeline that integrates BM25, sentence embeddings, Bayesian optimization, and a transformer reranker for unit harmonization in medical data.

Findings

01. Hybrid retrieval outperforms lexical-only and embedding-only methods.

02. Transformer reranker improves MRR by 0.10, achieving 0.9833 overall.

03. System achieves over 83% precision at rank 1 and 95% recall at rank 5.
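The metrics quoted in these findings can be illustrated with a small sketch. It assumes each query (a source unit string) has exactly one correct harmonized target, in which case precision at rank 1 and recall at rank k both reduce to a top-k hit rate. The function names and the toy rankings are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def reciprocal_rank(ranked: Sequence[str], relevant: str) -> float:
    """Return 1/rank of the correct candidate, or 0.0 if it is not in the list."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold)) / len(gold)


def hit_rate_at_k(rankings: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose correct target appears in the top k candidates."""
    return sum(g in r[:k] for r, g in zip(rankings, gold)) / len(gold)


# Toy data (illustrative only): three queries, each with a ranked candidate list.
rankings = [
    ["mg/dL", "g/dL", "mmol/L"],       # correct target at rank 1
    ["mmol/L", "mg/dL", "g/L"],        # correct target at rank 2
    ["10^3/uL", "10^9/L", "cells/uL"], # correct target at rank 3
]
gold = ["mg/dL", "mg/dL", "cells/uL"]

mrr = mean_reciprocal_rank(rankings, gold)   # (1 + 1/2 + 1/3) / 3
p_at_1 = hit_rate_at_k(rankings, gold, k=1)  # one of three queries is correct at rank 1
r_at_3 = hit_rate_at_k(rankings, gold, k=3)  # all three targets appear in the top 3
```

Under this reading, an MRR close to 1 means the correct unit is almost always the first suggestion, which is what reduces manual validation effort.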

Abstract

Objective: To develop and evaluate a scalable methodology for harmonizing inconsistent units in large-scale clinical datasets, addressing a key barrier to data interoperability.

Materials and Methods: We designed a novel unit harmonization system combining BM25, sentence embeddings, Bayesian optimization, and a bidirectional transformer-based binary classifier for retrieving and matching laboratory test entries. The system was evaluated using the Optum Clinformatics Datamart dataset (7.5 billion entries). We implemented a multi-stage pipeline: filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation. Performance was assessed using Mean Reciprocal Rank (MRR) and other standard information retrieval metrics.

Results: Our hybrid retrieval approach combining BM25 and sentence embeddings (MRR: 0.8833) significantly outperformed both…
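As a rough illustration of what a hybrid lexical and semantic retrieval step might look like, the sketch below fuses a BM25 score with a vector-similarity score. It rests on several assumptions not stated in this excerpt: the fusion is a weighted sum of min-max normalized scores, `alpha` is a hypothetical weight of the kind Bayesian optimization could tune, and a character-trigram cosine similarity stands in for real sentence embeddings so that the example runs with the standard library only. It is not the paper's implementation.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the standard BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def trigram_vector(text):
    """Stand-in for a sentence embedding (assumption): character trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so the two signals are comparable before fusion."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query, docs, alpha=0.5):
    """Rank docs by alpha * lexical + (1 - alpha) * semantic (hypothetical fusion rule)."""
    lexical = min_max(bm25_scores(query, docs))
    query_vec = trigram_vector(query)
    semantic = min_max([cosine(query_vec, trigram_vector(d)) for d in docs])
    fused = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(zip(docs, fused), key=lambda pair: pair[1], reverse=True)


# Toy catalogue of harmonized targets and one noisy source entry (illustrative only).
candidates = ["glucose mg/dL", "glucose mmol/L", "hemoglobin g/dL", "sodium mmol/L"]
ranking = hybrid_rank("glucose serum mg/dl", candidates, alpha=0.5)
top_candidate = ranking[0][0]
```

In a fuller pipeline of the kind the abstract outlines, the top few candidates from such a ranking would be passed to a re-ranker and then to manual validation; swapping `trigram_vector` for a real embedding model would only change the `semantic` score computation.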
