Lessons Learned on Information Retrieval in Electronic Health Records: A Comparison of Embedding Models and Pooling Strategies
Skatje Myers, Timothy A. Miller, Yanjun Gao, Matthew M. Churpek, Anoop Mayampurath, Dmitriy Dligach, Majid Afshar

TL;DR
This study systematically compares various embedding models and pooling strategies to optimize information retrieval from electronic health records, revealing that model choice and query formulation critically influence retrieval effectiveness in clinical NLP tasks.
Contribution
It provides an empirical ablation analysis of embedding models and pooling methods, highlighting the superior performance of a general-domain model (BGE) in clinical retrieval tasks.
Findings
BGE outperforms medical-specific models across datasets
Pooling strategies significantly affect retrieval performance
Model and query variability impact results and transferability
Abstract
Objective: Applying large language models (LLMs) to the clinical domain is challenging due to the context-heavy nature of processing medical records. Retrieval-augmented generation (RAG) offers a solution by facilitating reasoning over large text sources. However, there are many parameters to optimize in just the retrieval system alone. This paper presents an ablation study exploring how different embedding models and pooling methods affect information retrieval for the clinical domain. Methods: Evaluating on three retrieval tasks on two electronic health record (EHR) data sources, we compared seven models, including medical- and general-domain models, specialized encoder embedding models, and off-the-shelf decoder LLMs. We also examine the choice of embedding pooling strategy for each model, independently on the query and the text to retrieve. Results: We found that the choice of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsData Quality and Management
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Linear Layer · Attention Dropout · Dense Connections · Multi-Head Attention · Linear Warmup With Linear Decay · Weight Decay · Adam · WordPiece
