A Systematic Study of Biomedical Retrieval Pipeline Trade-offs in Performance and Efficiency

Hayk Stepanyan; Matthew McDermott

arXiv:2604.20853·cs.IR·April 24, 2026

TL;DR

This paper empirically analyzes biomedical retrieval pipeline choices, providing practical guidance on optimizing performance and efficiency across various datasets and query types.

Contribution

The paper offers systematic insights into retrieval pipeline design, highlighting effective corpus aggregation, indexing strategies, and chunking methods for biomedical information retrieval.

Findings

01

Corpus aggregation improves retrieval quality.

02

MedRAG/pubmed is Pareto-optimal for biomedical retrieval.

03

FAISS indexing offers favorable speed-efficiency trade-offs.
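
To make the term "Pareto-optimal" in the findings above concrete, here is a minimal sketch of how a Pareto front over retrieval configurations could be computed when each configuration has a quality score (higher is better) and a cost (lower is better). The `Config` type, the configuration names, and every number below are hypothetical placeholders chosen for illustration; they are not results from the paper, and the paper's actual selection procedure may differ.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str          # hypothetical configuration label
    win_rate: float    # quality, higher is better (illustrative value)
    latency_ms: float  # cost, lower is better (illustrative value)


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.win_rate >= b.win_rate and a.latency_ms <= b.latency_ms
    strictly_better = a.win_rate > b.win_rate or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers, for illustration only.
EXAMPLE = [
    Config("config_a", 0.60, 40.0),
    Config("config_b", 0.55, 60.0),  # dominated by config_a
    Config("config_c", 0.70, 90.0),
    Config("config_d", 0.50, 20.0),
]

if __name__ == "__main__":
    for cfg in pareto_front(EXAMPLE):
        print(cfg.name, cfg.win_rate, cfg.latency_ms)
```

A configuration on the front cannot be improved on one axis without getting worse on the other, which is the sense in which a corpus or index setting can be called Pareto-optimal.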

Abstract

Retrieval systems are increasingly used in biomedical and clinical natural language processing applications, yet practical guidance for researchers building such systems is limited. In this work, we provide such guidance through an empirical study of how retrieval pipeline design choices affect performance and efficiency at scale. In particular, we examine retrieval over a variety of existing public biomedical text datasets, using diverse query types (exam-style questions, conversational medical queries, community-asked questions, and non-question formulations) and varying pipeline settings that span corpus selection, chunk granularity, and vector index configuration. Retrieval results are judged with a robust win-rate comparison in an LLM-as-a-judge setting, with human validation. Across these experiments, we identify…
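
The abstract describes judging retrieval results by win-rate comparison. As a rough illustration, the sketch below aggregates pairwise judgments into a per-system win rate. The input format (tuples of two system names and a winner), the `"tie"` label, and the choice to count a tie as half a win are all assumptions made for this example; the paper's actual judging protocol and scoring may differ.

```python
from typing import List, Tuple

# Each judgment is (system_a, system_b, winner), where winner is one of the
# two system names or the string "tie". This format is hypothetical.
Judgment = Tuple[str, str, str]


def win_rate(judgments: List[Judgment], system: str) -> float:
    """Fraction of comparisons involving `system` that it won.

    Ties count as half a win (an assumption, not taken from the paper).
    """
    wins = 0.0
    total = 0
    for a, b, winner in judgments:
        if system not in (a, b):
            continue  # this comparison does not involve the system
        total += 1
        if winner == system:
            wins += 1.0
        elif winner == "tie":
            wins += 0.5
    if total == 0:
        raise ValueError(f"no judgments involve system {system!r}")
    return wins / total


# Made-up judgments, for illustration only.
EXAMPLE_JUDGMENTS = [
    ("pipeline_x", "pipeline_y", "pipeline_x"),
    ("pipeline_x", "pipeline_y", "pipeline_y"),
    ("pipeline_x", "pipeline_y", "tie"),
    ("pipeline_x", "pipeline_z", "pipeline_x"),
]

if __name__ == "__main__":
    for name in ("pipeline_x", "pipeline_y", "pipeline_z"):
        print(name, win_rate(EXAMPLE_JUDGMENTS, name))
```

In practice the `winner` field would come from the LLM judge, and a sample of judgments would be checked against human labels, as the abstract indicates.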
