Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging
Sameh Khattab, Jean-Philippe Corbeil, Osman Alperen Koraş, Amin Dada, Julian Friedrich, François Beaulieu, Paul Vozila, Jens Kleesiek

TL;DR
This paper introduces STM, a modular framework that improves biomedical retrieval by synthesizing training data, optimizing prompts, and merging models, achieving better performance without extensive retraining.
Contribution
The paper proposes STM, a novel approach combining synthetic data, prompt optimization, and model merging to adapt general LLMs into effective biomedical retrievers.
Findings
STM boosts task-specific experts by up to 23.5% (7.5% on average) on a subset of 12 medical and general MTEB tasks.
Merged models outperform single experts and baselines.
The approach maintains general capabilities while excelling in biomedical tasks.
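To make the merging step more concrete, here is a minimal sketch of one simple merging scheme: a weighted linear average of expert parameters. This is an illustration only. The summary above does not say which merging method STM uses, and the function name `merge_linear`, the dict-of-lists weight format, and the coefficients are all hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical weight format: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def merge_linear(experts: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Weighted average of expert parameters (an assumed scheme, not necessarily STM's)."""
    if not experts:
        raise ValueError("need at least one expert")
    if len(experts) != len(coeffs):
        raise ValueError("one coefficient per expert is required")
    total = sum(coeffs)
    if total == 0:
        raise ValueError("coefficients must not sum to zero")
    # Normalize so the coefficients sum to 1.
    norm = [c / total for c in coeffs]

    merged: Weights = {}
    for name, reference in experts[0].items():
        # Every expert must share the same parameter names and sizes.
        for expert in experts:
            if name not in expert or len(expert[name]) != len(reference):
                raise ValueError(f"parameter mismatch for {name!r}")
        merged[name] = [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(len(reference))
        ]
    return merged


# Two toy "experts" with a single two-element parameter each.
expert_a = {"w": [1.0, 3.0]}
expert_b = {"w": [3.0, 5.0]}
print(merge_linear([expert_a, expert_b], [1.0, 1.0]))  # {'w': [2.0, 4.0]}
```

With real models, the same idea would apply to each tensor in the checkpoints, assuming the experts were finetuned from the same base model so that their parameters line up.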
Abstract
Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, several technical aspects remain underexplored on how to adapt general-purpose LLMs into effective domain-specific retrievers, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5% (average 7.5%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
