Towards Domain Specification of Embedding Models in Medicine
Mohammad Khodadad, Ali Shiraee Kasmaee, Mahdi Astaraki, Hamidreza Mahyar

TL;DR
This paper introduces MEDTE, a comprehensive medical text embedding model trained on diverse data, and a new benchmark suite of 51 tasks to evaluate medical embeddings, demonstrating superior performance over existing models.
Contribution
It presents MEDTE, a robust medical text embedding model trained on diverse corpora, and a comprehensive benchmark suite tailored for medical NLP tasks, addressing current limitations.
Findings
MEDTE outperforms existing models across multiple medical NLP tasks.
The benchmark suite covers 51 diverse medical text tasks.
MEDTE demonstrates robustness and generalization in real-world applications.
Abstract
Medical text embedding models are foundational to a wide array of healthcare applications, ranging from clinical decision support and biomedical information retrieval to medical question answering, yet they remain hampered by two critical shortcomings. First, most models are trained on a narrow slice of medical and biological data and rely on outdated methodology, making them ill-suited to capture the diversity of terminology and semantics encountered in practice. Second, existing evaluations are often inadequate: even widely used benchmarks fail to generalize across the full spectrum of real-world medical tasks. To address these gaps, we leverage MEDTE, a GTE model extensively fine-tuned on diverse medical corpora through self-supervised contrastive learning across multiple data sources, to deliver robust medical text embeddings. Alongside this model, we propose…
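The abstract mentions self-supervised contrastive learning but does not say which objective is used. As an illustration only, the sketch below assumes an InfoNCE-style loss with in-batch negatives, a common choice for training text embedding models. The function names, the temperature value, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    positives[i] is the matching text for anchors[i]; every other row of
    `positives` serves as an in-batch negative for that anchor.
    The temperature of 0.05 is an assumed value, not one from the paper.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy 2-d "embeddings": aligned pairs should give a much lower loss
# than the same vectors paired with the wrong partner.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

Minimizing this kind of loss pulls each anchor toward its paired text and pushes it away from the other texts in the batch, which is why the aligned pairing scores lower than the swapped one.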
