MedEIR: A Specialized Medical Embedding Model for Enhanced Information Retrieval

Anand Selvadurai; Jasheen Shaik; Girish Chandrasekar; ShriRadhaKrishnan Balamurugan; Eswara Reddy

arXiv:2505.13482 · cs.IR · May 21, 2025

TL;DR

MedEIR is a new medical embedding model with a joint tokenizer and ALiBi-based long-context support, achieving superior performance on diverse NLP benchmarks with fewer pre-training tokens.

Contribution

Introduces MedEIR, a versatile embedding model optimized for medical and general NLP tasks, with innovative long-context processing and efficient training.

Findings

01

Outperforms Jina V2 and MiniLM on multiple benchmarks

02

Supports sequences up to 8,192 tokens with ALiBi-based processing

03

Achieves top scores on medical and general NLP datasets
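
For readers unfamiliar with ALiBi (Attention with Linear Biases), the idea behind the long-context finding can be sketched in a few lines. ALiBi replaces positional embeddings with a per-head penalty added to attention scores that grows linearly with the distance between tokens, which is what lets a model handle sequences longer than those it mostly saw in training. The sketch below is not MedEIR's implementation. It assumes the symmetric |i - j| variant commonly used in bidirectional encoders and a power-of-two number of heads; the function names alibi_slopes and alibi_bias are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: the geometric sequence 2^(-8k/n) for k = 1..n.

    Assumption: num_heads is a power of two, the simple case in the
    original ALiBi formulation.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch assumes a power-of-two head count")
    return [2.0 ** (-8.0 * k / num_heads) for k in range(1, num_heads + 1)]


def alibi_bias(num_heads, seq_len):
    """Additive attention bias: bias[h][i][j] = -slope_h * |i - j|.

    Assumption: symmetric distance, as suits a bidirectional encoder.
    The bias is added to the query-key scores before the softmax.
    """
    slopes = alibi_slopes(num_heads)
    return [
        [[-m * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for m in slopes
    ]


if __name__ == "__main__":
    bias = alibi_bias(num_heads=8, seq_len=4)
    # Distant tokens get a larger penalty than neighbouring ones.
    print(bias[0][0][1], bias[0][0][3])
```

Because the bias depends only on relative distance, the same function applies unchanged at any sequence length, including the 8,192-token limit reported above.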

Abstract

Embedding models have become essential for retrieval-augmented generation (RAG) tasks, semantic clustering, and text re-ranking. Despite their growing use, many of these models come with notable limitations. For example, Jina fails to capture the semantic content of medical documents, while models such as MiniLM often perform poorly on long-form documents. Domain-adapted models, while specialized, often underperform on general-purpose tasks, reducing their overall applicability. General-domain tokenizers often misinterpret medical vocabulary. The limitations of current embedding models, whether in tokenization accuracy, domain comprehension, or handling long sequences, highlight the need for more versatile solutions. In this work, we present MedEIR, a novel embedding model and tokenizer jointly optimized for both medical and general NLP tasks, incorporating ALiBi-based long-context…

Taxonomy

Topics: Artificial Intelligence in Healthcare