BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives
Aarush Sinha, Pavan Kumar S, Roshan Balaji, Nirav Pravinbhai Bhatt

TL;DR
This paper introduces BiCA, a biomedical dense retrieval method that uses the documents an article cites as hard negatives, improving retrieval performance in biomedical and scientific domains with minimal fine-tuning.
Contribution
BiCA utilizes citation-aware hard negatives from PubMed to enhance domain-specific dense retrieval models, demonstrating state-of-the-art results with minimal data and fine-tuning.
Findings
Improved zero-shot dense retrieval performance on BEIR and LoTTE datasets.
Citation-informed negatives lead to better domain adaptation and retrieval accuracy.
State-of-the-art results with minimal fine-tuning in biomedical retrieval tasks.
Abstract
Hard negatives are essential for training effective retrieval models. Hard-negative mining typically relies on ranking documents using cross-encoders or static embedding models based on similarity metrics such as cosine distance. Hard negative mining becomes challenging for biomedical and scientific domains due to the difficulty in distinguishing between source and hard negative documents. However, referenced documents naturally share contextual relevance with the source document but are not duplicates, making them well-suited as hard negatives. In this work, we propose BiCA: Biomedical Dense Retrieval with Citation-Aware Hard Negatives, an approach for hard-negative mining by utilizing citation links in 20,000 PubMed articles for improving a domain-specific small dense retriever. We fine-tune the GTE_small and GTE_Base models using these citation-informed negatives and observe…
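The citation-based mining idea described in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The names `Triplet`, `mine_citation_negatives`, and `info_nce`, the toy PubMed-style ids, the `max_negatives` cap, and the temperature value are all hypothetical. The abstract does not say where queries come from or which loss is used, so the sketch assumes each source article already has a query string and uses a standard InfoNCE contrastive loss as a stand-in.

```python
import math
from typing import Dict, List, NamedTuple, Sequence


class Triplet(NamedTuple):
    query: str
    positive: str          # id of the source article (assumed to answer the query)
    negatives: List[str]   # ids of articles cited by the source article


def mine_citation_negatives(
    queries: Dict[str, str],
    citations: Dict[str, List[str]],
    max_negatives: int = 2,
) -> List[Triplet]:
    """Pair each query with its source article and use cited articles as hard negatives."""
    triplets = []
    for doc_id, query in queries.items():
        # Cited articles share context with the source but are not duplicates of it.
        cited = [c for c in citations.get(doc_id, []) if c != doc_id]
        if cited:  # skip articles with no usable citation links
            triplets.append(Triplet(query, doc_id, cited[:max_negatives]))
    return triplets


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(
    query_vec: Sequence[float],
    pos_vec: Sequence[float],
    neg_vecs: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """Negative log-softmax score of the positive among the positive and the negatives."""
    logits = [cosine(query_vec, pos_vec) / temperature]
    logits += [cosine(query_vec, n) / temperature for n in neg_vecs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


# Toy data: article A cites B, C, and D; article E has no query and is ignored.
queries = {"pmid_A": "toy query about article A"}
citations = {"pmid_A": ["pmid_B", "pmid_C", "pmid_D"], "pmid_E": []}
triplets = mine_citation_negatives(queries, citations)

# Toy 2-d embeddings: the loss is near zero when the query matches the positive
# and large when it matches a hard negative instead.
easy_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a real pipeline the embeddings would come from the dense retriever being fine-tuned, and the loss would be minimized over batches of such triplets; the hand-written vectors here only show how the loss responds when a hard negative outranks the positive.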
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Information Retrieval and Search Behavior · Topic Modeling
