CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora
Nathan Paull

TL;DR
CustomIR is an unsupervised framework that fine-tunes dense embeddings for specific document corpora using synthetic queries generated by large language models, significantly improving retrieval performance without human annotation.
Contribution
The paper introduces an unsupervised domain adaptation method for dense embeddings that trains on synthetic query-document pairs generated by LLMs, reducing reliance on labeled data.
Findings
Small models improved Recall@10 by up to 2.3 points
Performance rivals larger models, enabling cheaper RAG deployments
Targeted synthetic fine-tuning enhances domain-specific retrieval effectiveness
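For readers unfamiliar with the metric behind the first finding, here is a minimal sketch of how Recall@k is commonly computed. The function name `recall_at_k` and the document ids are hypothetical, and the paper's own evaluation code may differ. If "points" means percentage points, a 2.3-point gain is 0.023 on the 0 to 1 scale used here.

```python
from typing import Sequence, Set


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Mean over queries of (relevant docs found in the top k) / (all relevant docs)."""
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one relevant set per query")
    per_query = []
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if not relevant:
            continue  # queries with no labeled relevant document are skipped
        hits = len(set(ranking[:k]) & relevant)
        per_query.append(hits / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Two toy queries: the first finds its relevant document in the top 2, the second does not.
rankings = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d9"}]
score = recall_at_k(rankings, relevant, k=2)
```

When each synthetic query has exactly one source document, this reduces to the fraction of queries whose source document appears in the top k results.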
Abstract
Dense embedding models have become critical for modern information retrieval, particularly in RAG pipelines, but their performance often degrades when applied to specialized corpora outside their pre-training distribution. To address this, we introduce CustomIR, a framework for unsupervised adaptation of pre-trained language embedding models to domain-specific corpora using synthetically generated query-document pairs. CustomIR leverages large language models (LLMs) to create diverse queries grounded in a known target corpus, paired with LLM-verified hard negatives, eliminating the need for costly human annotation. Experiments on enterprise email and messaging datasets show that CustomIR consistently improves retrieval effectiveness, with small models gaining up to 2.3 points in Recall@10. This performance increase allows these small models to rival the performance of much larger…
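The abstract does not state which training objective CustomIR uses. As an illustration only, the sketch below shows one common way to train on (query, positive document, hard negatives) examples: an InfoNCE-style contrastive loss over cosine similarities. The function names, the temperature value of 0.05, and the toy vectors are all assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def contrastive_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not from the paper
) -> float:
    """Negative log-probability of the positive among positive + hard negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidate documents.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


# Toy embeddings: the loss is near zero when the query sits next to its
# positive, and large when it sits next to the hard negative instead.
loss_aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_confused = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing a loss of this shape pulls each synthetic query toward the document it was generated from and pushes it away from the verified hard negatives, which matches the intent the abstract describes.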
