CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora

Nathan Paull

arXiv:2510.21729 · cs.IR · October 29, 2025

TL;DR

CustomIR is an unsupervised framework that fine-tunes dense embeddings for specific document corpora using synthetic queries generated by large language models, significantly improving retrieval performance without human annotation.

Contribution

It introduces a novel unsupervised domain adaptation method for dense embeddings using synthetic query-document pairs generated by LLMs, reducing reliance on labeled data.
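
As a rough illustration of how a synthetic query, its source document, and a set of hard negatives could be turned into a training signal, the sketch below computes an InfoNCE-style contrastive loss for a single query. The loss choice, the `temperature` value, and the function names are assumptions made for illustration; the summary above does not state which objective CustomIR actually uses.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(query, positive, hard_negatives, temperature=0.05):
    """InfoNCE-style loss for one query (an assumed objective, not the paper's).

    query, positive: embedding vectors for a synthetic query and its source document.
    hard_negatives: embeddings of documents judged NOT to answer the query.
    """
    q = normalize(query)
    # The positive goes first so its logit sits at index 0.
    candidates = [normalize(positive)] + [normalize(n) for n in hard_negatives]
    logits = [dot(q, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among the candidates.
    return log_sum_exp - logits[0]
```

The loss is small when the query embedding is closer to its source document than to any hard negative, and grows as a negative outranks the positive, which is the pressure that fine-tuning on such pairs would apply to the embedding model.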

Findings

01

Small models improved Recall@10 by up to 2.3 points

02

Performance rivals larger models, enabling cheaper RAG deployments

03

Targeted synthetic fine-tuning enhances domain-specific retrieval effectiveness

Abstract

Dense embedding models have become critical for modern information retrieval, particularly in RAG pipelines, but their performance often degrades when applied to specialized corpora outside their pre-training distribution. To address this, we introduce CustomIR, a framework for unsupervised adaptation of pre-trained language embedding models to domain-specific corpora using synthetically generated query-document pairs. CustomIR leverages large language models (LLMs) to create diverse queries grounded in a known target corpus, paired with LLM-verified hard negatives, eliminating the need for costly human annotation. Experiments on enterprise email and messaging datasets show that CustomIR consistently improves retrieval effectiveness, with small models gaining up to 2.3 points in Recall@10. This performance increase allows these small models to rival the performance of much larger…
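
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how Recall@K is commonly computed for a single query and averaged over a query set. The function names and the document IDs in the usage lines are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / len(relevant)


def mean_recall_at_k(runs, k=10):
    """Average Recall@K over (ranked_ids, relevant_ids) pairs, one per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores)


# Hypothetical IDs: one of the two relevant documents is in the top 2.
example = recall_at_k(["d3", "d1", "d7"], ["d1", "d9"], k=2)
```

With one relevant document per query, as is typical when each synthetic query is generated from a single source document, this reduces to the share of queries whose source document is retrieved in the top K.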
