Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval
Yubai Wei, Jiale Han, Yi Yang

TL;DR
This paper introduces BMEmbed, a method that adapts general-purpose text embedding models to private, domain-specific datasets using keyword-based retrieval signals, improving retrieval performance.
Contribution
The paper presents BMEmbed, which builds supervisory signals from the ranking of BM25 keyword-retrieval results and uses them to adapt general-purpose embedding models to private datasets with specialized terminology.
Findings
BMEmbed consistently improves retrieval performance across the domains, datasets, and models evaluated.
BM25 signals help align embeddings with domain-specific terminology.
Empirical analysis indicates that adapting with BM25-based signals improves embedding quality.
Abstract
Text embedding models play a cornerstone role in AI applications, such as retrieval-augmented generation (RAG). While general-purpose text embedding models demonstrate strong performance on generic retrieval benchmarks, their effectiveness diminishes when applied to private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and lingo. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. By leveraging the well-established keyword-based retrieval technique (BM25), we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals…
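To make the idea of "supervisory signals from the ranking of keyword-based retrieval results" concrete, here is a minimal Python sketch. It is not the authors' implementation. The whitespace tokenizer, the BM25 parameters `k1` and `b`, the helper names `bm25_scores` and `ranking_to_training_example`, and the choice to take positives from the top of the ranking and negatives from the bottom are all assumptions made for illustration. The abstract does not say how BMEmbed samples from the ranking or which loss it trains with.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real pipeline would
    # likely use a proper tokenizer.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25.

    k1 and b are common default values, not values taken from the paper.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that is always non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def ranking_to_training_example(query, docs, n_pos=1, n_neg=2):
    """Turn a BM25 ranking into one (query, positives, negatives) example.

    Hypothetical sampling rule: top-ranked documents become positives and
    bottom-ranked documents become negatives.
    """
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return {
        "query": query,
        "ranking": order,
        "positives": [docs[i] for i in order[:n_pos]],
        "negatives": [docs[i] for i in order[-n_neg:]],
    }


if __name__ == "__main__":
    corpus = [
        "the flux capacitor requires 1.21 gigawatts",
        "quarterly revenue report for the sales team",
        "flux capacitor maintenance checklist and flux calibration",
        "holiday schedule for the office",
    ]
    example = ranking_to_training_example("flux capacitor calibration", corpus)
    print(example["ranking"])
    print(example["positives"])
```

Examples built this way could then feed a contrastive or ranking-based fine-tuning objective for the embedding model. Which objective to use is a separate design choice that this sketch leaves open.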
