Making Text Embedders Few-Shot Learners

Chaofan Li; MingHao Qin; Shitao Xiao; Jianlyu Chen; Kun Luo; Yingxia Shao; Defu Lian; Zheng Liu

arXiv:2409.15700·cs.IR·September 25, 2024·2 cites

TL;DR

This paper introduces a few-shot learning approach for text embedding generation using large language models' in-context learning capabilities, achieving state-of-the-art results on multiple benchmarks.

Contribution

The authors propose a novel method leveraging LLMs' in-context learning for high-quality text embeddings, demonstrating significant improvements over existing approaches.

Findings

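1. Achieves new state-of-the-art results on the MTEB and AIR-Bench benchmarks.

2. Retaining the original LLM framework, rather than changing design choices such as the attention mechanism or pooling method, often yields the best results.

3. Integrates task-related few-shot examples into the query side to improve embeddings.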
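The abstract names pooling methods as one of the design choices the authors studied. As an illustration only, the sketch below contrasts two pooling strategies commonly used to turn per-token hidden states into a single embedding: last-token pooling and mean pooling. This summary does not state which one the paper keeps, and the function names `last_token_pool` and `mean_pool` are hypothetical helpers written in plain Python rather than any library's API.

```python
from typing import List

Vector = List[float]


def last_token_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Return the hidden state of the last non-padding token.

    `hidden` holds one vector per token; `mask` is 1 for real tokens
    and 0 for padding (an assumed, conventional attention-mask layout).
    """
    last = max(i for i, m in enumerate(mask) if m == 1)
    return list(hidden[last])


def mean_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Average the hidden states of all non-padding tokens."""
    kept = [h for h, m in zip(hidden, mask) if m == 1]
    dim = len(kept[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


if __name__ == "__main__":
    # Toy hidden states for three tokens; the third is padding.
    hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
    mask = [1, 1, 0]
    print(last_token_pool(hidden, mask))
    print(mean_pool(hidden, mask))
```

With a causal (decoder-only) attention mask, only the final token has attended to the whole input, which is one reason last-token pooling is a natural fit for unmodified LLMs; mean pooling is more often paired with bidirectional attention.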

Abstract

Large language models (LLMs) with decoder-only architectures demonstrate remarkable in-context learning (ICL) capabilities. This feature enables them to effectively handle both familiar and novel tasks by utilizing examples provided within their input context. Recognizing the potential of this capability, we propose leveraging the ICL feature in LLMs to enhance the process of text embedding generation. To this end, we introduce a novel model bge-en-icl, which employs few-shot examples to produce high-quality text embeddings. Our approach integrates task-related examples directly into the query side, resulting in significant improvements across various tasks. Additionally, we have investigated how to effectively utilize LLMs as embedding models, including various attention mechanisms, pooling methods, etc. Our findings suggest that retaining the original framework often yields the best…
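The idea of integrating task-related examples into the query side can be sketched as plain string construction. The code below is a minimal illustration under stated assumptions: the `<instruct>`, `<query>`, and `<response>` markers, the blank-line separator, and the function names are hypothetical, since this summary does not give the template that bge-en-icl uses. Passages are left unchanged, so only the query carries the demonstrations.

```python
from typing import List, Tuple

# Hypothetical markers: the template actually used by the model
# is not specified in this summary.
INSTRUCT_TAG = "<instruct>"
QUERY_TAG = "<query>"
RESPONSE_TAG = "<response>"


def format_example(task: str, query: str, response: str) -> str:
    """Render one few-shot demonstration for the given task."""
    return f"{INSTRUCT_TAG}{task}\n{QUERY_TAG}{query}\n{RESPONSE_TAG}{response}"


def build_icl_query(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Prepend (query, response) demonstrations to the query to be embedded.

    With an empty `examples` list this reduces to a zero-shot,
    instruction-only query.
    """
    shots = [format_example(task, q, r) for q, r in examples]
    target = f"{INSTRUCT_TAG}{task}\n{QUERY_TAG}{query}"
    return "\n\n".join(shots + [target])


if __name__ == "__main__":
    task = "Given a question, retrieve passages that answer it."
    examples = [
        ("what is a virus?", "A virus is an infectious agent that replicates inside cells."),
        ("what is a neuron?", "A neuron is a cell that transmits nerve impulses."),
    ]
    print(build_icl_query(task, examples, "how do vaccines work?"))
```

The resulting string would then be tokenized and passed through the embedding model like any other query. Because every demonstration adds tokens, the query grows with the number of examples, which is the source of the extra embedding cost that one of the reviews raises.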

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

This is a very straightforward paper; it is easy to read and presents compelling, though not jaw-dropping, results.

Weaknesses

- It would be nice to see an experiment on how the number of few-shot examples impacts performance.
- The discussion about overlap between training data and MTEB was a bit difficult to follow.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- This method does appear to slightly improve the performance of text embedding models even at the 7B scale.
- Novelty: In-context learning has been shown to be effective for language models in many scenarios; as far as I'm aware, this is the first work to explore in-context learning at the 7B scale.
- The paper makes a number of other empirical contributions, analyzing factors such as bidirectional attention and pooling as well as details of whether instructions should be added to passages or q

Weaknesses

- In-context examples have been shown to be most useful when a language model needs to learn a template or format for doing a task; in many cases, this is their *only* contribution (as opposed to actually teaching semantic information about the task at hand). This is not useful for embedding models because embedding models always have the same task format (outputting an embedding).
- A major weakness is that this obviously makes the embedding process slower, and it's not clear by quite how much

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Simplicity and intuitiveness of the technique.
- Strong and extensive empirical results.
- Thorough ablations. I especially liked the ablation from Section 4.4, which shows that the simple ICL setup outperforms other architectural choices. Such ablations are necessary for disentangling the performance gains due to data and architectural changes.

Weaknesses

**The submission is not following the ICLR template because line numbers are missing**

My main concern regarding the paper is that some key implementation details are missing (See Questions). Given the extra page limit and verbose Section 3.1 and Figure 2, I think the authors could have done a better job in covering those details. Other than that, NV-Embed2 results can also be included in the revision.

Other comments about writing:
- The right two subfigures in Figure 2 are not really req

Code & Models

Repositories

flagopen/flagembedding
PyTorch · Official

Datasets

hanhainebula/bge-multilingual-gemma2-data
dataset · 59 downloads

Taxonomy

Topics

Innovative Teaching and Learning Methods · Natural Language Processing Techniques · Student Assessment and Feedback

Methods

Softmax · Attention Is All You Need