Repetition Improves Language Model Embeddings
Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan

TL;DR
This paper introduces 'echo embeddings', a method that transforms autoregressive language models into high-quality text embedding models without architectural changes or fine-tuning, by simply repeating inputs.
Contribution
The paper proposes 'echo embeddings', a technique that lets autoregressive models produce strong text embeddings without modifying their architecture or requiring additional training.
Findings
Echo embeddings outperform classical LM embeddings by over 5% in zero-shot settings.
In the zero-shot setting, they nearly match embeddings from bidirectionally converted LMs that undergo additional masked-language-modeling training.
Echo embeddings perform well in supervised fine-tuning, matching or surpassing bidirectional models.
Abstract
Bidirectional models are considered essential for strong text embeddings. Recent approaches to adapt autoregressive language models (LMs) into strong text embedding models have largely required modifying the LM architecture to be bidirectional. We challenge this premise by introducing "echo embeddings", which convert autoregressive LMs into high-quality text embedding models without changing the architecture or requiring fine-tuning. By repeating the input and extracting embeddings from the repeated tokens -- which have access to all original tokens -- echo embeddings improve over classical LM embeddings by over 5% in zero-shot settings. Our zero-shot embeddings nearly match those obtained by bidirectionally-converted LMs that undergo additional masked-language modeling training. Echo embeddings are also compatible with supervised fine-tuning, matching or outperforming…
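The repeat-then-pool idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_causal_states` is a hypothetical stand-in for a causal LM (a running mean over made-up token vectors, so state i depends only on tokens 0..i), and the `prefix` and `bridge` prompt tokens are placeholders assumed for illustration. The sketch only shows why states taken from the second copy of the input can reflect every original token, while states from a single pass cannot.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def build_echo_sequence(
    tokens: Sequence[str],
    prefix: Sequence[str] = ("Rewrite:",),    # hypothetical prompt token
    bridge: Sequence[str] = ("Rewritten:",),  # hypothetical prompt token
) -> Tuple[List[str], int, int]:
    """Repeat the input; return the sequence and the [start, end) span of the second copy."""
    seq = list(prefix) + list(tokens) + list(bridge) + list(tokens)
    start = len(prefix) + len(tokens) + len(bridge)
    return seq, start, start + len(tokens)


def token_vector(tok: str) -> Vector:
    """Made-up deterministic 2-d token vector (illustration only)."""
    return [float(len(tok)), float(sum(ord(c) for c in tok) % 97)]


def toy_causal_states(seq: Sequence[str]) -> List[Vector]:
    """Toy causal encoder: state i is the mean of token vectors 0..i."""
    states: List[Vector] = []
    running = [0.0, 0.0]
    for i, tok in enumerate(seq, start=1):
        running = [r + v for r, v in zip(running, token_vector(tok))]
        states.append([r / i for r in running])
    return states


def mean_pool(states: Sequence[Vector], start: int, end: int) -> Vector:
    """Average the states in [start, end)."""
    span = states[start:end]
    return [sum(s[d] for s in span) / len(span) for d in range(len(span[0]))]


def classical_embedding(tokens: Sequence[str]) -> Vector:
    """Pool over a single pass; early states cannot see later tokens."""
    return mean_pool(toy_causal_states(tokens), 0, len(tokens))


def echo_embedding(tokens: Sequence[str]) -> Vector:
    """Pool only over the second copy, whose states see the full first copy."""
    seq, start, end = build_echo_sequence(tokens)
    return mean_pool(toy_causal_states(seq), start, end)


if __name__ == "__main__":
    a = "the movie was good".split()
    b = "the movie was bad".split()
    # Single pass: the first-token state is identical for both sentences.
    print(toy_causal_states(a)[0] == toy_causal_states(b)[0])
    # Echo: the pooled embeddings reflect the differing final token.
    print(echo_embedding(a), echo_embedding(b))
```

With a real autoregressive LM, the same span indices would be used to select hidden states from the repeated tokens; the pooling strategy and prompt wording are design choices this sketch does not settle.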
Peer Reviews
Decision: ICLR 2025 Poster
1. The echo embedding method is both easy and effective. While previous studies have demonstrated that repetition is beneficial for reasoning tasks and recurrent language models, this paper shows that it is also effective for causal language model embedding. 2. The paper is clearly written and easy to understand. 3. The use of a simple synthetic dataset to analyze why causal attention might inhibit embeddings from reliably capturing information across the entire context is interesting.
The echo embedding method will inevitably double the input length. Although experiments show that reducing the input length and training steps by half still yields good results, this approach may not be suitable in cases where important information is located in the latter half of the input context. For example, the S2 (Early redundant; late discriminatory) cases described in Section 3.1 of the paper. Additionally, because self-attention has a computational complexity of O(n^2) with respect to i
1. I appreciate the toy experiment, which clearly supports their claim about the limitation of classical embeddings and the advantages of echo embeddings. 2. The results on the MTEB dataset show clear improvements over classical embedding extraction settings, achieving comparable results with LLM2Vec, which needs backbone changes and unsupervised finetuning. 3. The method itself is very simple and insightful, requiring no changes to the backbone.
1. The setup of the most relevant baseline, PromptEOL, does not seem to exactly match that of the original paper. The PromptEOL results appear significantly different from those reported there. In the original study, PromptEOL achieved an average score of 72.10 across seven STS tasks using the OPT-6.7B model. However, in your paper, PromptEOL only obtains an average of 67.14 on ten STS tasks. I didn't expect such a big performance discrepancy. Is this because of the thr
1. The proposed method is simple: it repeats the input sentence twice to obtain the text embeddings. 2. The toy example design is interesting. 3. The results in zero-shot settings are impressive.
1. The motivation regarding causal attention seems questionable. Although LLM2Vec utilizes causal attention, it still performs exceptionally well in extracting text embeddings. 2. After fine-tuning the model, the performance gap between "echo embedding" and other models is minor. However, "echo embedding" requires the input sentence to be repeated twice, increasing computational costs. This limitation confines the proposed method to zero-shot settings only. 3. At least one illustrative example
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
