ACER: Automatic Language Model Context Extension via Retrieval
Luyu Gao, Yunyi Zhang, Jamie Callan

TL;DR
This paper introduces ACER, a method that extends language models' long-context understanding through retrieval-based data synthesis and self-tuning of short-context models. It reports outperforming generalist long-context models on long-context tasks.
Contribution
ACER proposes an automatic data synthesis pipeline, inspired by how humans retrieve and then closely read information, that gives language models task-specific long-context capabilities without extensive task-specific data.
Findings
ACER outperforms generalist long-context models in retrieval tasks.
Synthetic data improves long-context reasoning abilities.
Self-tuning of short-context models enhances performance in complex tasks.
Abstract
Long-context modeling is one of the critical capabilities of language AI for digesting and reasoning over complex information pieces. In practice, long-context capabilities are typically built into a pre-trained language model (LM) through a carefully designed context extension stage, with the goal of producing generalist long-context capabilities. In our preliminary experiments, however, we discovered that the current open-weight generalist long-context models are still lacking in practical long-context processing tasks. While this means perfectly effective long-context modeling demands task-specific data, the cost can be prohibitive. In this paper, we draw inspiration from how humans process a large body of information: a lossy retrieval stage ranks a large set of documents while the reader ends up reading deeply only the top candidates. We build an automatic data…
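The rank-then-read idea described in the abstract can be illustrated with a toy sketch. This is not the paper's pipeline: the term-overlap scoring, the function names (`rank_documents`, `build_reader_context`), and the `top_k` parameter are all assumptions made for illustration. A cheap, lossy scorer orders the documents, and only the top candidates are passed on to a reader.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Lossy retrieval stage (assumed scorer): rank documents by a
    TF-IDF-style overlap with the query. Returns indices, best first."""
    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    query_terms = set(tokenize(query))
    scored = []
    for idx, counts in enumerate(doc_counts):
        score = 0.0
        for term in query_terms:
            df = sum(1 for c in doc_counts if term in c)
            if df:
                # Smoothed inverse document frequency; always positive.
                idf = math.log((1 + n_docs) / (1 + df)) + 1.0
                score += counts[term] * idf
        scored.append((score, idx))
    # Highest score first; ties fall back to the original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


def build_reader_context(query, documents, top_k=2):
    """Reading stage input: keep only the top_k ranked documents and
    append the question, forming the text a reader model would see."""
    ranked = rank_documents(query, documents)
    top_docs = [documents[i] for i in ranked[:top_k]]
    return "\n\n".join(top_docs) + "\n\nQuestion: " + query


if __name__ == "__main__":
    docs = [
        "The capital of France is Paris.",
        "Bananas are yellow.",
        "Paris hosts the Louvre museum.",
    ]
    print(build_reader_context("What is the capital of France?", docs))
```

In a real system the scorer would be a proper retriever and the context would be fed to a language model. The sketch only shows how a lossy ranking narrows a large document set to the few passages that are read closely.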
Peer Reviews
Decision: Submitted to ICLR 2025
- The topic of long-context modeling is both compelling and critical, and this paper provides valuable new insights into addressing this task.
- The proposed method is well-conceived and alleviates the need for extensive resources for human-annotated data.
- The approach demonstrates the potential for practical application, making it a meaningful contribution to long-context modeling research.
- My primary concern with this paper is the limited evaluation. The experiments provide only a narrow comparison of long-context benchmarks, such as Infinibench [1] and LongBench [2]. Additionally, the paper omits several important approaches mentioned in [3], such as self-extend [4] and lm-infinite [5], and lacks a comparative analysis or discussion of these methods, which would strengthen the evaluation.
- There is also a lack of case studies and in-depth analysis of the model’s long-co
1. The paper is well-written.
2. The methodology is clear and effective.
1. The evaluation was conducted solely on long-context RAG tasks, where improvement is natural given the methodology. However, it was not assessed on more general long-context evaluation sets, such as LV-Eval and Needle in a Haystack.
2. The approach seems to be too simplistic and straightforward, lacking innovation and contribution.
3. Experiments are conducted on only one size and one type of language model.
An effective pipeline for automatically generating RAG training data, achieving significant performance improvements even with just an 8B model as a data generator. The scores for downstream RAG tasks are also impressive.
1. Most comparisons in the experiments are with Long-Context Models. I believe additional RAG strategies should be included for comparison.
2. It would be helpful to see some comparative data statistics, such as a table showing Big Context length and CoT Answer length from Figure 1, as well as length comparisons in the experimental section.
3. Figure 1 needs a higher-resolution version. Using a PDF image is recommended.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
