RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference
Lianming Huang, Shangyu Wu, Yufei Cui, Ying Xiong, Haibo Hu, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

TL;DR
RAEE introduces a retrieval-augmented framework for early exit in large language models, improving inference efficiency and robustness by leveraging similar data for better exit decisions without significant training overhead.
Contribution
The paper proposes RAEE, a novel retrieval-augmented early exit method that enhances model performance and robustness during inference without extensive retraining.
Findings
Accelerates inference across multiple tasks
Maintains robust zero-shot performance
Reduces training overhead for early exit methods
Abstract
Deploying large language model inference remains challenging due to their high computational overhead. Early exit optimizes model inference by adaptively reducing the number of inference layers. Current methods typically train internal classifiers or use heuristic methods to determine the exit layer. However, those methods either introduce significant training overheads or lead to performance degradation. To address these limitations, this paper proposes RAEE, a robust Retrieval-Augmented Early Exit framework that not only enables early exit but also enhances model performance through corrective exit information at intermediate layers. This paper first demonstrates that the early exit problem can be effectively modeled as a distribution prediction problem, in which the distribution can be further approximated through the exit information of similar data. Subsequently, this paper…
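The idea in the abstract, approximating an exit-layer distribution from the exit information of similar data, can be illustrated with a small sketch. This is not the paper's algorithm: the cosine similarity, the similarity-weighted vote, the tie-break toward earlier layers, and every name here (`build_database`, `exit_distribution`, `predict_exit_layer`, `default_layer`) are assumptions made for illustration. The toy embeddings and exit layers are invented.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_database(embeddings, exit_layers):
    """Pair each stored embedding with a recorded exit layer.

    Assumption: the exit layer of each stored example was determined offline,
    e.g. from labeled data; how it is determined is not shown here.
    """
    return list(zip(embeddings, exit_layers))


def exit_distribution(query, database, k=3):
    """Approximate a distribution over exit layers from the k most similar entries."""
    scored = sorted(
        ((cosine(query, emb), layer) for emb, layer in database),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weights = defaultdict(float)
    for similarity, layer in scored:
        # Hypothetical weighting: each neighbor votes with its (non-negative) similarity.
        weights[layer] += max(similarity, 0.0)
    total = sum(weights.values())
    if total == 0:
        return {}
    return {layer: weight / total for layer, weight in weights.items()}


def predict_exit_layer(query, database, k=3, default_layer=12):
    """Pick the most probable exit layer; fall back to default_layer if nothing is retrieved."""
    dist = exit_distribution(query, database, k)
    if not dist:
        return default_layer
    # Ties are broken in favor of the earlier layer.
    return max(dist, key=lambda layer: (dist[layer], -layer))


# Toy database: two examples that exit at layer 4, two at layer 10.
database = build_database(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    [4, 4, 10, 10],
)
print(predict_exit_layer([1.0, 0.05], database, k=3))
```

In this sketch the only per-query work before running the backbone is a nearest-neighbor lookup, so no classifier is trained; the cost moves to building and storing the database.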
Peer Reviews
Decision·ICLR 2026 Poster
1. The method improves model efficiency through early exiting, where the exit decision is based on an analysis of similar, previously seen samples, which enables training-free exiting. 2. The method uses data characteristics such as word embeddings to map the incoming sample to the existing database, then directly assigns the exit layer based on the database samples that the incoming sample most closely resembles.
1. The authors claim that the method is zero-shot, yet they use the training samples to create the database, which makes the method supervised; this claim makes the proposed method confusing. 2. The method cannot be easily generalised to tasks other than classification, which is a major issue with this work. 3. Why are the baselines evaluated with different backbones? This makes the comparison unfair; all methods should be tested on the same underlying backbone model. For inst
1. Originality: Proposes a novel perspective—early exit as a corrective mechanism—and implements it via retrieval augmentation, avoiding the need for trainable classifiers. 2. Quality: Comprehensive experiments across 8 tasks and 4 model families; includes ablation on $k$, database size, and OOD generalization. 3. Clarity: Figures 1–2 and Algorithms C.1–C.2 make the method transparent. The “correct ratio” analysis (Figure 1b) is particularly compelling. 4. Significance: Offers a practical, zero-
1. Dependency on in-distribution labeled data: RAEE requires access to the **training set with ground-truth labels** to build the database. This limits applicability in zero-shot or unsupervised settings. The paper acknowledges this but does not explore alternatives (e.g., using pseudo-labels or confidence-based proxies). 2. Inconsistent gains across tasks: Performance improvements vary widely (e.g., CoLA: +12.45%; SST-2: +1.03% with RoBERTa). The paper does not analyze **why**—e.g., whether tas
1. Using $k$NN-based retrieval method for early exiting is novel. 2. This paper is simple, effective, and intuitive, even without additional technical contributions. 3. Experimental results are good.
1. The paper lacks a visualization of early-exit layers. What proportion of data exits early at each layer for each dataset? 2. Why does your method introduce additional time overhead on T5 in Figure 3? 3. Is the proposed method effective for supervised methods? For example, for a T5 model fine-tuned on SST-2. 4. Section 4.5 may not be very solid. Could you provide the performance when using a fixed layer, for example, the six values from layers 27 to 32?
Taxonomy
Topics: Topic Modeling · Machine Learning and Algorithms · Natural Language Processing Techniques
Methods: Early exiting using confidence measures
