Two-dimensional early exit optimisation of LLM inference
Jan Hůla, David Adamczyk, Tomáš Filip, Martin Pavlíček, Petr Sosík

TL;DR
This paper proposes a two-dimensional early exit method for classification with large language models, combining layer-wise and sentence-wise exiting to reduce inference computation beyond what either strategy achieves alone.
Contribution
The paper introduces a 2D early exit approach that coordinates layer-wise and sentence-wise exits, yielding larger computational savings than layer-wise early exit alone across four LLMs and three sentiment classification datasets.
Findings
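Achieves additional speed-ups of 1.4–2.3× over optimal layer-wise early exit on simpler tasks with vanilla models.
Evaluated on four LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B–8B parameters) and three sentiment classification datasets.
Gains degrade gracefully on complex multi-class problems.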
Abstract
We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently. Experimental evaluation across four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B–8B parameters) on three sentiment classification datasets demonstrates additional speed-ups of 1.4–2.3× over optimal layer-wise early exit for simpler tasks with vanilla models, with graceful degradation on complex multi-class problems. Fine-tuning reduces but does not eliminate this advantage. The approach is model-agnostic, requires only lightweight classification adapters, and is orthogonal to…
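The idea of exiting along both dimensions can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the coordination rule, so the confidence threshold, the `layers_per_step` schedule, the `probe` callback (standing in for a lightweight classification adapter), and the toy confidence function are all hypothetical. Cost is counted in token-layer units, assuming cached activations so that each token passes through each layer at most once.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical adapter interface: (sentences_read, layer_index) -> class probabilities.
ProbeFn = Callable[[int, int], Sequence[float]]


def two_d_early_exit(
    sentence_lengths: List[int],
    num_layers: int,
    probe: ProbeFn,
    threshold: float = 0.9,    # assumed confidence threshold
    layers_per_step: int = 4,  # assumed schedule: extra layers unlocked per sentence
) -> Tuple[int, int, int, int]:
    """Return (predicted_class, sentences_used, layers_used, cost).

    With cached activations, every token read so far has passed through every
    active layer exactly once, so cost = tokens_read * layers_active.
    """
    tokens, depth, pred = 0, 0, -1
    for s, length in enumerate(sentence_lengths, start=1):
        tokens += length  # read one more sentence
        target = min(num_layers, depth + layers_per_step)
        while True:
            if depth > 0:
                probs = probe(s, depth)
                pred = max(range(len(probs)), key=lambda i: probs[i])
                if probs[pred] >= threshold:
                    # Confident enough: exit in both dimensions at once.
                    return pred, s, depth, tokens * depth
            if depth >= target:
                break
            depth += 1  # activate one deeper layer

    # No confident exit: fall back to the full model on the full input.
    if depth < num_layers or pred < 0:
        depth = num_layers
        probs = probe(len(sentence_lengths), depth)
        pred = max(range(len(probs)), key=lambda i: probs[i])
    return pred, len(sentence_lengths), depth, tokens * depth


def toy_probe(sentences_read: int, layer: int) -> List[float]:
    """Made-up confidence that grows with both input length and depth."""
    conf = min(0.99, (50 + 5 * sentences_read + 2 * layer) / 100)
    return [1.0 - conf, conf]


if __name__ == "__main__":
    lengths = [20] * 5  # five sentences of 20 tokens each
    pred, used_s, used_l, cost = two_d_early_exit(lengths, 32, toy_probe)
    full_cost = sum(lengths) * 32
    print(pred, used_s, used_l, cost, cost / full_cost)
```

Because the cost in this accounting is the product of tokens read and layers activated, the fraction of full-model cost factors into (fraction of tokens) × (fraction of layers), which is one way to read the "multiplicative" savings the abstract describes.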
