Dynamic Rebatching for Efficient Early-Exit Inference with DREX
Xuting Liu, Daniel Alexander, Siva Kesava Reddy Kakarla, Behnaz Arzani, Vincent Liu

TL;DR
This paper introduces DREX, a dynamic rebatching system for early-exit large language models that improves inference throughput and guarantees output quality by reorganizing requests at each exit point.
Contribution
The paper proposes Dynamic Rebatching with a copy-free buffer and an EE-aware scheduler, enabling efficient, quality-preserving early-exit inference in LLMs.
Findings
DREX improves throughput by 2-12% over baseline methods.
DREX eliminates involuntary exits, ensuring output quality.
DREX handles missing KV cache entries with memory-efficient state-copying.
Abstract
Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model's layers. However, traditional batching frameworks are ill-suited for EE LLMs, as not all requests in a batch may be ready to exit at the same time. Existing solutions either force a uniform decision on the batch, which overlooks EE opportunities, or degrade output quality by forcing premature exits. We propose Dynamic Rebatching, a solution where we dynamically reorganize the batch at each early-exit point. Requests that meet the exit criteria are immediately processed, while those that continue are held in a buffer, re-grouped into a new batch, and forwarded to deeper layers. We introduce DREX, an early-exit inference system that implements Dynamic Rebatching with two key optimizations: 1) a copy-free rebatching buffer that…
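The rebatching loop described in the abstract can be illustrated with a minimal Python sketch. It is not the DREX implementation. The names `Request`, `confident_at`, and `dynamic_rebatch` are hypothetical, the exit criterion is reduced to a precomputed layer index per request, and the actual layer computation is omitted. It only shows the control flow: at each exit point, requests that meet the criterion leave, and the rest wait in a buffer until they are regrouped and sent to deeper layers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    # Hypothetical stand-in for the exit criterion: the shallowest layer
    # at which this request's token is confident enough to be emitted.
    confident_at: int


def dynamic_rebatch(requests, exit_points, num_layers, batch_size):
    """Return {req_id: layer at which the request's token was emitted}."""
    buffers = defaultdict(list)      # layer index -> requests waiting to go deeper
    buffers[0] = list(requests)      # every request starts before layer 0
    finished = {}
    boundaries = [0] + sorted(exit_points) + [num_layers]

    for start, end in zip(boundaries, boundaries[1:]):
        waiting = buffers.pop(start, [])
        # Regroup the buffered requests into fresh batches.
        for i in range(0, len(waiting), batch_size):
            batch = waiting[i:i + batch_size]
            # A real system would run layers [start, end) on `batch` here.
            for req in batch:
                if end == num_layers or req.confident_at <= end:
                    finished[req.req_id] = end   # per-request exit decision
                else:
                    buffers[end].append(req)     # held for deeper layers
    return finished


if __name__ == "__main__":
    reqs = [Request(0, 8), Request(1, 16), Request(2, 32)]
    print(dynamic_rebatch(reqs, exit_points=[8, 16], num_layers=32, batch_size=4))
```

Because each request is decided individually, no request in this sketch leaves before its own criterion is met and none is forced to continue because of its batch mates, which is the property the abstract contrasts with uniform batch-level decisions.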
Taxonomy
Topics: Machine Learning in Materials Science · Software System Performance and Reliability · Parallel Computing and Optimization Techniques
