Dynamic Rebatching for Efficient Early-Exit Inference with DREX

Xuting Liu; Daniel Alexander; Siva Kesava Reddy Kakarla; Behnaz Arzani; Vincent Liu

arXiv:2512.15705 · cs.DC · December 18, 2025

TL;DR

This paper introduces DREX, a dynamic rebatching system for early-exit large language models that improves inference throughput and guarantees output quality by reorganizing requests at each exit point.

Contribution

The paper proposes Dynamic Rebatching with a copy-free buffer and an EE-aware scheduler, enabling efficient, quality-preserving early-exit inference in LLMs.

Findings

1. DREX improves throughput by 2-12% over baseline methods.

2. DREX eliminates involuntary exits, ensuring output quality.

3. DREX handles missing KV cache entries efficiently through memory-efficient state-copying.

Abstract

Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model's layers. However, traditional batching frameworks are ill-suited for EE LLMs, as not all requests in a batch may be ready to exit at the same time. Existing solutions either force a uniform decision on the batch, which overlooks EE opportunities, or degrade output quality by forcing premature exits. We propose Dynamic Rebatching, a solution where we dynamically reorganize the batch at each early-exit point. Requests that meet the exit criteria are immediately processed, while those that continue are held in a buffer, re-grouped into a new batch, and forwarded to deeper layers. We introduce DREX, an early-exit inference system that implements Dynamic Rebatching with two key optimizations: 1) a copy-free rebatching buffer that…
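
The rebatching step described in the abstract can be illustrated with a toy sketch. It is not the DREX implementation: `Request`, `RebatchBuffer`, `split_at_exit`, `rebatch_step`, the `confidence` score, and the fixed `threshold` are all hypothetical names chosen for illustration. The sketch copies Python lists freely, so it does not model the copy-free buffer or the EE-aware scheduler mentioned in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    rid: int
    confidence: float  # hypothetical exit-classifier score at the exit point


@dataclass
class RebatchBuffer:
    """Holds requests that must continue to deeper layers (illustrative only)."""
    batch_size: int
    pending: list = field(default_factory=list)

    def add(self, requests):
        self.pending.extend(requests)

    def pop_full_batches(self):
        # Re-group buffered requests into full batches for the deeper layers.
        batches = []
        while len(self.pending) >= self.batch_size:
            batches.append(self.pending[:self.batch_size])
            self.pending = self.pending[self.batch_size:]
        return batches


def split_at_exit(batch, threshold):
    # Each request decides for itself; no uniform decision is forced on the batch.
    exiting = [r for r in batch if r.confidence >= threshold]
    continuing = [r for r in batch if r.confidence < threshold]
    return exiting, continuing


def rebatch_step(batch, buffer, threshold):
    # Exiting requests are returned immediately; the rest wait in the buffer
    # until enough of them accumulate to form a new batch.
    exiting, continuing = split_at_exit(batch, threshold)
    buffer.add(continuing)
    return exiting, buffer.pop_full_batches()


buffer = RebatchBuffer(batch_size=2)
batch_a = [Request(0, 0.95), Request(1, 0.40), Request(2, 0.99)]
batch_b = [Request(3, 0.10), Request(4, 0.97), Request(5, 0.30)]
exited_a, deep_a = rebatch_step(batch_a, buffer, threshold=0.9)
exited_b, deep_b = rebatch_step(batch_b, buffer, threshold=0.9)
```

In this toy run, requests 0, 2, and 4 exit early, requests 1 and 3 are regrouped into one batch for the deeper layers, and request 5 stays buffered until more continuing requests arrive. No request is forced to exit because its batch-mates did, which is the property the abstract contrasts with uniform batch-level decisions.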

Taxonomy

Topics: Machine Learning in Materials Science · Software System Performance and Reliability · Parallel Computing and Optimization Techniques