TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference

Jaber Jaber; Osama Jaber

arXiv:2603.21365 · cs.LG · March 24, 2026

Open Access

TL;DR

TIDE is a post-training method that attaches lightweight learned routers to a large language model so that each token can exit at the earliest layer where its hidden state has converged, cutting inference latency and raising throughput without retraining the model.

Contribution

The paper presents a model-agnostic, post-training system for per-token early exit in LLMs that reduces latency and increases throughput with minimal additional code.

Findings

01

Reduces prefill latency by 7.2% and increases single-batch throughput by 6.6% on an NVIDIA A100 with DeepSeek R1 Distill 8B.

02

98-99% of tokens exit early during autoregressive decoding.

03

Calibrates quickly, producing a compact router checkpoint in under 3 minutes.

Abstract

Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single-batch throughput by 6.6%. During autoregressive decoding, 98-99% of tokens exit early while the model correctly solves a multi-step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at…
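The exit rule described in the abstract (choose the earliest checkpoint layer whose hidden state has converged) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_exit_layer`, the 0.5 threshold, the checkpoint layers 19 and 27, and the reading of a router output as a single convergence score per checkpoint are all assumptions made for illustration. Only layers 11 and 31 appear in the abstract.

```python
from typing import Sequence


def select_exit_layer(
    router_scores: Sequence[float],
    checkpoint_layers: Sequence[int],
    final_layer: int,
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> int:
    """Return the earliest checkpoint layer whose score clears the threshold.

    router_scores[i] is an assumed per-token convergence score produced by the
    router attached at checkpoint_layers[i]. If no checkpoint qualifies, the
    token runs to final_layer.
    """
    if len(router_scores) != len(checkpoint_layers):
        raise ValueError("need exactly one router score per checkpoint layer")
    # Visit checkpoints from shallowest to deepest so the first hit is the earliest exit.
    for layer, score in sorted(zip(checkpoint_layers, router_scores)):
        if score >= threshold:
            return layer
    return final_layer
```

With hypothetical checkpoints `[11, 19, 27]` and `final_layer=31`, a token scored `[0.1, 0.7, 0.9]` would exit at layer 19, while a token scored `[0.1, 0.2, 0.3]` would run to layer 31. A real system would also have to handle the key-value cache entries of the skipped layers, which this sketch ignores.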

Taxonomy

Topics: Network Packet Processing and Optimization · Software-Defined Networks and 5G · Natural Language Processing Techniques