TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
Jaber Jaber, Osama Jaber

TL;DR
TIDE is a post-training method that attaches lightweight routers to a large language model so that each token can exit early during inference, improving efficiency without retraining the model.
Contribution
The paper presents a model-agnostic, post-training system for per-token early exit in LLMs that reduces latency and increases throughput with minimal additional code.
Findings
Reduces prefill latency by 7.2% and increases single-batch throughput by 6.6% on an NVIDIA A100 with DeepSeek R1 Distill 8B.
98-99% of tokens exit early during autoregressive decoding.
Calibrates quickly, producing a compact router checkpoint in under 3 minutes.
Abstract
Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single-batch throughput by 6.6%. During autoregressive decoding, 98-99% of tokens exit early while the model correctly solves a multi-step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at…
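The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Router` class, the `forward_with_early_exit` function, the 0.5 threshold, and the toy layers are all hypothetical names and values chosen for illustration. The sketch assumes a router is a small linear probe with a sigmoid output that scores whether a token's hidden state has converged, and that routers sit only at selected checkpoint layers.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


class Router:
    """Hypothetical tiny router: sigmoid(w . h + b) scores convergence of h."""

    def __init__(self, weights: Vector, bias: float) -> None:
        self.weights = list(weights)
        self.bias = bias

    def score(self, hidden: Vector) -> float:
        z = sum(w * x for w, x in zip(self.weights, hidden)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))


def forward_with_early_exit(
    hidden: Vector,
    layers: List[Layer],
    routers: Dict[int, Router],
    threshold: float = 0.5,  # assumed exit threshold, not taken from the paper
) -> Tuple[Vector, int]:
    """Run one token through the layers, stopping at the first checkpoint
    layer whose router score reaches the threshold.

    Returns the hidden state at the exit point and the index of that layer.
    """
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        router = routers.get(idx)  # only checkpoint layers have a router
        if router is not None and router.score(hidden) >= threshold:
            return hidden, idx
    return hidden, len(layers) - 1  # no router fired: full depth


# Toy setup: 12 layers that each add 1.0 to the first component,
# with routers attached at checkpoint layers 3 and 7.
layers: List[Layer] = [lambda h: [h[0] + 1.0, h[1]] for _ in range(12)]
routers = {3: Router([1.0, 0.0], -4.5), 7: Router([1.0, 0.0], -4.5)}

# Three tokens of increasing "difficulty" (lower starting value).
tokens = [[1.0, 0.0], [-2.0, 0.0], [-10.0, 0.0]]
exit_layers = [forward_with_early_exit(t, layers, routers)[1] for t in tokens]
print(exit_layers)  # [3, 7, 11]
```

In this toy run the first token exits at the first checkpoint, the second at the next checkpoint, and the third runs through every layer. A real system would route batched tensors on the GPU and train the router weights during calibration; this sketch handles one token at a time with hand-set weights.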
Taxonomy
Topics: Network Packet Processing and Optimization · Software-Defined Networks and 5G · Natural Language Processing Techniques
