N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
Aleksander Lorenc, Frédéric Berdoz, Joël Mathys, Roger Wattenhofer

TL;DR
N-vium is a mixture-of-exits transformer that speeds up inference by partially parallelizing computation across depth, achieving wall-clock speedups without sacrificing model quality.
Contribution
It proposes a novel mixture-of-exits architecture with token-adaptive routing, enabling faster inference while maintaining exact sampling and model accuracy.
Findings
The largest model achieves a 57.9% speedup over a standard transformer.
Models are pretrained at scales up to 1.5B parameters with no perplexity loss.
Sampling from the mixture is exact, and complete KV caches are recovered.
Abstract
Improving the inference efficiency of autoregressive transformers typically means reducing FLOPs per token, usually through approximations that degrade model quality. We introduce N-vium, a mixture-of-exits transformer that partially parallelizes computation across depth on standard hardware, increasing effective FLOPs per second rather than minimizing compute per token. N-vium attaches prediction heads at multiple depths and defines the next-token distribution as a learned mixture over these exits, with token-adaptive routing. This formulation strictly generalizes the standard transformer, which is recovered exactly when routing assigns zero mass to all intermediate heads. Sampling from the mixture is exact, and complete KV caches are recovered by deferring the upper-layer computation and batching it with later tokens. We pretrain N-vium at scales up to 1.5B parameters. Our largest…
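To make the mixture formulation in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each exit head produces a vector of logits and that the router supplies one probability per exit for the current token. The function names (softmax, mixture_distribution, sample_from_mixture) and the toy logits are hypothetical. The sketch shows two generic facts about mixtures: putting all routing mass on the final exit reproduces the final head's distribution exactly, and drawing an exit first and then a token from that exit is an exact sample from the mixture. Whether N-vium samples in precisely this two-stage way is an assumption here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_distribution(exit_logits, routing_weights):
    """Next-token distribution p(v) = sum_k w_k * softmax(exit_logits[k])[v].

    exit_logits: one logit vector per exit (the last entry is the final head).
    routing_weights: per-token probabilities over exits (assumed to sum to 1).
    """
    heads = [softmax(logits) for logits in exit_logits]
    vocab_size = len(heads[0])
    return [
        sum(w * head[v] for w, head in zip(routing_weights, heads))
        for v in range(vocab_size)
    ]


def sample_from_mixture(exit_logits, routing_weights, rng):
    """Two-stage (ancestral) sampling: pick an exit, then a token from it.

    The returned token is distributed exactly according to
    mixture_distribution(exit_logits, routing_weights).
    """
    exit_index = rng.choices(range(len(routing_weights)), weights=routing_weights)[0]
    probs = softmax(exit_logits[exit_index])
    token = rng.choices(range(len(probs)), weights=probs)[0]
    return exit_index, token


# Toy example: three exits over a four-token vocabulary (made-up numbers).
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],  # shallow exit
    [1.0, 1.5, 0.0, -0.5],  # middle exit
    [0.2, 2.5, 0.3, 0.0],   # final head
]
toy_routing = [0.5, 0.3, 0.2]

mixed = mixture_distribution(toy_logits, toy_routing)
exit_index, token = sample_from_mixture(toy_logits, toy_routing, random.Random(0))
```

With routing weights of [0.0, 0.0, 1.0], mixture_distribution returns the final head's softmax unchanged, which mirrors the abstract's statement that the standard transformer is recovered when intermediate heads receive zero mass.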
