TL;DR
MARS is a lightweight fine-tuning method that enables autoregressive models to predict multiple tokens per step, improving throughput and maintaining accuracy without architectural changes.
Contribution
MARS introduces a simple fine-tuning approach that lets AR models generate multiple tokens per step without losing accuracy while increasing inference speed.
Findings
MARS matches or exceeds baseline accuracy on six benchmarks.
MARS achieves 1.5-1.7x throughput with multi-token generation.
The authors develop a block-level KV caching strategy for faster batch inference.
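To make the multi-token finding concrete, here is a minimal sketch of how accepting several tokens per forward pass raises the number of tokens produced per step. The acceptance rule (keep the longest leading run of predictions whose confidence clears a threshold), the function names, and the confidence values are all illustrative assumptions; this summary does not state how MARS actually decides which tokens to accept.

```python
from typing import List, Sequence


def accept_prefix(confidences: Sequence[float], threshold: float) -> int:
    """Return how many leading tokens of a predicted block to accept.

    Hypothetical rule (an assumption, not taken from the paper): always keep
    the first token, then keep each following token while its confidence is
    at least `threshold`, stopping at the first one that falls below it.
    """
    if not confidences:
        raise ValueError("a block must contain at least one prediction")
    accepted = 1  # the first token is always kept, as in one-token decoding
    for conf in confidences[1:]:
        if conf < threshold:
            break
        accepted += 1
    return accepted


def tokens_per_forward_pass(blocks: List[Sequence[float]], threshold: float) -> float:
    """Average number of tokens accepted per forward pass over `blocks`."""
    total = sum(accept_prefix(block, threshold) for block in blocks)
    return total / len(blocks)


# Made-up confidences for three forward passes with a block size of 4.
example_blocks = [
    [0.99, 0.97, 0.40, 0.95],  # accepts 2 tokens
    [0.80, 0.30, 0.99, 0.99],  # accepts 1 token
    [0.99, 0.98, 0.96, 0.50],  # accepts 3 tokens
]
avg = tokens_per_forward_pass(example_blocks, threshold=0.9)
print(avg)  # 2.0
```

With a threshold above every confidence value, each pass accepts exactly one token and the sketch reduces to ordinary one-token-per-step decoding. Tokens per forward pass is only a proxy for throughput: wall-clock speedup also depends on the cost of each pass, so the figure printed here should not be read as the reported 1.5-1.7x.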
Abstract
Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it…
