OverFill: Two-Stage Models for Efficient Language Model Decoding

Woojeong Kim; Junxiong Wang; Jing Nathan Yan; Mohamed Abdelfattah; Alexander M. Rush

arXiv:2508.08446 · cs.AI · August 13, 2025

Open Access

TL;DR

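OverFill is a two-stage decoding approach for large language models: a full model handles the compute-bound prefill stage, and a dense pruned model handles the memory-bound decode stage. The extra prefill compute improves generation quality with minimal latency overhead, and the approach matches models trained from scratch while using less training data.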

Contribution

The paper proposes a two-stage model architecture that decouples the prefill and decode stages of LLM inference, so that each stage runs on a model suited to its computational profile.

Findings

01

OverFill outperforms pruned models by over 79% in accuracy.

02

OverFill matches the performance of models trained from scratch while using less training data.

03

Generation quality improves with minimal latency overhead.

Abstract

Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. LLM inference comprises prefill (compute-bound) and decode (memory-bound) stages, with decode dominating latency particularly for long sequences. Current decoder-only models handle both stages uniformly, despite their distinct computational profiles. We propose OverFill, which decouples these stages to optimize accuracy-efficiency tradeoffs. OverFill begins with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model, while generating tokens sequentially. Leveraging more compute during prefill, OverFill improves generation quality with minimal latency overhead. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while the 8B-to-3B configuration improves over 3B pruned models by…
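The two-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ToyModel`, `prefill`, `decode_step`, and `overfill_generate` are hypothetical names, the "cache" is a plain list standing in for a KV cache, and the sketch simply assumes the pruned model can read the cache produced by the full model. It only shows the control flow: one parallel pass over the prompt with the full model, then sequential token generation with the pruned model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyModel:
    """Hypothetical stand-in for a language model; it records which stage it ran."""
    name: str
    calls: List[str] = field(default_factory=list)

    def prefill(self, prompt: List[int]) -> List[int]:
        # One pass over the whole prompt (parallel in a real model).
        self.calls.append(f"prefill:{len(prompt)}")
        # Toy "cache": one entry per prompt token (assumption, not a real KV cache).
        return list(prompt)

    def decode_step(self, cache: List[int]) -> int:
        # One sequential step: read the cache, emit a token, extend the cache.
        self.calls.append("decode")
        next_token = (sum(cache) + len(cache)) % 100  # arbitrary deterministic rule
        cache.append(next_token)
        return next_token


def overfill_generate(full: ToyModel, pruned: ToyModel,
                      prompt: List[int], max_new_tokens: int) -> List[int]:
    """Prefill with the full model, then decode every new token with the pruned model."""
    cache = full.prefill(prompt)  # stage 1: full model, prompt only
    return [pruned.decode_step(cache) for _ in range(max_new_tokens)]  # stage 2


full_model = ToyModel("full")
pruned_model = ToyModel("pruned")
generated = overfill_generate(full_model, pruned_model, [1, 2, 3], max_new_tokens=4)
print(generated)
print(full_model.calls, pruned_model.calls)
```

In this sketch the full model is called exactly once, regardless of how many tokens are generated, while the pruned model is called once per generated token. That is the asymmetry the abstract relies on: the larger model's cost is paid only in the prefill stage.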

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications