PreFT: Prefill-only finetuning for efficient inference

Andrew Lanpouthakoun; Aryaman Arora; Zhengxuan Wu; Dhruv Pai; Ben Keigwin; Dan Jurafsky; Christopher Potts

arXiv:2605.14217 · cs.LG · May 15, 2026

TL;DR

PreFT is a prefill-only finetuning approach: the adapter is applied only to prefill tokens and discarded before decoding, which raises multi-adapter serving throughput in large language models with minimal performance loss.

Contribution

The paper proposes and releases PreFT, a prefill-only finetuning method that improves inference throughput for multi-adapter LLM serving, reported as 1.9x more efficient than traditional PEFTs in multi-user serving.

Findings

01. Serving multi-user PreFTs is 1.9x more efficient than traditional PEFTs.

02. PreFTs have higher evaluation loss than PEFTs on supervised tasks, but this can be mitigated by increasing rank.

03. PreFTs approach parity with PEFTs in reinforcement learning tasks.

Abstract

Large language models can now be personalised efficiently at scale using parameter-efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an…
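The prefill/decode split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a LoRA-style low-rank adapter, uses a plain mean over cached states as a stand-in for attention, and every name in it (`project`, `prefill`, `decode_step`) is hypothetical. It only shows the idea that the adapter touches prompt tokens, whose adapted states stay in the cache, while each decode step runs on base weights alone.

```python
# Toy sketch of prefill-only adaptation. All names are hypothetical, and a
# LoRA-style low-rank adapter is an assumption made for illustration.

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def project(W, x, adapter=None):
    """Base projection W @ x, plus the low-rank update B @ (A @ x) if given."""
    y = matvec(W, x)
    if adapter is not None:
        A, B = adapter  # A has shape r x d, B has shape d x r
        delta = matvec(B, matvec(A, x))
        y = [a + b for a, b in zip(y, delta)]
    return y

def prefill(W, prompt, adapter):
    """Process every prompt token with the adapter and cache the results."""
    return [project(W, x, adapter) for x in prompt]

def decode_step(W, cache, x):
    """Process one new token with base weights only, then mix with the cache.

    The mean over cached states is a stand-in for attention.
    """
    cache.append(project(W, x))  # the adapter is not used here
    n = len(cache)
    return [sum(state[i] for state in cache) / n for i in range(len(cache[0]))]

W = [[1.0, 0.0], [0.0, 1.0]]               # identity base weights
adapter = ([[1.0, 0.0]], [[1.0], [0.0]])   # rank-1 adapter (A, B)

cache = prefill(W, [[1.0, 2.0]], adapter)  # adapted prompt state: [2.0, 2.0]
out = decode_step(W, cache, [1.0, 2.0])    # new token is projected unadapted
print(cache)  # [[2.0, 2.0], [1.0, 2.0]]
print(out)    # [1.5, 2.0]
```

In this sketch the decode path never branches on which user's adapter is active, which is the property the abstract ties to higher multi-adapter throughput. The adapter still influences generation indirectly through the cached prompt states.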
