PreFT: Prefill-only finetuning for efficient inference
Andrew Lanpouthakoun, Aryaman Arora, Zhengxuan Wu, Dhruv Pai, Ben Keigwin, Dan Jurafsky, Christopher Potts

TL;DR
PreFT applies a finetuned adapter only to prefill tokens and discards it before decoding, which raises multi-adapter serving throughput in large language models with minimal loss in task performance.
Contribution
The paper proposes and releases PreFT, a novel prefill-only finetuning method that significantly improves inference throughput for multi-adapter LLM serving.
Findings
Serving multi-user PreFTs is 1.9x more efficient than serving traditional PEFTs.
PreFTs have higher evaluation loss than PEFTs on supervised tasks, but this can be mitigated by increasing rank.
PreFTs approach parity with PEFTs in reinforcement learning tasks.
Abstract
Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an…
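The prefill/decode split described in the abstract can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' released implementation: it assumes a single attention head, a LoRA-style low-rank adapter on the key and value projections, and plain Python lists instead of tensors. All names (`project`, `prefill`, `decode_step`, `rand_matrix`) are hypothetical. In this sketch the adapter is used while the prompt is processed, and decode steps run on the base weights alone.

```python
import math
import random


def linear(x, W):
    """Vector-matrix product: x (length d_in) times W (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def project(x, W, adapter=None):
    """Base projection, plus an optional low-rank update (x @ A) @ B."""
    y = linear(x, W)
    if adapter is not None:
        A, B = adapter  # assumed shapes: A is d x r, B is r x d
        delta = linear(linear(x, A), B)
        y = [yi + di for yi, di in zip(y, delta)]
    return y


def attend(q, keys, values):
    """Single-head softmax attention of one query over cached keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


def prefill(prompt, Wk, Wv, adapter_k, adapter_v):
    """Process all prompt tokens WITH the adapter and return the KV cache."""
    keys = [project(x, Wk, adapter_k) for x in prompt]
    values = [project(x, Wv, adapter_v) for x in prompt]
    return keys, values


def decode_step(x, cache, Wq, Wk, Wv):
    """Process one new token with base weights only; no adapter is passed."""
    keys, values = cache
    keys = keys + [project(x, Wk)]
    values = values + [project(x, Wv)]
    out = attend(project(x, Wq), keys, values)
    return out, (keys, values)


# Toy data: hidden size d, adapter rank r, a 3-token prompt.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


d, r = 4, 2
Wq, Wk, Wv = (rand_matrix(d, d) for _ in range(3))
adapter_k = (rand_matrix(d, r), rand_matrix(r, d))
adapter_v = (rand_matrix(d, r), rand_matrix(r, d))
prompt = rand_matrix(3, d)
new_token = rand_matrix(1, d)[0]

cache = prefill(prompt, Wk, Wv, adapter_k, adapter_v)
out, cache = decode_step(new_token, cache, Wq, Wk, Wv)
```

Because `decode_step` takes no adapter argument, every decode step in this sketch uses the same base weights whichever adapter was used for the prompt. The adapter's influence on later tokens comes only through the cached keys and values.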
