FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

arXiv:2605.09932 · cs.CL · May 12, 2026

TL;DR

FocuSFT is a bilevel optimization framework that improves long-context learning in language models by concentrating attention on relevant content during supervised fine-tuning.

Contribution

The paper proposes a training-time framework that reduces attention dilution, the starvation of content tokens in the attention distribution, and thereby improves long-context understanding in language models.

Findings

01. Up to +14 pp accuracy improvement on BABILong at 32K context length.

02. Raises CWE aggregation on RULER from 72.9% to 81.1% (+8.2 pp) at 16K context length.

03. Reduces attention sink mass by 529× and triples context engagement during training.

Abstract

Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the…
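
To make the notion of attention dilution concrete, the sketch below compares how much of a single query's attention budget lands on a sink position versus on content positions. It is an illustration only, not the paper's metric or code: the function name `attention_mass`, the choice of position 0 as the sink, and the logit values are all assumptions made up for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass(scores, sink_positions, content_positions):
    """Return (sink_mass, content_mass) for one query's attention logits.

    Hypothetical helper: sums the softmax weights that fall on the given
    sink positions and on the given content positions.
    """
    weights = softmax(scores)
    sink_mass = sum(weights[i] for i in sink_positions)
    content_mass = sum(weights[i] for i in content_positions)
    return sink_mass, content_mass


# Toy logits for one query over 8 key positions (made-up numbers).
# Position 0 plays the role of an attention sink; positions 4 and 5
# are assumed to hold the semantically relevant content.
diluted = [6.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
focused = [0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 0.0, 0.0]

for name, scores in [("diluted", diluted), ("focused", focused)]:
    sink, content = attention_mass(
        scores, sink_positions=[0], content_positions=[4, 5]
    )
    print(f"{name}: sink mass = {sink:.3f}, content mass = {content:.3f}")
```

In the diluted case nearly all of the softmax weight sits on the sink position, so the content positions contribute very little to the attention output; the focused case reverses that split. How FocuSFT actually measures sink mass and context engagement is defined in the paper, not here.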

Code & Models

Repositories

JarvisPei/FocuSFT (GitHub)
