TL;DR
FocuSFT is a bilevel optimization method that improves long-context learning in language models by concentrating attention on relevant content during fine-tuning.
Contribution
The paper proposes a training framework that reduces attention dilution during supervised fine-tuning and improves long-context understanding in language models.
Findings
Improves accuracy on BABILong by up to 14 pp at 32K context length.
Increases accuracy on the CWE (common words extraction) aggregation task from 72.9% to 81.1% at 16K context length on RULER.
Reduces attention sink mass by 529× and triples context engagement during training.
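To make the "attention sink mass" finding concrete, the sketch below shows one plausible way to measure it: the average attention probability that queries place on the first few positions. This is an illustrative assumption, not the paper's stated metric; the function names `softmax` and `sink_mass`, the `num_sink_tokens` parameter, and the toy scores are all hypothetical, and the toy rows ignore causal masking for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sink_mass(attn_rows, num_sink_tokens=1):
    """Mean attention probability on the first `num_sink_tokens` positions.

    attn_rows: one attention distribution per query, each summing to 1.
    Treating the leading positions as the "sink" is an assumption made
    for this sketch; the paper may define the sink set differently.
    """
    per_query = [sum(row[:num_sink_tokens]) for row in attn_rows]
    return sum(per_query) / len(per_query)


# Toy data: three queries over six key positions.
# "diluted": nearly all attention lands on position 0 (the sink).
# "focused": attention concentrates on position 3 (a content token).
diluted = [softmax([8.0, 0.0, 0.0, 0.0, 0.0, 0.0]) for _ in range(3)]
focused = [softmax([0.0, 0.0, 0.0, 4.0, 0.0, 0.0]) for _ in range(3)]

diluted_mass = sink_mass(diluted)
focused_mass = sink_mass(focused)
reduction_factor = diluted_mass / focused_mass

print(f"sink mass (diluted): {diluted_mass:.4f}")
print(f"sink mass (focused): {focused_mass:.4f}")
print(f"reduction factor:    {reduction_factor:.1f}x")
```

A reduction factor computed this way is a ratio of two averages, so a large multiplier such as the reported 529× is consistent with the sink mass falling from most of the attention budget to a small fraction of a percent.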
Abstract
Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the…
