Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
Shuaiyi Li, Zhisong Zhang, Yan Wang, Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam

TL;DR
This paper introduces a semantic segmentation dataset and a block distillation training framework to enhance the generalization and efficiency of block attention in long-context NLP tasks.
Contribution
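It presents SemanticSeg, a dataset of over 30k instances across 16 categories for automatic text segmentation, and block distillation, a training method intended to adapt models to block attention without the performance degradation that existing block fine-tuning methods risk.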
Findings
The segmenter outperforms heuristic baselines in segmentation quality.
Block distillation achieves near-full-attention performance efficiently.
The methods improve long-context processing in multiple models.
Abstract
Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a…
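As an illustration of the block attention idea described in the abstract, the following is a minimal sketch of an attention mask in which tokens attend only within their own block. It is not the authors' implementation: the function name `block_attention_mask`, the causal ordering inside each block, and the `last_block_global` option (letting a final block, such as a user query, see all earlier tokens) are assumptions made for illustration.

```python
import numpy as np

def block_attention_mask(block_lengths, last_block_global=False):
    """Return a boolean (n, n) mask; mask[i, j] is True if token i may attend to token j.

    block_lengths: token count of each block, in input order.
    last_block_global: hypothetical option, not taken from the paper, that lets
    the final block attend to every earlier token.
    """
    n = sum(block_lengths)
    # block_ids[i] is the index of the block that token i belongs to.
    block_ids = np.repeat(np.arange(len(block_lengths)), block_lengths)
    # Standard causal constraint: a token sees itself and earlier positions only.
    causal = np.tril(np.ones((n, n), dtype=bool))
    # Block constraint: a token sees only tokens from its own block.
    same_block = block_ids[:, None] == block_ids[None, :]
    mask = causal & same_block
    if last_block_global:
        # Rows of the final block fall back to plain causal attention.
        last = block_ids == len(block_lengths) - 1
        mask[last, :] = causal[last, :]
    return mask

# Two blocks of 2 and 3 tokens: the mask is block-diagonal and lower-triangular.
mask = block_attention_mask([2, 3])
print(mask.astype(int))
```

Because no block depends on any other under such a mask, the keys and values of one block could in principle be computed once and reused wherever that block reappears, which is the KV cache reuse the abstract refers to.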
