Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

Shuaiyi Li; Zhisong Zhang; Yan Wang; Lei Zhu; Dongyang Ma; Chenlong Deng; Yang Deng; Wai Lam

arXiv:2605.15913 · cs.CL · May 22, 2026

TL;DR

This paper introduces a semantic segmentation dataset and a block distillation training framework to enhance the generalization and efficiency of block attention in long-context NLP tasks.

Contribution

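It presents SemanticSeg, a dataset of over 30k instances across 16 categories for automatic text segmentation, and block distillation, a training method intended to adapt models to block attention without the performance degradation that existing block fine-tuning methods risk.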

Findings

01. The segmenter outperforms heuristic baselines in segmentation quality.

02. Block distillation efficiently reaches performance close to that of full attention.

03. The methods improve long-context processing in multiple models.

Abstract

Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into blocks that align with human intuition, with controllable granularity. Second, we propose block distillation, a…
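To make the opening definition concrete, the following is a minimal sketch of an attention mask in which tokens attend causally only within their own block. It is an illustration, not the authors' implementation: the function name `block_attention_mask`, the `block_lengths` argument, and the optional `final_block_global` flag (which lets a last block, such as a user query, see all earlier blocks) are assumptions that do not appear in the abstract.

```python
from typing import List


def block_attention_mask(
    block_lengths: List[int],
    final_block_global: bool = False,
) -> List[List[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Hypothetical helper for illustration only. `final_block_global` is an
    assumption: when True, tokens in the last block may attend to every
    earlier token; the abstract does not specify this behaviour.
    """
    # Map every token position to the index of the block that contains it.
    block_of = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    total = len(block_of)
    last_block = len(block_lengths) - 1

    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            causal = k <= q  # never attend to future tokens
            same_block = block_of[q] == block_of[k]
            sees_all = final_block_global and block_of[q] == last_block
            row.append(causal and (same_block or sees_all))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Three blocks of 2, 3, and 2 tokens; "x" marks an allowed attention pair.
    for row in block_attention_mask([2, 3, 2]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no token in one block depends on tokens in another, the keys and values computed for a block do not change when the surrounding blocks change, which is the property that makes per-block KV cache reuse possible.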

Code & Models

Models

Syon-Li/Qwen3-4B-Instruct-2507-Segmenter (Hugging Face model, 422 downloads, 1 like)
