Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents
Kaifeng Wu, Junyan Wu, Qiang Liu, Jiarui Zhang, Wen Xu

TL;DR
This paper introduces a discriminative model based on Qwen3-0.6B for ultra-long document segmentation, supporting single-pass inputs of up to 13k tokens, with improved accuracy and efficiency over existing generative models.
Contribution
The paper presents a novel discriminative segmentation framework that handles ultra-long documents efficiently and accurately, with a new vector fusion method for downstream retrieval.
Findings
Outperforms generative models in macro-averaged F1 score.
Achieves two orders of magnitude faster inference than generative models.
Supports single-pass processing of up to 13,000 tokens.
Abstract
Long-document topic segmentation plays an important role in information retrieval and document understanding, yet existing methods still show clear shortcomings in ultra-long text settings. Traditional discriminative models are constrained by fixed windows and cannot model document-level semantics; generative large language models can output paragraph boundaries, but inference is expensive and long inputs are difficult to support. To address these issues, we propose a discriminative segmentation model based on Qwen3-0.6B. On top of the backbone network, we add a cross-window context fusion layer and a boundary classification head, and combine them with an overlapping sliding-window strategy. Our model supports single-pass inputs of up to 13k tokens and can be extended to ultra-long documents for paragraph boundary detection. To further enhance downstream retrieval efficiency, we derive…
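The overlapping sliding-window idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper describes a learned cross-window context fusion layer, whereas this sketch substitutes plain averaging of per-token boundary scores in the overlap regions. The function names `make_windows` and `merge_scores`, and the window and overlap sizes in the usage note, are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def make_windows(n_tokens: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return [start, end) spans that cover range(n_tokens) with a fixed overlap."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    stride = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


def merge_scores(
    n_tokens: int,
    spans: Sequence[Tuple[int, int]],
    window_scores: Sequence[Sequence[float]],
) -> List[float]:
    """Average boundary scores over every window that covers a position.

    Assumption: simple averaging stands in for the paper's learned fusion layer.
    """
    total = [0.0] * n_tokens
    count = [0] * n_tokens
    for (start, _end), scores in zip(spans, window_scores):
        for offset, score in enumerate(scores):
            total[start + offset] += score
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]
```

For example, `make_windows(10, 4, 2)` yields four spans that each share two positions with their neighbor. Positions covered by two windows receive the mean of both scores, so a boundary that only one window detects is down-weighted rather than discarded.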
Taxonomy
Topics: Topic Modeling · Handwritten Text Recognition Techniques · Text and Document Classification Technologies
