Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech
Fabian Retkowski, Alexander Waibel

TL;DR
This paper introduces new benchmarks, methods, and models for paragraph segmentation in speech transcripts, with the aim of establishing it as a standard task in speech processing.
Contribution
It provides the first benchmarks for paragraph segmentation in speech (TEDPara and YTSegPara), proposes a constrained-decoding formulation for large language models, and introduces MiniSeg, a compact model that achieves state-of-the-art accuracy.
Findings
Benchmarks for speech paragraph segmentation established
Constrained decoding proposed for faithful paragraph insertion
MiniSeg model achieves state-of-the-art accuracy
Abstract
Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third,…
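The constrained-decoding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hypothetical `score(prefix, separator)` function standing in for a language model's log-probability of a separator given the text so far, and it uses a simple greedy choice. The sentences themselves are copied verbatim, so the only decision the model makes is whether each sentence boundary continues the paragraph or starts a new one.

```python
from typing import Callable, List, Tuple

SPACE = " "     # separator that continues the current paragraph
BREAK = "\n\n"  # separator that starts a new paragraph

# Hypothetical scorer (an assumption, not from the paper): returns the
# log-probability of `separator` following `prefix`.
Scorer = Callable[[str, str], float]


def insert_breaks(sentences: List[str], score: Scorer) -> Tuple[str, List[bool]]:
    """Greedily pick SPACE or BREAK at each sentence boundary.

    Sentence text is never generated, only copied, so the output
    stays aligned with the input transcript sentence by sentence.
    """
    if not sentences:
        return "", []
    text = sentences[0]
    breaks: List[bool] = []  # one decision per boundary between sentences
    for sentence in sentences[1:]:
        is_break = score(text, BREAK) > score(text, SPACE)
        breaks.append(is_break)
        text += (BREAK if is_break else SPACE) + sentence
    return text, breaks


def toy_score(prefix: str, separator: str) -> float:
    """Toy stand-in for a language model, used only for this demo."""
    if separator == BREAK:
        return 0.0 if prefix.endswith("Thanks.") else -1.0
    return -0.5


sentences = ["Hello all.", "Thanks.", "Next topic."]
text, breaks = insert_breaks(sentences, toy_score)
print(breaks)  # [False, True]
```

Because only the separators vary, replacing every paragraph break in the output with a space recovers the original transcript exactly, and the per-boundary decisions in `breaks` can be compared directly against reference labels.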
