Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech

Fabian Retkowski; Alexander Waibel

arXiv:2512.24517·cs.CL·April 10, 2026

TL;DR

This paper introduces benchmarks, a constrained-decoding method, and a compact model for paragraph segmentation in speech transcripts, with the aim of establishing it as a standard task in speech processing.

Contribution

It provides the first benchmarks for paragraph segmentation in speech (TEDPara and YTSegPara), proposes a constrained-decoding approach that lets large language models insert paragraph breaks without altering the transcript, and introduces MiniSeg, a compact model that achieves state-of-the-art accuracy.

Findings

01 Benchmarks for speech paragraph segmentation established

02 Proposed constrained-decoding for faithful paragraph insertion

03 MiniSeg model achieves state-of-the-art accuracy

Abstract

Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third,…
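The constrained-decoding idea in the abstract's second contribution can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the transcript is already split into sentences and that the only free decision at each sentence boundary is "break" or "continue". The names `segment`, `boundary_labels`, `BreakScorer`, and `toy_scorer` are hypothetical; in practice the scorer would be replaced by a language model's probability of emitting a paragraph break at that position.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer type: given the sentences of the current paragraph and
# the next sentence, return the probability that a paragraph break precedes it.
BreakScorer = Callable[[Sequence[str], str], float]


def segment(
    sentences: Sequence[str],
    score_break: BreakScorer,
    threshold: float = 0.5,
) -> List[List[str]]:
    """Group sentences into paragraphs without altering any sentence.

    The transcript text is copied verbatim; the only decision made at each
    sentence boundary is whether to start a new paragraph.
    """
    paragraphs: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        # A break is only possible between sentences, never inside one.
        if current and score_break(current, sentence) > threshold:
            paragraphs.append(current)
            current = []
        current.append(sentence)
    if current:
        paragraphs.append(current)
    return paragraphs


def boundary_labels(paragraphs: Sequence[Sequence[str]]) -> List[int]:
    """Sentence-aligned labels: 1 if a sentence ends a paragraph, else 0."""
    labels: List[int] = []
    for paragraph in paragraphs:
        labels.extend([0] * (len(paragraph) - 1) + [1])
    return labels


def toy_scorer(context: Sequence[str], next_sentence: str) -> float:
    """Stand-in for a model: break before sentences that start with 'Next'."""
    return 1.0 if next_sentence.startswith("Next") else 0.0


sentences = [
    "Hello everyone.",
    "Today we talk about speech.",
    "Next, a demo.",
    "It works.",
]
paragraphs = segment(sentences, toy_scorer)
labels = boundary_labels(paragraphs)
```

Because the output contains exactly the input sentences in their original order, the predicted labels line up one-to-one with reference labels, which is what makes sentence-aligned evaluation possible.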
