Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Fabian Retkowski; Maike Züfle; Thai Binh Nguyen; Jan Niehues; Alexander Waibel

arXiv:2602.08979 · cs.SD · February 10, 2026

TL;DR

This paper introduces AudioSeg, an audio-only model for segmenting long-form audio into chapters. It reports that AudioSeg outperforms text-based methods, analyzes the factors that influence performance, and formalizes evaluation protocols for the task.

Contribution

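The paper presents a novel audio-only architecture (AudioSeg), an empirical analysis of factors affecting performance (transcript quality, acoustic features, duration, and speaker composition), and formalized evaluation protocols for audio chaptering.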

Findings

01

AudioSeg outperforms text-based models in audio segmentation.

02

Pauses significantly improve segmentation accuracy.

03

Multimodal LLMs are limited by context length but promising for shorter audio.
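To make the second finding concrete, the following is a minimal sketch of how a pause feature could be derived from word-level timestamps. It is not the authors' feature extraction: the `Word` tuple layout, the function name `pause_before_each_word`, and the example timings are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical input format: (token, start_sec, end_sec) per word,
# e.g. as produced by an ASR system with word-level timestamps.
Word = Tuple[str, float, float]


def pause_before_each_word(words: List[Word]) -> List[float]:
    """Return the silence (in seconds) preceding each word.

    The first word gets 0.0, and overlapping timestamps are clipped to 0.0.
    """
    pauses: List[float] = []
    prev_end = None
    for _, start, end in words:
        if prev_end is None:
            pauses.append(0.0)
        else:
            pauses.append(max(0.0, start - prev_end))
        prev_end = end
    return pauses


# Illustrative (made-up) timings: a long gap precedes "next".
words = [
    ("welcome", 0.0, 0.5),
    ("back", 0.6, 0.9),
    ("next", 3.0, 3.5),
    ("topic", 3.5, 4.0),
]
pauses = pause_before_each_word(words)
```

In this toy input, the longest pause precedes the word "next", which is the kind of acoustic cue a text-based segmenter could receive as an extra feature alongside the transcript.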

Abstract

Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg…
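The abstract contrasts text-space protocols with transcript-invariant time-space protocols. As a rough illustration of what scoring in time space can look like, the sketch below compares predicted and reference chapter boundaries as timestamps, so no transcript is needed. It is an assumption-laden example, not the paper's protocol: the function name `boundary_f1`, the greedy one-to-one matching, and the 5-second tolerance are all hypothetical choices.

```python
from typing import List, Tuple


def boundary_f1(
    pred: List[float], ref: List[float], tol: float = 5.0
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over boundary timestamps (in seconds).

    A predicted boundary counts as a hit if it lies within `tol` seconds of
    a reference boundary that has not been matched yet (one-to-one matching).
    The tolerance value is an illustrative assumption.
    """
    pred = sorted(pred)
    ref = sorted(ref)
    i = j = matched = 0
    # Two-pointer sweep over both sorted lists.
    while i < len(pred) and j < len(ref):
        if abs(pred[i] - ref[j]) <= tol:
            matched += 1
            i += 1
            j += 1
        elif pred[i] < ref[j]:
            i += 1  # prediction is too early to match any remaining reference
        else:
            j += 1  # reference is too early to match any remaining prediction
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up boundaries: two of three predictions fall within tolerance.
precision, recall, f1 = boundary_f1([58.0, 130.0, 300.0], [60.0, 200.0, 303.0])
```

Because both inputs are plain timestamps, the same score can be computed for a text-based model, an audio-only model, or a multimodal LLM regardless of which transcript (if any) each one used.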

Code & Models

Datasets

retkowski/ytseg
dataset· 945 dl

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Speech Recognition and Synthesis