LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech

Fei Yang; Xuanfan Ni; Renyi Yang; Jiahui Geng; Qing Li; Chenyang Lyu; Yichao Du; Longyue Wang; Weihua Luo; Kaifu Zhang

arXiv:2601.13539·cs.SD·January 21, 2026

TL;DR

LongSpeech is a comprehensive benchmark designed to evaluate speech models on long-duration audio tasks, highlighting current limitations and guiding future research in long-form speech understanding.

Contribution

We introduce LongSpeech, a large-scale, scalable benchmark with diverse annotations for multiple long-form speech tasks, and a pipeline for future extensions.

Findings

01 State-of-the-art models perform poorly on long speech tasks.

02 Models tend to specialize in individual tasks rather than generalize.

03 High-level reasoning remains challenging for current models.

Abstract

Recent advances in audio-language models have demonstrated remarkable success on short, segment-level speech tasks. However, real-world applications such as meeting transcription, spoken document understanding, and conversational analysis require robust models capable of processing and reasoning over long-form audio. In this work, we present LongSpeech, a large-scale and scalable benchmark specifically designed to evaluate and advance the capabilities of speech models on long-duration audio. LongSpeech comprises over 100,000 speech segments, each approximately 10 minutes long, with rich annotations for ASR, speech translation, summarization, language detection, speaker counting, content separation, and question answering. We introduce a reproducible pipeline for constructing long-form speech benchmarks from diverse sources, enabling future extensions. Our initial experiments with…
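
For a sense of scale, the two figures quoted in the abstract can be combined into a rough estimate of total audio duration. The sketch below is only back-of-envelope arithmetic: it treats "over 100,000" as a lower bound of 100,000 and "approximately 10 minutes" as exactly 10, and the function name is illustrative rather than anything taken from the paper.

```python
# Back-of-envelope scale estimate built only from the abstract's figures.
NUM_SEGMENTS = 100_000       # "over 100,000 speech segments" (used as a lower bound)
MINUTES_PER_SEGMENT = 10     # "approximately 10 minutes long" (nominal value, an assumption)


def total_audio_hours(num_segments: int, minutes_per_segment: float) -> float:
    """Return the total audio duration in hours."""
    return num_segments * minutes_per_segment / 60


hours = total_audio_hours(NUM_SEGMENTS, MINUTES_PER_SEGMENT)
print(f"roughly {hours:,.0f} hours of audio")  # roughly 16,667 hours of audio
```

Under these assumptions the benchmark would hold on the order of 16,000 to 17,000 hours of audio; the exact total depends on the true segment count and lengths, which the abstract gives only approximately.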

Code & Models

Datasets

AIDC-AI/Marco_Longspeech
dataset · 7.1k downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Audio Processing