S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models

Yuanbo Fang; Haoze Sun; Jun Liu; Tao Zhang; Zenan Zhou; Weipeng Chen; Xiaofen Xing; Xiangmin Xu

arXiv:2505.14438·cs.SD·May 21, 2025

TL;DR

S2SBench is a new benchmark designed to measure and analyze the decline in reasoning and generation abilities of speech-to-speech large language models when processing audio, using diagnostic datasets and a pairwise perplexity-based evaluation protocol.

Contribution

This paper introduces S2SBench, the first benchmark specifically quantifying intelligence degradation in speech-to-speech LLMs, with diagnostic datasets and a novel evaluation protocol.

Findings

01. S2SBench effectively measures performance gaps in speech LLMs.

02. Application to Baichuan-Audio reveals insights into training dynamics.

03. Datasets and code are publicly available for further research.

Abstract

End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at…
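The pairwise protocol described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the decision rule (a pair counts as correct when the plausible sample has lower perplexity than the implausible one), and the definition of degradation as the text-minus-audio accuracy gap are all assumptions made for illustration. The token log-probabilities are toy values, not results from the paper.

```python
import math
from typing import List, Sequence, Tuple

# A pair holds per-token natural-log probabilities for
# (plausible sample, implausible sample). Hypothetical data layout.
Pair = Tuple[Sequence[float], Sequence[float]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean token log-probability)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pairwise_accuracy(pairs: List[Pair]) -> float:
    """Fraction of pairs where the plausible sample gets lower perplexity.

    Assumed decision rule; the paper's exact scoring may differ.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for pos, neg in pairs if perplexity(pos) < perplexity(neg))
    return wins / len(pairs)


def degradation(text_pairs: List[Pair], audio_pairs: List[Pair]) -> float:
    """Assumed metric: accuracy with text input minus accuracy with audio input."""
    return pairwise_accuracy(text_pairs) - pairwise_accuracy(audio_pairs)


# Toy log-probabilities for the same two pairs under text and audio input.
text_pairs = [([-0.1, -0.2], [-1.0, -1.5]), ([-0.3, -0.3], [-0.9, -0.8])]
audio_pairs = [([-0.5, -0.6], [-1.0, -1.2]), ([-1.2, -1.1], [-0.7, -0.6])]

print("text accuracy:", pairwise_accuracy(text_pairs))
print("audio accuracy:", pairwise_accuracy(audio_pairs))
print("degradation:", degradation(text_pairs, audio_pairs))
```

In practice the log-probabilities would come from scoring each sample with the model under the text and audio conditions; a positive gap would indicate that the model separates plausible from implausible samples less reliably when the input is audio.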

Code & Models

Repositories

undobug/s2sbench
PyTorch · Official

Datasets

undobug/S2SBench
dataset · 62 downloads

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques