Closing the Modality Reasoning Gap for Speech Large Language Models

Chaoren Wang; Heng Lu; Xueyao Zhang; Shujie Liu; Yan Lu; Jinyu Li; Zhizheng Wu

arXiv:2601.05543 · cs.CL · April 21, 2026

TL;DR

This paper introduces TARS, a reinforcement-learning framework that narrows the gap between reasoning performance on speech inputs and on text inputs in Speech Large Language Models, achieving state-of-the-art results among 7B-scale Speech LLMs.

Contribution

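The paper presents a reinforcement-learning approach that improves reasoning on speech inputs using two dense alignment signals: representation alignment (layer-wise hidden-state similarity between speech- and text-conditioned trajectories) and behavior alignment (semantic consistency between generated outputs and reference text completions).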

Findings

01. Significantly narrows the modality reasoning gap between speech and text inputs.

02. Achieves state-of-the-art performance among 7B-scale Speech LLMs.

03. Demonstrated on challenging reasoning benchmarks, including MMSU and OBQA.

Abstract

Although Speech Large Language Models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our…
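
The two alignment signals described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the pooling, or how the signals are combined, so the cosine similarity, the mean pooling over tokens, the embedding-based behavior score, the weights `w_rep` and `w_beh`, and all function names below are assumptions made for illustration. The asymmetric reward design and the reinforcement-learning loop are not modeled.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pool(states: Sequence[Vector]) -> List[float]:
    """Average token-level hidden states into one vector (assumed pooling)."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def representation_alignment(
    speech_layers: Sequence[Sequence[Vector]],
    text_layers: Sequence[Sequence[Vector]],
) -> float:
    """Mean over layers of the cosine similarity between pooled hidden states.

    Each argument is indexed as [layer][token][hidden_dim]. Pooling first lets
    the speech and text sequences have different lengths.
    """
    if len(speech_layers) != len(text_layers):
        raise ValueError("both trajectories must expose the same number of layers")
    sims = [
        cosine(mean_pool(s), mean_pool(t))
        for s, t in zip(speech_layers, text_layers)
    ]
    return sum(sims) / len(sims)


def behavior_alignment(output_emb: Vector, reference_emb: Vector) -> float:
    """Semantic consistency, approximated here as cosine similarity between
    sentence embeddings of the generated output and the reference text
    completion (the embedding model is hypothetical and supplied by the caller).
    """
    return cosine(output_emb, reference_emb)


def combined_reward(
    rep: float, beh: float, w_rep: float = 0.5, w_beh: float = 0.5
) -> float:
    """Weighted sum of the two signals; the weights are illustrative only."""
    return w_rep * rep + w_beh * beh
```

In this sketch the text-conditioned trajectory plays the role of the reference: a speech-conditioned rollout scores higher when its per-layer hidden states and its final output stay close to what the model produces for the equivalent text input.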
