SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

Chih-Kai Yang; Neo Ho; Yen-Ting Piao; Hung-yi Lee

arXiv:2505.13237·eess.AS·August 26, 2025

TL;DR

SAKURA is a new benchmark designed to evaluate the multi-hop reasoning abilities of large audio-language models, revealing their current struggles in integrating speech and audio information for complex reasoning tasks.

Contribution

The paper introduces SAKURA, the first benchmark specifically assessing multi-hop reasoning in large audio-language models, highlighting a key limitation in their multimodal reasoning capabilities.

Findings

01. LALMs struggle with multi-hop reasoning even when they extract the relevant information correctly.

02. Current models face a fundamental challenge in integrating speech and audio representations for reasoning.

03. SAKURA exposes critical limitations in current LALMs and points to directions for future research.
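The first finding contrasts two things: whether a model extracts a fact from the audio, and whether it can then reason over that fact. One way to make this contrast concrete is sketched below. This is only an illustration under stated assumptions, not the paper's evaluation protocol: the `PairedResult` record, the `accuracy_summary` function, and the toy data are all hypothetical names and values introduced here. The sketch assumes each benchmark item has a single-hop question (extract the fact) paired with a multi-hop question (use the fact), and reports multi-hop accuracy restricted to items where the single-hop answer was correct.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PairedResult:
    """Hypothetical record: one item with a paired single-hop and multi-hop question."""
    single_hop_correct: bool  # did the model extract the fact from the audio?
    multi_hop_correct: bool   # did the model answer the question that builds on that fact?


def accuracy_summary(results: list[PairedResult]) -> dict[str, Optional[float]]:
    """Return single-hop accuracy, multi-hop accuracy, and multi-hop accuracy
    restricted to items whose single-hop answer was correct."""
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")
    single = sum(r.single_hop_correct for r in results)
    multi = sum(r.multi_hop_correct for r in results)
    both = sum(r.single_hop_correct and r.multi_hop_correct for r in results)
    return {
        "single_hop_acc": single / n,
        "multi_hop_acc": multi / n,
        # Undefined (None) when no single-hop answer was correct.
        "multi_given_single": both / single if single else None,
    }


# Toy data, made up for illustration only (not results from the paper).
toy_results = [
    PairedResult(True, True),
    PairedResult(True, False),
    PairedResult(True, False),
    PairedResult(False, False),
]
summary = accuracy_summary(toy_results)
print(summary)
```

A large gap between `single_hop_acc` and `multi_given_single` would indicate the pattern the finding describes: the information is extracted correctly, but the reasoning step that depends on it fails.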

Abstract

Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, etc. While their performance on speech and audio-processing tasks has been extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
