What Are They Doing? Joint Audio-Speech Co-Reasoning
Yingzhi Wang, Pooneh Mousavi, Artem Ploujnikov, Mirco Ravanelli

TL;DR
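This paper introduces Joint Audio-Speech Co-Reasoning (JASCO), a task and benchmark that requires reasoning over both the sounds and the speech in a clip, together with the "What Are They Doing" scene-reasoning dataset, to evaluate how well Auditory Large Language Models (ALLMs) handle both modalities jointly.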
Contribution
The paper presents a novel joint audio-speech reasoning task, a new dataset, and insights into model behaviors across modalities, advancing multi-modal audio-speech processing research.
Findings
Models show varying dependence on audio and speech modalities.
The JASCO benchmark reveals strengths and weaknesses of current ALLMs.
The dataset enables targeted evaluation of joint reasoning capabilities.
Abstract
In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.
Taxonomy
Topics: Speech and dialogue systems
Methods: Focus
