What Are They Doing? Joint Audio-Speech Co-Reasoning

Yingzhi Wang; Pooneh Mousavi; Artem Ploujnikov; Mirco Ravanelli

arXiv:2409.14526 · cs.SD · January 14, 2025

TL;DR

This paper introduces JASCO (Joint Audio-Speech Co-Reasoning), a new benchmark task with an accompanying dataset for evaluating how well Auditory Large Language Models (ALLMs) reason jointly over audio and speech, highlighting their capabilities and limitations.

Contribution

The paper presents a novel joint audio-speech reasoning task, a new dataset, and insights into model behaviors across modalities, advancing multi-modal audio-speech processing research.

Findings

01. Models show varying dependence on the audio and speech modalities.

02. The JASCO benchmark reveals strengths and weaknesses of current ALLMs.

03. The dataset enables targeted evaluation of joint reasoning capabilities.

Abstract

In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.

Code & Models

Repositories

BenoitWang/What_Are_They_Doing
Official

Taxonomy

Topics: Speech and dialogue systems

Methods: Focus