Joint Audio and Speech Understanding

Yuan Gong; Alexander H. Liu; Hongyin Luo; Leonid Karlinsky; James Glass

arXiv:2309.14405 · cs.SD · December 12, 2023

TL;DR

This paper introduces LTU-AS, a novel machine learning model that integrates perception and reasoning modules to comprehensively understand both speech and non-speech audio signals, mimicking human audio perception.

Contribution

The paper presents what the authors describe as the first unified model that combines perception and reasoning for comprehensive audio understanding, pairing Whisper as the perception module with LLaMA as the reasoning module.

Findings

01. LTU-AS can recognize and understand speech, paralinguistics, and non-speech sounds simultaneously.

02. The model demonstrates advanced reasoning capabilities in audio perception tasks.

03. It achieves near-human level comprehension of complex audio signals.

Abstract

Humans are surrounded by audio signals that include both speech and non-speech sounds. The recognition and understanding of speech and non-speech audio events, along with a profound comprehension of the relationship between them, constitute fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perception and advanced reasoning ability. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events - almost everything perceivable from audio signals.
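The split between a perception module and a reasoning module can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the `Perception` container, the `project` and `build_reasoner_input` helpers, and the linear projection are hypothetical and are not taken from the paper or from the Whisper or LLaMA codebases. The sketch only shows the general idea of handing both audio-derived features and recognized spoken text to a language model alongside a question.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Vector = List[float]


@dataclass
class Perception:
    """Hypothetical output of a perception module (the role Whisper plays in the abstract)."""
    audio_features: List[Vector]  # one feature vector per audio frame
    transcript: str               # recognized spoken text, may be empty


def project(features: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Linearly map each feature vector to the reasoning module's embedding size.

    `weights` has shape (out_dim, in_dim). A plain linear map is an assumption
    chosen for brevity, not the paper's actual adapter.
    """
    return [[sum(w * x for w, x in zip(row, f)) for row in weights] for f in features]


def build_reasoner_input(
    perception: Perception, question: str, weights: List[Vector]
) -> Dict[str, Union[List[Vector], str]]:
    """Bundle projected audio features with a text prompt for a language model."""
    return {
        "audio_embeddings": project(perception.audio_features, weights),
        "text_prompt": f"Spoken text: {perception.transcript}\nQuestion: {question}",
    }


if __name__ == "__main__":
    # Two fake audio frames with 2-dimensional features, projected to 3 dimensions.
    p = Perception(audio_features=[[1.0, 2.0], [0.0, 1.0]], transcript="hello")
    w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(build_reasoner_input(p, "What is happening?", w))
```

In this arrangement the language model sees non-speech and paralinguistic information through the audio embeddings and the spoken content through the transcript, which is one plausible reading of "simultaneously recognize and jointly understand" in the abstract.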

Code & Models

Repositories

YuanGongND/ltu (jax, Official)

Models

speechbrain/speech-llm-LTU-AS-openasqa (Hugging Face model, 8 downloads, 5 likes)

Taxonomy

Topics: Music and Audio Processing