Joint Audio and Speech Understanding
Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass

TL;DR
This paper introduces LTU-AS, a novel machine learning model that integrates perception and reasoning modules to comprehensively understand both speech and non-speech audio signals, mimicking human audio perception.
Contribution
The paper presents what the authors describe as the first unified model that combines perception and reasoning for comprehensive audio understanding, using Whisper as the perception module and LLaMA as the reasoning module.
Findings
LTU-AS can recognize and understand speech, paralinguistics, and non-speech sounds simultaneously.
The model demonstrates advanced reasoning capabilities in audio perception tasks.
It achieves near-human-level comprehension of complex audio signals.
Abstract
Humans are surrounded by audio signals that include both speech and non-speech sounds. The recognition and understanding of speech and non-speech audio events, along with a profound comprehension of the relationship between them, constitute fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perception and advanced reasoning ability. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events - almost everything perceivable from audio signals.
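The division of labor described in the abstract, a perception module feeding a reasoning module, can be illustrated with a minimal Python sketch. Everything here is hypothetical: the `Percept` container, `build_prompt`, `answer`, and the stub modules are illustrative names, not the authors' code, and the sketch assumes the two modules communicate through plain text. The actual LTU-AS model may connect Whisper and LLaMA differently (for example, through learned audio embeddings).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Percept:
    """Hypothetical container for what a perception module extracts from audio."""
    spoken_text: str          # what was said
    paralinguistics: str      # how it was said (e.g. emotion, speaker traits)
    audio_events: List[str]   # non-speech sounds heard in the clip


def build_prompt(percept: Percept, question: str) -> str:
    """Serialize the percept and the user's question into one text prompt.

    Assumption: the reasoning module consumes text only.
    """
    events = ", ".join(percept.audio_events) if percept.audio_events else "none"
    return (
        f"Spoken text: {percept.spoken_text}\n"
        f"Paralinguistics: {percept.paralinguistics}\n"
        f"Audio events: {events}\n"
        f"Question: {question}"
    )


def answer(
    audio: bytes,
    question: str,
    perceive: Callable[[bytes], Percept],
    reason: Callable[[str], str],
) -> str:
    """Run perception first, then hand its output to the reasoning module."""
    return reason(build_prompt(perceive(audio), question))


# Stub modules standing in for a speech/audio encoder and a language model.
def stub_perceive(audio: bytes) -> Percept:
    return Percept("watch out", "alarmed", ["car horn", "tires screeching"])


def stub_reason(prompt: str) -> str:
    return f"[reasoning over {len(prompt.splitlines())} prompt lines]"


if __name__ == "__main__":
    print(answer(b"", "Why is the speaker alarmed?", stub_perceive, stub_reason))
```

Passing the modules in as callables keeps the sketch independent of any particular model library; a real system would replace the stubs with an audio encoder and a language model.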
Taxonomy
Topics: Music and Audio Processing
