Joint Speech Recognition and Audio Captioning
Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

TL;DR
This paper introduces joint modeling of speech recognition and audio captioning to improve understanding of noisy audio environments, creating a new dataset and demonstrating superior performance over separate models.
Contribution
It proposes novel end-to-end joint models for ASR and AAC, and develops a multi-task dataset combining speech and audio captions for evaluation.
Findings
Joint models outperform independent models in noisy environments.
The authors created a multi-task dataset with both speech transcriptions and audio captions.
The proposed joint models show significant improvements over state-of-the-art methods.
Abstract
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR). The goal of AAC is to generate natural language descriptions of contents in audio samples. We propose several approaches for end-to-end joint modeling of ASR and AAC tasks and demonstrate their advantages over traditional approaches, which model these tasks independently. A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions. Therefore we also…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
