Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Xubo Liu; Qiushi Huang; Xinhao Mei; Haohe Liu; Qiuqiang Kong; Jianyuan Sun; Shengchen Li; Tom Ko; Yu Zhang; Lilian H. Tang; Mark D. Plumbley; Volkan Kılıç; Wenwu Wang

arXiv:2210.16428·eess.AS·May 30, 2023

TL;DR

This paper introduces a visually-aware audio captioning system that leverages visual information and adaptive attention to improve the accuracy of describing ambiguous sounds in audio clips, achieving state-of-the-art results.

Contribution

It proposes a novel audio-visual attention mechanism and integrates visual features into audio captioning, enhancing recognition of ambiguous sounds.

Findings

1. Achieves state-of-the-art results on the AudioCaps dataset.
2. Effectively integrates visual features to disambiguate sounds.
3. Improves captioning accuracy with adaptive audio-visual attention.

Abstract

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method…
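
The adaptive integration of audio and visual context described in the abstract can be illustrated with a small gated-fusion sketch. This is not the paper's implementation: the function names (`attend`, `adaptive_fusion`), the scaled dot-product attention, and the single sigmoid gate are all assumptions chosen for illustration, and the official code is in the repository listed under Code & Models. The sketch attends over audio and visual feature sequences with a decoder query, then mixes the two context vectors with a gate computed from both.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Sequence[float]) -> Vector:
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, features: Sequence[Vector]) -> Vector:
    """Scaled dot-product attention of one query over a feature sequence."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fusion(
    query: Vector,
    audio_feats: Sequence[Vector],
    visual_feats: Sequence[Vector],
    gate_weights: Vector,
    gate_bias: float = 0.0,
) -> Tuple[Vector, float]:
    """Hypothetical gated fusion: gate near 1 favours audio, near 0 favours video."""
    audio_ctx = attend(query, audio_feats)
    visual_ctx = attend(query, visual_feats)
    # The gate sees both contexts and the query (an assumed design choice).
    gate_input = audio_ctx + visual_ctx + list(query)
    gate = sigmoid(dot(gate_weights, gate_input) + gate_bias)
    fused = [gate * a + (1.0 - gate) * v for a, v in zip(audio_ctx, visual_ctx)]
    return fused, gate


# Toy example: 2-dimensional features, untrained (all-zero) gate weights.
query = [1.0, 0.0]
audio_feats = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[0.0, 1.0], [0.0, 1.0]]
fused, gate = adaptive_fusion(query, audio_feats, visual_feats, [0.0] * 6)
```

In a trained model the gate weights would be learned, so the mix could shift toward the visual context when the audio alone is ambiguous; with the all-zero weights used here the two contexts are simply averaged.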

Code & Models

Repositories

liuxubo717/v-act (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Subtitles and Audiovisual Media