Exploring the Role of Audio in Video Captioning
Yuhan Shen, Linjie Yang, Longyin Wen, Haichao Yu, Ehsan Elhamifar, Heng Wang

TL;DR
This paper introduces an audio-visual framework for video captioning that learns from raw audio signals instead of text transcripts. It proposes a modality-balanced pre-training loss and improved cross-modal fusion, which together lead to better captioning performance.
Contribution
The paper presents a novel modality-balanced pre-training loss and new fusion mechanisms to make better use of audio in video captioning, outperforming existing methods on multiple datasets.
Findings
Significant performance improvements on four datasets.
Outperforms state-of-the-art methods on some metrics.
Effective use of raw audio signals enhances captioning quality.
Abstract
Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M. Though these approaches have achieved significant improvement, the audio modality is often ignored in video captioning. In this work, we present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. Instead of relying on text transcripts extracted via automatic speech recognition (ASR), we argue that learning with raw audio signals can be more beneficial, as audio has additional information including acoustic events, speaker identity, etc. Our contributions are twofold. First, we observed that the model overspecializes to the audio modality when pre-training with both video and audio modality, since the ground truth (i.e.,…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Subtitles and Audiovisual Media
Methods: Focus
