Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning
Tanzila Rahman, Bicheng Xu, Leonid Sigal

TL;DR
This paper demonstrates that audio signals in videos carry significant information for weakly-supervised dense event captioning, and that combining audio with visual data improves performance beyond state-of-the-art unimodal methods.
Contribution
The paper introduces a multi-modal approach leveraging audio and visual data for weakly-supervised dense event captioning, showing audio's surprising effectiveness and its complementarity to visual information.
Findings
Audio alone nearly matches the performance of a state-of-the-art visual model.
Combining audio and visual data outperforms unimodal methods.
Extensive experiments validate the effectiveness of the multi-modal approach.
Abstract
Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take audio corresponding to video into account at all, or model audio-visual correlations in service of sound or sound source localization. In this paper, we present evidence that audio signals can carry a surprising amount of information when it comes to high-level visual-lingual tasks. Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival the performance of a state-of-the-art visual model and, combined with video, can improve on the state-of-the-art performance. Extensive experiments on the…
