Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

Tanzila Rahman; Bicheng Xu; Leonid Sigal

arXiv:1909.09944 · cs.CV · October 28, 2019

TL;DR

This paper demonstrates that audio signals in videos contain significant information for dense event captioning, and combining audio with visual data improves performance beyond state-of-the-art unimodal methods.

Contribution

The paper introduces a multi-modal approach leveraging audio and visual data for weakly-supervised dense event captioning, showing audio's surprising effectiveness and its complementarity to visual information.

Findings

1. Audio alone nearly matches visual model performance.

2. Combining audio and visual data outperforms unimodal methods.

3. Extensive experiments validate the effectiveness of the multi-modal approach.

Abstract

Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take audio corresponding to video into account at all, or those that model the audio-visual correlations in service of sound or sound source localization. In this paper, we present evidence that audio signals can carry a surprising amount of information when it comes to high-level visual-lingual tasks. Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival the performance of a state-of-the-art visual model and, combined with video, can improve on the state-of-the-art performance. Extensive experiments on the…
