Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Mathew Monfort; SouYoung Jin; Alexander Liu; David Harwath; Rogerio Feris; James Glass; Aude Oliva

arXiv:2105.04489·cs.CV·May 11, 2021

TL;DR

This paper introduces the Spoken Moments dataset, which contains 500k spoken captions for videos, and proposes a contrastive learning method called Adaptive Mean Margin (AMM) to improve video-caption retrieval and generalization.

Contribution

The paper presents a large-scale spoken caption dataset and a new contrastive learning approach that enhances video understanding and retrieval performance.
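
To make the idea of a margin-based contrastive objective concrete, here is a minimal sketch in plain Python. It is not the paper's exact formulation: the rule used for the margin (a fraction `alpha` of the gap between each positive similarity and the mean of its negatives, floored at zero), the function names, and the default `alpha` are all assumptions chosen for illustration. The input `sim` is a hypothetical square matrix of video-caption similarity scores in which entry `[i][i]` is the matched pair.

```python
import math

def adaptive_margins(sim, alpha=0.5):
    """Per-row margin (assumed rule, not taken from the paper):
    alpha * (positive similarity - mean negative similarity), floored at 0."""
    n = len(sim)
    margins = []
    for i in range(n):
        negatives = [sim[i][j] for j in range(n) if j != i]
        gap = sim[i][i] - sum(negatives) / len(negatives)
        margins.append(alpha * max(gap, 0.0))
    return margins

def margin_softmax_loss(sim, margins):
    """Softmax cross-entropy over each row, with the margin subtracted
    from the positive (diagonal) score before normalization."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] - margins[i])
        neg = sum(math.exp(sim[i][j]) for j in range(n) if j != i)
        total += -math.log(pos / (pos + neg))
    return total / n

def contrastive_loss(sim, alpha=0.5):
    """Symmetric loss: video-to-caption (rows) plus caption-to-video (columns)."""
    sim_t = [list(col) for col in zip(*sim)]
    return (margin_softmax_loss(sim, adaptive_margins(sim, alpha))
            + margin_softmax_loss(sim_t, adaptive_margins(sim_t, alpha)))

# Toy batch of two video-caption pairs (hypothetical scores).
sim = [[2.0, 0.0],
       [0.0, 2.0]]
loss = contrastive_loss(sim, alpha=0.5)
```

In this sketch a larger margin makes the positive pair harder to satisfy, so the loss with `alpha=0.5` is higher than with `alpha=0.0` on the same scores. The batch must contain at least two pairs so that every row has a negative.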

Findings

01. AMM improves retrieval accuracy across datasets.

02. Models trained on Spoken Moments outperform those trained on other datasets.

03. Spoken captions provide natural, concise descriptions for diverse videos.

Abstract

When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g. actions/objects/scenes/sentiment/etc.) while allowing us to gain new insight into what people find important or necessary to summarize specific events. Existing caption datasets for video understanding are either small in scale or restricted to a specific domain.…

Taxonomy

Methods: Contrastive Learning