Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

TL;DR
This paper introduces the Spoken Moments dataset, with 500k spoken captions for videos, and proposes a contrastive learning method, Adaptive Mean Margin (AMM), to improve video-caption retrieval and generalization.
Contribution
The paper presents a large-scale spoken caption dataset and a new contrastive learning approach that enhances video understanding and retrieval performance.
Findings
AMM improves retrieval accuracy across datasets.
Models trained on Spoken Moments outperform those trained on other datasets.
Spoken captions provide natural, concise descriptions for diverse videos.
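As a rough illustration of the kind of objective the findings refer to, the sketch below implements a softmax contrastive loss over a batch of video and caption embeddings, with a margin that adapts to the mean similarity of the negatives. This is an assumption-laden reading of the name "Adaptive Mean Margin", not the paper's exact formulation: the function names, the `alpha` scaling factor, and the rule "margin = alpha × (positive similarity − mean negative similarity)" are all hypothetical choices made for this example.

```python
import math

def _normalize(vec):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def _similarity_matrix(videos, captions):
    # sim[i][j] = cosine similarity of video i and caption j.
    v = [_normalize(x) for x in videos]
    c = [_normalize(x) for x in captions]
    return [[sum(a * b for a, b in zip(vi, cj)) for cj in c] for vi in v]

def _one_direction(sim, alpha):
    # Cross-entropy over each row, where the diagonal entry is the positive pair.
    n = len(sim)
    total = 0.0
    for i, row in enumerate(sim):
        pos = row[i]
        neg_mean = (sum(row) - pos) / (n - 1)
        # Hypothetical adaptive margin: grows with the gap between the
        # positive similarity and the mean negative similarity.
        margin = alpha * max(pos - neg_mean, 0.0)
        logits = list(row)
        logits[i] = pos - margin
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]
    return total / n

def adaptive_margin_loss(videos, captions, alpha=0.5):
    # Sum of video-to-caption and caption-to-video losses; needs batch size >= 2.
    sim = _similarity_matrix(videos, captions)
    transposed = [list(col) for col in zip(*sim)]
    return _one_direction(sim, alpha) + _one_direction(transposed, alpha)

# Toy batch: row i of `videos` is paired with row i of `captions`.
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
captions = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
loss_plain = adaptive_margin_loss(videos, captions, alpha=0.0)
loss_margin = adaptive_margin_loss(videos, captions, alpha=0.5)
```

With `alpha=0.0` this reduces to a plain bidirectional softmax contrastive loss; a positive `alpha` lowers the positive logit, so the model has to separate matched pairs from mismatched ones by a wider gap to reach the same loss.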
Abstract
When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g. actions/objects/scenes/sentiment/etc.) while allowing us to gain new insight into what people find important or necessary to summarize specific events. Existing caption datasets for video understanding are either small in scale or restricted to a specific domain.…
Taxonomy
Methods: Contrastive Learning
