Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision

Yujie Zhong; Linhai Xie; Sen Wang; Lucia Specia; Yishu Miao

arXiv:2011.09634 · cs.CV · January 12, 2021

TL;DR

This paper introduces a self-supervised framework for aligning natural language with noisy real-world videos, utilizing adversarial learning to handle noise and a new dataset for training and evaluation.

Contribution

It proposes a novel adversarial self-supervised learning approach for cross-modal video-language mapping and introduces the 'ApartmenTour' dataset for benchmarking.

Findings

01. Achieves state-of-the-art results on bidirectional retrieval tasks.

02. Effectively handles noise in natural videos with the adversarial module.

03. Demonstrates superior performance over strong baselines.

Abstract

In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. First, we define a self-supervised learning framework that captures the cross-modal information. We then introduce a novel adversarial learning module to explicitly handle the noise in natural videos, where subtitle sentences are not guaranteed to correspond strongly to the video snippets. For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles. We carry out experiments on bidirectional retrieval tasks between sentences and videos, and the results demonstrate that our proposed model achieves state-of-the-art performance on both retrieval tasks and exceeds several strong baselines. The dataset can be downloaded at…
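The bidirectional retrieval tasks mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which metric or similarity function the authors use, so Recall@K over cosine similarity is an assumption here (it is a common choice for sentence-video retrieval). The function names `cosine` and `recall_at_k` are hypothetical, and the embeddings are made-up toy values, not outputs of the paper's model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k.

    Assumes queries[i] and candidates[i] form a matched pair.
    """
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, cand) for cand in candidates]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy embeddings (assumed values): index i pairs a subtitle sentence with a snippet.
# The second video embedding is deliberately off-target to mimic a noisy pair.
sentence_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
video_emb = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9], [0.0, 0.3, 0.8]]

# Retrieval is scored in both directions.
s2v_r1 = recall_at_k(sentence_emb, video_emb, k=1)  # sentence -> video
v2s_r1 = recall_at_k(video_emb, sentence_emb, k=1)  # video -> sentence
s2v_r2 = recall_at_k(sentence_emb, video_emb, k=2)
v2s_r2 = recall_at_k(video_emb, sentence_emb, k=2)

print("sentence->video R@1:", s2v_r1, "R@2:", s2v_r2)
print("video->sentence R@1:", v2s_r1, "R@2:", v2s_r2)
```

In this toy setup the weakly matched pair lowers Recall@1 in both directions, which is the kind of degradation a noise-handling component would aim to reduce.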

Code & Models

Repositories

zyj-13/WAL (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques