Learning Multimodal Representations for Unseen Activities

AJ Piergiovanni; Michael S. Ryoo

arXiv:1806.08251·cs.CV·July 8, 2020

TL;DR

This paper introduces a joint text-video embedding, trained on both paired and unpaired data with an adversarial objective, that can recognize and describe activities unseen during training.

Contribution

The paper proposes a joint embedding method whose adversarial training lets it use unpaired text and video data to recognize unseen activities.

Findings

01. Improved zero-shot activity classification accuracy.

02. Enhanced unsupervised activity discovery.

03. Better unseen activity captioning results.

Abstract

We present a method to learn a joint multimodal representation space that enables recognition of unseen activities in videos. We first compare the effect of placing various constraints on the embedding space using paired text and video data. We also propose a method to improve the joint embedding space using an adversarial formulation, allowing it to benefit from unpaired text and video data. By using unpaired text data, we show the ability to learn a representation that better captures unseen activities. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that using paired and unpaired data to learn a shared embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning, outperforming the state of the art.
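
To make the zero-shot classification task concrete, the following is a minimal sketch of how a shared text/video embedding space could be used to label an unseen activity: embed the video, embed the text of each candidate class, and pick the nearest class by cosine similarity. This is an illustration under stated assumptions, not the authors' implementation. The function names `l2_normalize` and `zero_shot_classify` are hypothetical, and the toy vectors stand in for the outputs of trained video and text encoders, which are not shown here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(video_embs, class_text_embs):
    """Assign each video to the class whose text embedding is closest.

    video_embs:      (num_videos, dim) array, assumed to come from a video encoder.
    class_text_embs: (num_classes, dim) array, assumed to come from a text encoder
                     applied to descriptions of classes never seen in training.
    Returns the predicted class index per video and the full similarity matrix.
    """
    v = l2_normalize(video_embs)
    t = l2_normalize(class_text_embs)
    scores = v @ t.T  # cosine similarity, shape (num_videos, num_classes)
    return scores.argmax(axis=-1), scores


# Toy stand-ins for encoder outputs (hypothetical values, 3-dimensional space).
class_text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
video_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.8],
])

pred, scores = zero_shot_classify(video_embs, class_text_embs)
print(pred)
```

Nearest-neighbor lookup of this kind only works if both modalities map into the same space, which is the property the joint embedding is trained to provide.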

Code & Models

Repositories

piergiaj/mlb-youtube
PyTorch · Official