Oops! Predicting Unintentional Action in Video

Dave Epstein; Boyuan Chen; Carl Vondrick

arXiv:1911.11206 · cs.CV · November 27, 2019

TL;DR

This paper introduces a new dataset and tasks for recognizing unintentional actions in videos, compares machine and human performance, and explores self-supervised learning approaches to improve recognition accuracy.

Contribution

It provides the first dataset and benchmark suite for unintentional action recognition in videos, along with analysis of supervised and self-supervised methods.

Findings

01

Supervised neural networks perform below human consistency.

02

A self-supervised representation based on the intrinsic speed of video performs competitively with highly supervised pretraining.

03

A significant gap remains between machine and human performance.

Abstract

From just a short glance at a video, we can often tell whether a person's action is intentional or not. Can we train a model to recognize this? We introduce a dataset of in-the-wild videos of unintentional action, as well as a suite of tasks for recognizing, localizing, and anticipating its onset. We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks. We also investigate self-supervised representations that leverage natural signals in our dataset, and show the effectiveness of an approach that uses the intrinsic speed of video to perform competitively with highly supervised pretraining. However, a significant gap between machine and human performance remains. The project website is available at https://oops.cs.columbia.edu
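As a rough illustration of how the speed of a video can serve as a free training signal, the sketch below samples a clip at a randomly chosen frame stride and uses the index of that stride as the label a network would be trained to predict. This is not the authors' implementation: the function name `sample_speed_clip`, the stride set `SPEEDS`, and the clip length are all assumptions made for illustration.

```python
import random

# Hypothetical speed classes: the frame stride used when sampling a clip.
# These values are assumed for illustration, not taken from the paper.
SPEEDS = (1, 2, 4, 8)


def sample_speed_clip(num_frames, clip_len, rng):
    """Return (frame_indices, speed_label) for one self-supervised example."""
    # Keep only the speeds for which a full clip fits inside the video.
    valid = [i for i, s in enumerate(SPEEDS) if (clip_len - 1) * s < num_frames]
    if not valid:
        raise ValueError("video too short for any speed class")

    label = rng.choice(valid)          # the pretext-task target
    stride = SPEEDS[label]
    span = (clip_len - 1) * stride     # distance from first to last sampled frame
    start = rng.randint(0, num_frames - 1 - span)  # randint is inclusive
    indices = [start + k * stride for k in range(clip_len)]
    return indices, label


rng = random.Random(0)
indices, label = sample_speed_clip(num_frames=300, clip_len=16, rng=rng)
```

A classifier trained to recover `label` from the frames at `indices` needs no human annotation, which is what makes the signal self-supervised. The learned features could then be fine-tuned on the recognition, localization, and anticipation tasks.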

Code & Models

Repositories

cvlab-columbia/oops
PyTorch

Videos

Oops! Predicting Unintentional Action in Video · YouTube
