TL;DR
This paper introduces a new dataset and tasks for recognizing unintentional actions in videos, compares machine and human performance, and explores self-supervised learning approaches to improve recognition accuracy.
Contribution
It introduces a dataset of in-the-wild videos and a benchmark suite for unintentional action in video, covering recognition, localization, and anticipation of onset, along with an analysis of supervised and self-supervised methods.
Findings
A supervised neural network baseline performs below human consistency on the tasks.
A self-supervised representation that uses the intrinsic speed of video performs competitively with highly supervised pretraining.
A significant gap remains between machine and human performance.
Abstract
From just a short glance at a video, we can often tell whether a person's action is intentional or not. Can we train a model to recognize this? We introduce a dataset of in-the-wild videos of unintentional action, as well as a suite of tasks for recognizing, localizing, and anticipating its onset. We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks. We also investigate self-supervised representations that leverage natural signals in our dataset, and show the effectiveness of an approach that uses the intrinsic speed of video to perform competitively with highly-supervised pretraining. However, a significant gap between machine and human performance remains. The project website is available at https://oops.cs.columbia.edu
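The abstract does not spell out how the "intrinsic speed of video" is turned into a training signal. One common way to build such a signal, sketched here purely as an assumption, is to subsample frames at different rates and ask a model to predict which rate was used. The function name `sample_speed_clip`, the speed factors in `SPEEDS`, and the clip length are all hypothetical and not taken from the paper.

```python
import random

# Hypothetical playback-speed factors; not taken from the paper.
SPEEDS = (1, 2, 4)

def sample_speed_clip(num_frames, clip_len, speeds=SPEEDS, rng=random):
    """Return (frame_indices, speed_label) for one self-labelled clip.

    The clip keeps every `speed`-th frame, so a larger factor looks like
    faster playback. The label is the index of the chosen factor in `speeds`.
    """
    # Only factors whose clip fits inside the video are eligible.
    feasible = [s for s in speeds if (clip_len - 1) * s < num_frames]
    if not feasible:
        raise ValueError("video too short for the requested clip length")
    speed = rng.choice(feasible)
    span = (clip_len - 1) * speed               # first-to-last frame distance
    start = rng.randrange(num_frames - span)    # keeps the last index in range
    indices = [start + i * speed for i in range(clip_len)]
    return indices, speeds.index(speed)
```

A classifier trained to recover `speed_label` from the frames at `frame_indices` needs no manual annotation, which is what makes this kind of signal self-supervised. Whether this matches the authors' formulation is not stated in the abstract.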
Videos
Oops! Predicting Unintentional Action in Video (YouTube)
Taxonomy
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
