TL;DR
This paper localizes fine-grained actions in temporally untrimmed web videos by cross-domain transfer between weakly labeled video frames and noisily labeled web images. The localized action frames are then used to train action recognition models with long short-term memory networks.
Contribution
It proposes a cross-domain transfer method, built on pre-trained deep convolutional neural networks, that takes weak video-level labels and noisy web image labels as input and outputs localized action frames.
Findings
Web images queried by action names serve as well-localized highlights for many actions, although their labels are noisy.
High accuracy on FGA-240 and THUMOS 2014 datasets.
Action frames can be localized for training using only weak video-level labels and noisy image labels.
Abstract
We address the problem of fine-grained action localization from temporally untrimmed web videos. We assume that only weak video-level annotations are available for training. The goal is to use these weak labels to identify temporal segments corresponding to the actions, and learn models that generalize to unconstrained web videos. We find that web images queried by action names serve as well-localized highlights for many actions, but are noisily labeled. To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output. This is achieved by cross-domain transfer between video frames and web images, using pre-trained deep convolutional neural networks. We then use the localized action frames to train action recognition models with long short-term memory networks. We collect a…
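The cross-domain transfer described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a nearest-centroid classifier on hand-made 2-D features in place of pre-trained convolutional networks, a single filtering pass in each direction, and hypothetical names (`transfer`, `select`, `class_probs`, `threshold`). Every frame first inherits its video's weak label, a model fit on those frames filters the noisy web images, and a model fit on the surviving images then picks the action frames in each video.

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def class_probs(x, centroids):
    """Softmax over negative squared distances to each class centroid."""
    scores = {c: -sum((a - b) ** 2 for a, b in zip(x, mu))
              for c, mu in centroids.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def select(items, label, centroids, threshold):
    """Indices of items whose probability for `label` exceeds `threshold`."""
    return [i for i, x in enumerate(items)
            if class_probs(x, centroids)[label] > threshold]

def transfer(videos, images, threshold=0.6):
    """videos: list of (weak_label, frame_features); images: {label: features}.

    Returns, per video, the indices of frames kept as localized action frames.
    """
    # Video -> image: every frame inherits its video's weak label; the
    # resulting (noisy) frame model is used to drop web images it disagrees with.
    frames_by_label = {}
    for label, frames in videos:
        frames_by_label.setdefault(label, []).extend(frames)
    frame_centroids = {c: centroid(f) for c, f in frames_by_label.items()}

    kept_images = {}
    for label, feats in images.items():
        idx = select(feats, label, frame_centroids, threshold)
        # Assumption: fall back to all images if the filter rejects everything.
        kept_images[label] = [feats[i] for i in idx] or feats

    # Image -> video: a model fit on the filtered images scores each frame
    # against the video's own weak label.
    image_centroids = {c: centroid(f) for c, f in kept_images.items()}
    return [select(frames, label, image_centroids, threshold)
            for label, frames in videos]

# Toy data: background frames sit at the origin; each image set has one mislabeled item.
videos = [("dunk", [[0, 0], [1, 0], [0.9, 0.1], [0, 0]]),
          ("serve", [[0, 0], [0, 1], [0.1, 0.9], [0, 0]])]
images = {"dunk": [[1, 0], [0.9, 0], [0, 1]],
          "serve": [[0, 1], [0, 0.9], [1, 0]]}
print(transfer(videos, images))  # frame indices kept per video
```

In this toy setup the mislabeled image in each class is filtered out and only the two middle frames of each video survive. With real data the features would come from a pre-trained network and the threshold would need tuning.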
