Unsupervised Learning of Important Objects from First-Person Videos
Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

TL;DR
This paper introduces an unsupervised method for detecting important objects in first-person videos by jointly learning segmentation and recognition without requiring manual importance labels, using a novel Visual-Spatial Network architecture.
Contribution
The work presents a new unsupervised learning framework with a Visual-Spatial Network that detects important objects without human-provided importance annotations.
Findings
Reports results comparable to or better than those of supervised methods on two datasets.
Introduces a cross-pathway supervision scheme within the Visual-Spatial Network.
Demonstrates effective important object detection without manual labels.
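This summary names a cross-pathway supervision scheme but does not spell out its mechanics. The toy sketch below is an illustration under stated assumptions, not the authors' Visual-Spatial Network: it shows only the general idea of two pathways that take turns supplying training targets to each other, starting from noisy proposals rather than human labels. The function names, the logistic-regression pathways, the 0.5 threshold, and the synthetic data are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def add_bias(X):
    # Append a constant column so the last weight acts as a bias term.
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_logistic(X, y, steps=200, lr=0.5):
    # Plain gradient-descent logistic regression (stand-in for one pathway).
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict(X, w):
    return sigmoid(add_bias(X) @ w)


def cross_pathway_training(visual_feats, spatial_feats, initial_targets, rounds=3):
    # Assumed scheme: each pathway is fit on the other's thresholded output.
    targets = initial_targets.astype(float)
    for _ in range(rounds):
        w_vis = fit_logistic(visual_feats, targets)
        targets = (predict(visual_feats, w_vis) > 0.5).astype(float)
        w_spa = fit_logistic(spatial_feats, targets)
        targets = (predict(spatial_feats, w_spa) > 0.5).astype(float)
    return w_vis, w_spa


def demo(seed=0, n=200):
    # Synthetic data: hidden "importance" y drives both feature views.
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    sign = (2 * y - 1)[:, None]
    visual = sign + 0.5 * rng.normal(size=(n, 2))
    spatial = sign + 0.5 * rng.normal(size=(n, 2))
    # Noisy starting proposals: about 20% of the hidden labels are flipped.
    flip = rng.random(n) < 0.2
    noisy = np.where(flip, 1 - y, y)
    w_vis, w_spa = cross_pathway_training(visual, spatial, noisy)
    # y is used only to score the result, never for training.
    accuracy = float(np.mean((predict(visual, w_vis) > 0.5) == (y == 1)))
    return w_vis, w_spa, accuracy
```

Calling `demo()` returns the two weight vectors and the accuracy of the visual pathway against the hidden labels. In this toy setting the two views are independent given the label, which is what lets one pathway correct the other's noise; real first-person video features would not satisfy that assumption so cleanly.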
Abstract
A first-person camera, placed at a person's head, captures which objects are important to the camera wearer. Most prior methods for this task learn to detect such important objects from manually labeled first-person data in a supervised fashion. However, important objects are strongly related to the camera wearer's internal state, such as his intentions and attention, and thus only the person wearing the camera can provide the importance labels. Such a constraint makes the annotation process costly and limited in scalability. In this work, we show that we can detect important objects in first-person images without supervision from the camera wearer or even third-person labelers. We formulate the important object detection problem as an interplay between 1) segmentation and 2) recognition agents. The segmentation agent first proposes a possible important object segmentation mask for…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
