ActionVOS: Actions as Prompts for Video Object Segmentation
Liangyang Ouyang, Ruicong Liu, Yifei Huang, Ryosuke Furuta, Yoichi Sato

TL;DR
This paper introduces ActionVOS, a new egocentric video object segmentation setting that uses human actions as prompts to better identify active objects and handle state changes, improving segmentation accuracy.
Contribution
It proposes a novel action-aware RVOS framework leveraging human actions as prompts, with a dedicated labeling module and loss function, advancing the understanding of active objects in egocentric videos.
Findings
Reduces mis-segmentation of inactive objects
Improves segmentation during object state changes
Enhances performance on challenging egocentric video datasets
Abstract
Delving into the realm of egocentric vision, the advancement of referring video object segmentation (RVOS) stands as pivotal in understanding human activities. However, the existing RVOS task primarily relies on static attributes such as object names to segment target objects, which makes it hard to distinguish target objects from background objects and to identify objects undergoing state changes. To address these problems, this work proposes a novel action-aware RVOS setting called ActionVOS, which aims to segment only active objects in egocentric videos using human actions as a key language prompt. Human actions precisely describe the behavior of humans, and therefore help to identify the objects truly involved in the interaction and to understand possible state changes. We also build a method tailored to work under this specific setting. Specifically, we develop an…
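To make the setting concrete, the sketch below contrasts a static object-name prompt with an action-aware one, and shows the intended output of keeping only active objects. It is an illustration under stated assumptions, not the authors' implementation: `Query`, `build_prompt`, `select_active`, and the prompt template are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Query:
    """A language query for one candidate object (hypothetical structure)."""
    object_name: str
    action: Optional[str] = None  # e.g. an action narration for the clip


def build_prompt(query: Query) -> str:
    """Return the text prompt for a segmentation model.

    Without an action this is the static, name-only prompt of conventional
    RVOS. With an action, the name is combined with the action text.
    The template string is an assumption made for illustration.
    """
    if query.action is None:
        return query.object_name
    return f"{query.object_name} used in the action of {query.action}"


def select_active(candidates: List[str], active: Dict[str, bool]) -> List[str]:
    """Keep only the objects flagged as involved in the action.

    Objects missing from the `active` mapping are treated as inactive.
    """
    return [name for name in candidates if active.get(name, False)]


# Usage: three objects are visible, but only two take part in the action.
action = "cut the apple"
candidates = ["knife", "apple", "bowl"]
prompts = [build_prompt(Query(name, action)) for name in candidates]
targets = select_active(candidates, {"knife": True, "apple": True, "bowl": False})
```

Under the name-only prompt, all three objects would be equally valid targets. With the action attached, the bowl is a visible but inactive object and is excluded from the targets.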
Taxonomy
Topics: Visual Attention and Saliency Detection
