ActionVOS: Actions as Prompts for Video Object Segmentation

Liangyang Ouyang; Ruicong Liu; Yifei Huang; Ryosuke Furuta; Yoichi Sato

arXiv:2407.07402·cs.CV·July 11, 2024

TL;DR

This paper introduces ActionVOS, a new egocentric video object segmentation setting that uses human actions as prompts to better identify active objects and handle state changes, improving segmentation accuracy.

Contribution

The paper proposes a novel action-aware referring video object segmentation (RVOS) framework that uses human actions as prompts, with a dedicated labeling module and loss function, advancing the understanding of active objects in egocentric videos.

Findings

01 Reduces mis-segmentation of inactive objects

02 Improves segmentation during object state changes

03 Enhances performance on challenging egocentric video datasets

Abstract

In egocentric vision, the advancement of referring video object segmentation (RVOS) is pivotal to understanding human activities. However, the existing RVOS task primarily relies on static attributes such as object names to segment target objects, which makes it hard to distinguish target objects from background objects and to identify objects undergoing state changes. To address these problems, this work proposes a novel action-aware RVOS setting called ActionVOS, which aims to segment only active objects in egocentric videos using human actions as a key language prompt. Human actions precisely describe the behavior of humans, and therefore help to identify the objects truly involved in the interaction and to understand possible state changes. We also build a method tailored to work under this specific setting. Specifically, we develop an…
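
The setting described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the `Candidate` class, the `is_active` flag, the prompt wording in `build_prompt`, and the nested-list mask format are all assumptions made for illustration. It only shows the idea that, given an action prompt, the segmentation target covers the active objects and treats inactive ones as background.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    """One object named in the prompt, with a toy binary mask (hypothetical format)."""
    name: str
    mask: list[list[int]]  # 1 = pixel belongs to the object
    is_active: bool        # assumed label: is the object involved in the action?


def build_prompt(action: str, names: list[str]) -> str:
    """Combine an action narration with object names (illustrative wording only)."""
    return f"{', '.join(names)} used in the action of {action}"


def active_target(candidates: list[Candidate]) -> list[list[int]]:
    """Union of the masks of active objects; inactive objects count as background."""
    height, width = len(candidates[0].mask), len(candidates[0].mask[0])
    target = [[0] * width for _ in range(height)]
    for cand in candidates:
        if not cand.is_active:
            continue  # skip objects that are named but not involved in the action
        for r in range(height):
            for c in range(width):
                target[r][c] |= cand.mask[r][c]
    return target


candidates = [
    Candidate("knife", [[1, 0, 0], [0, 0, 0]], is_active=True),
    Candidate("onion", [[0, 1, 0], [0, 1, 0]], is_active=True),
    Candidate("bowl", [[0, 0, 0], [0, 0, 1]], is_active=False),
]
prompt = build_prompt("cut onion", [c.name for c in candidates])
target = active_target(candidates)
```

Here the bowl appears in the prompt but is excluded from `target`, which mirrors the finding that action prompts reduce mis-segmentation of inactive objects.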

Code & Models

Repositories

ut-vision/actionvos
PyTorch · Official

Taxonomy

Topics: Visual Attention and Saliency Detection