VISA: Reasoning Video Object Segmentation via Large Language Models

Cilin Yan; Haochen Wang; Shilin Yan; Xiaolong Jiang; Yao Hu; Guoliang Kang; Weidi Xie; Efstratios Gavves

arXiv:2407.11325 · cs.CV · July 17, 2024 · 2 cites

TL;DR

This paper introduces ReasonVOS, a new task for video object segmentation that involves complex reasoning with world knowledge, and presents VISA, a model leveraging large language models for this purpose, supported by a large benchmark dataset.

Contribution

The paper proposes ReasonVOS, a novel segmentation task requiring reasoning, and introduces VISA, a multi-modal LLM-based model with a new benchmark dataset for training and evaluation.

Findings

1. VISA effectively handles complex reasoning in video segmentation.
2. VISA outperforms existing methods on multiple datasets.
3. The benchmark facilitates future research in reasoning-based segmentation.

Abstract

Existing Video Object Segmentation (VOS) relies on explicit user instructions, such as categories, masks, or short phrases, which restricts its ability to perform complex video segmentation that requires reasoning with world knowledge. In this paper, we introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities based on world knowledge and video contexts. Such reasoning is crucial for structured environment understanding and object-centric interactions, both pivotal in the development of embodied AI. To tackle ReasonVOS, we introduce VISA (Video-based large language Instructed Segmentation Assistant), which leverages the world knowledge reasoning capabilities of multi-modal LLMs while possessing the ability to segment and track objects in videos with a mask…
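
The task definition in the abstract (an implicit text query in, a sequence of segmentation masks out) can be illustrated with a small data-structure sketch. This is not the authors' code or data format: the `ReasonVOSSample` class, its field names, the `empty_masks` helper, and the example query are all hypothetical and serve only to show the input/output shape the task implies, assuming one binary mask per frame.

```python
from dataclasses import dataclass
from typing import List

# A binary mask stored as rows of 0/1 values (hypothetical representation).
Mask = List[List[int]]


@dataclass
class ReasonVOSSample:
    """Hypothetical container for one ReasonVOS example."""
    num_frames: int
    height: int
    width: int
    query: str          # implicit text query that needs reasoning to resolve
    masks: List[Mask]   # assumed: one binary mask per video frame

    def validate(self) -> None:
        # The task output is a sequence of masks aligned with the frames.
        if len(self.masks) != self.num_frames:
            raise ValueError("expected one mask per frame")
        for mask in self.masks:
            if len(mask) != self.height or any(len(row) != self.width for row in mask):
                raise ValueError("mask shape does not match frame size")
            if any(value not in (0, 1) for row in mask for value in row):
                raise ValueError("masks must be binary")


def empty_masks(num_frames: int, height: int, width: int) -> List[Mask]:
    """Placeholder prediction: an all-background mask for every frame."""
    return [[[0] * width for _ in range(height)] for _ in range(num_frames)]


sample = ReasonVOSSample(
    num_frames=3,
    height=2,
    width=4,
    query="Which object would the person use to cut the bread?",  # made-up query
    masks=empty_masks(3, 2, 4),
)
sample.validate()  # raises ValueError if the masks do not line up with the frames
```

The point of the sketch is the contract only: the query never names the target object directly, and the output is a per-frame mask sequence rather than a single image mask.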

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques