VISA: Reasoning Video Object Segmentation via Large Language Models
Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, Efstratios Gavves

TL;DR
This paper introduces ReasonVOS, a new task for video object segmentation that involves complex reasoning with world knowledge, and presents VISA, a model leveraging large language models for this purpose, supported by a large benchmark dataset.
Contribution
The paper proposes ReasonVOS, a novel segmentation task requiring reasoning, and introduces VISA, a multi-modal LLM-based model with a new benchmark dataset for training and evaluation.
Findings
VISA effectively handles complex reasoning in video segmentation.
VISA outperforms existing methods on multiple datasets.
The benchmark facilitates future research in reasoning-based segmentation.
Abstract
Existing Video Object Segmentation (VOS) relies on explicit user instructions, such as categories, masks, or short phrases, restricting its ability to perform complex video segmentation requiring reasoning with world knowledge. In this paper, we introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities based on world knowledge and video contexts, which is crucial for structured environment understanding and object-centric interactions, pivotal in the development of embodied AI. To tackle ReasonVOS, we introduce VISA (Video-based large language Instructed Segmentation Assistant), to leverage the world knowledge reasoning capabilities of multi-modal LLMs while possessing the ability to segment and track objects in videos with a mask…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
