InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Zheng, Zhao, Yujiu Yang

TL;DR
InstructSeg introduces a unified, end-to-end multi-modal large language model-based framework for visual segmentation that effectively handles both image and video tasks through multi-task training and advanced text-visual integration.
Contribution
This work unifies image and video visual segmentation under a single MLLM-based pipeline, leveraging multi-task training and novel text fusion techniques.
Findings
Outperforms existing segmentation models on diverse tasks
Effectively integrates global and detailed text information
Achieves superior results with a single unified model
Abstract
Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision-guided multi-granularity text fusion to better integrate global and detailed text information with…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
