InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

Cong Wei; Yujie Zhong; Haoxian Tan; Yingsen Zeng; Yong Liu; Zheng Zhao; Yujiu Yang

arXiv:2412.14006·cs.CV·December 19, 2024

TL;DR

InstructSeg introduces a unified, end-to-end multi-modal large language model-based framework for visual segmentation that effectively handles both image and video tasks through multi-task training and advanced text-visual integration.

Contribution

This work unifies image and video visual segmentation under a single MLLM-based pipeline, leveraging multi-task training and novel text fusion techniques.

Findings

01 Outperforms existing segmentation models on diverse tasks

02 Effectively integrates global and detailed text information

03 Achieves superior results with a single unified model

Abstract

Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision-guided multi-granularity text fusion to better integrate global and detailed text information with…
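The abstract defines IVS as the union of referring and reasoning segmentation at both the image and video levels, which amounts to four task settings. As a minimal illustrative sketch (the class and field names below are hypothetical and not taken from the InstructSeg codebase), the task space can be enumerated as follows:

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Modality(Enum):
    IMAGE = "image"
    VIDEO = "video"


class InstructionType(Enum):
    # Assumption: "referring" means the text describes the target directly,
    # while "reasoning" means the target must be inferred from the instruction.
    REFERRING = "referring"
    REASONING = "reasoning"


@dataclass(frozen=True)
class IVSTask:
    """One task setting covered by Instructed Visual Segmentation (IVS)."""
    modality: Modality
    instruction: InstructionType

    @property
    def name(self) -> str:
        return f"{self.modality.value}-level {self.instruction.value} segmentation"


# IVS is the union of both instruction types at both levels: 2 x 2 settings.
IVS_TASKS = [IVSTask(m, i) for m, i in product(Modality, InstructionType)]

if __name__ == "__main__":
    for task in IVS_TASKS:
        print(task.name)
```

A single unified model, as the abstract proposes, would accept any of these four settings as input rather than requiring a separate model per setting.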

Code & Models

Repositories

congvvc/instructseg
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning