Advancing Visual Large Language Model for Multi-granular Versatile Perception

Wentao Xiang; Haoxian Tan; Cong Wei; Yujie Zhong; Dengjie Li; Yujiu Yang

arXiv:2507.16213·cs.CV·July 23, 2025

Open Access

TL;DR

This paper introduces MVP-LM, a versatile visual perception framework built on a visual large language model. It unifies tasks such as detection, segmentation, and grounding within a single architecture, targeting broader applicability than models restricted to a subset of these tasks.

Contribution

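The paper presents MVP-LM, a multi-granular, versatile perception framework that combines a multi-granularity decoder with a CoT-inspired dataset unification strategy, so that word-based and sentence-based perception tasks with box and mask predictions can be fine-tuned within one model.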

Findings

01. Effective across diverse benchmarks

02. Unifies multiple perception tasks in a single model

03. Demonstrates superior performance over existing methods

Abstract

Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type. Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains its applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating a Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum…
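
The four groups mentioned in the abstract come from crossing the two dimensions (instruction type and prediction type). A minimal sketch of that 2×2 taxonomy is shown below. The enum names, the `perception_groups` helper, and the example task assigned to each cell are illustrative assumptions, not names or mappings taken from the paper.

```python
from enum import Enum
from itertools import product


class Instruction(Enum):
    """Instruction type: how the target is specified (hypothetical naming)."""
    WORD = "word-based"
    SENTENCE = "sentence-based"


class Prediction(Enum):
    """Prediction type: what the model outputs (hypothetical naming)."""
    BOX = "box"
    MASK = "mask"


def perception_groups():
    """Return every (instruction, prediction) combination."""
    return list(product(Instruction, Prediction))


# Example task per cell. This mapping is an assumption made for
# illustration; the abstract does not list tasks per group.
EXAMPLE_TASKS = {
    (Instruction.WORD, Prediction.BOX): "object detection",
    (Instruction.WORD, Prediction.MASK): "instance segmentation",
    (Instruction.SENTENCE, Prediction.BOX): "referring expression comprehension",
    (Instruction.SENTENCE, Prediction.MASK): "referring expression segmentation",
}

if __name__ == "__main__":
    for instruction, prediction in perception_groups():
        task = EXAMPLE_TASKS[(instruction, prediction)]
        print(f"{instruction.value:>14} + {prediction.value:<4} e.g. {task}")
```

A model that handles only one or two of these cells covers a subset of the grid, which is the limitation the abstract attributes to existing work.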

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques