Osprey: Pixel Understanding with Visual Instruction Tuning

Yuqian Yuan; Wentong Li; Jian Liu; Dongqi Tang; Xinjie Luo; Chi Qin; Lei Zhang; Jianke Zhu

arXiv:2312.10032 · cs.CV · September 9, 2025 · 2 cites

TL;DR

Osprey is a mask-text instruction tuning method for multimodal large language models. It incorporates fine-grained mask regions into language instructions and pairs them with a curated mask-based region-text dataset of 724K samples, enabling pixel-level region understanding.

Contribution

The paper presents a mask-text instruction tuning approach and a 724K-sample mask-based region-text dataset, advancing pixel-wise visual understanding in multimodal large language models.

Findings

01. Osprey outperforms existing models in region understanding tasks.

02. It can be integrated with SAM for multi-granularity semantics.

03. Demonstrates effective fine-grained pixel-level visual understanding.

Abstract

Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at pixel level. Besides, the lack of mask-based instruction data limits their advancements. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a…
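
The abstract is cut off before it explains how mask regions become features for the LLM, so the sketch below does not reproduce Osprey's extractor. It only illustrates one generic way a pixel-level region could be summarized from a feature map: masked average pooling, where the feature vectors under a binary mask are averaged into a single region vector. The names `mask_pool`, `features`, and `mask` are hypothetical, and the toy values are made up for illustration.

```python
def mask_pool(features, mask):
    """Average the feature vectors at positions selected by a binary mask.

    features: H x W x C nested lists of floats (a toy feature map).
    mask:     H x W nested lists of 0/1 (1 marks pixels inside the region).
    Returns a list of C floats: the mean feature over the masked region.
    NOTE: illustrative assumption only, not the method from the paper.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for row_feats, row_mask in zip(features, mask):
        for vec, inside in zip(row_feats, row_mask):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


# Toy 2x2 feature map with 2 channels per position (made-up values).
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# Region covering the top-left and bottom-right positions.
mask = [
    [1, 0],
    [0, 1],
]

region_vector = mask_pool(features, mask)
print(region_vector)
```

In a real pipeline the feature map would come from the vision encoder and the pooled vector would still need to be projected into the language model's embedding space; both steps are outside what the truncated abstract specifies.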

Code & Models

Datasets

AntGroup-MI/Osprey-724K
dataset · 71 downloads

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications

Methods: Focus · Contrastive Language-Image Pre-training