PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

Yuqian Yuan; Wenqiao Zhang; Xin Li; Shihao Wang; Kehan Li; Wentong Li; Jun Xiao; Lei Zhang; Beng Chin Ooi

arXiv:2510.23603 · cs.CV · November 4, 2025

TL;DR

PixelRefer introduces a unified framework for fine-grained, object-centric understanding in images and videos using a novel tokenization method, achieving high performance with reduced computational costs.

Contribution

The paper proposes PixelRefer, a novel region-level multimodal language model with a scale-adaptive tokenizer and an efficient variant, PixelRefer-Lite, for fine-grained visual understanding.

Findings

1. PixelRefer outperforms existing models on multiple benchmarks.
2. PixelRefer-Lite achieves competitive accuracy with lower computational cost.
3. The curated PixelRefer-2.2M dataset enhances instruction tuning.

Abstract

Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion…

Code & Models

Datasets

DAMO-NLP-SG/VideoRefer-700K
dataset· 571 dl
