PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li, Jun Xiao, Lei Zhang, Beng Chin Ooi

TL;DR
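PixelRefer is a unified region-level MLLM framework for fine-grained, object-centric understanding of user-specified regions in images and videos. It represents each region with a Scale-Adaptive Object Tokenizer (SAOT), and its efficient variant, PixelRefer-Lite, keeps accuracy competitive at lower computational cost.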
Contribution
The paper proposes PixelRefer, a region-level multimodal large language model built on a Scale-Adaptive Object Tokenizer (SAOT), together with PixelRefer-Lite, an efficient variant, for fine-grained visual understanding.
Findings
PixelRefer outperforms existing models on multiple benchmarks.
PixelRefer-Lite achieves competitive accuracy with lower computational cost.
The curated PixelRefer-2.2M dataset enhances instruction tuning.
Abstract
Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion…
