ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

Tianming Liang; Kun-Yu Lin; Chaolei Tan; Jianguo Zhang; Wei-Shi Zheng; Jian-Fang Hu

arXiv:2501.14607 · cs.CV · July 1, 2025

TL;DR

ReferDINO is a referring video object segmentation (RVOS) model that combines region-level vision-language grounding, pixel-level dense perception, and spatiotemporal reasoning. It reports state-of-the-art results at real-time speed.

Contribution

The paper introduces an RVOS model built on foundational visual grounding models and adds two components: a grounding-guided deformable mask decoder and an object-consistent temporal enhancer. Together they couple vision-language understanding with dense spatiotemporal reasoning.

Findings

01 Outperforms previous methods by +3.9% on Ref-YouTube-VOS

02 Achieves real-time inference at 51 FPS

03 Demonstrates significant improvements across five benchmarks

Abstract

Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose ReferDINO, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer…

Code & Models

Models

🤗 liangtm/referdino · model · ♡ 2

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Pruning