Object-centric Video Question Answering with Visual Grounding and Referring

Haochen Wang; Qirui Chen; Cilin Yan; Jiayin Cai; Xiaolong Jiang; Yao Hu; Weidi Xie; Stratis Gavves

arXiv:2507.19599 · cs.CV · July 29, 2025

TL;DR

This paper introduces a VideoLLM that supports object referring on input and grounding on output, a Spatial-Temporal Overlay Module (STOM) that propagates visual prompts across video frames, and a curated dataset, VideoInfer. It reports state-of-the-art results across multiple video understanding benchmarks.

Contribution

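The paper contributes three things: a VideoLLM that accepts both textual and visual prompts and can ground its answers in the video; STOM, which propagates a visual prompt given at any single timestamp to the remaining frames; and VideoInfer, a manually curated object-centric video instruction dataset.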

Findings

01. Outperforms baselines on 12 benchmarks across 6 tasks

02. Demonstrates robustness in object-centric video reasoning

03. Enables multimodal, multiround video interactions

Abstract

Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multiround interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks, i.e., allowing users to interact with videos using both textual and visual prompts; (ii) we propose STOM (Spatial-Temporal Overlay Module), a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video; (iii) we present VideoInfer, a manually curated object-centric video instruction dataset featuring…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling