OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios

Hong Gao; Jingyu Wu; Xiangkai Xu; Kangni Xie; Yunchen Zhang; Bin Zhong; Xurui Gao; Min-Ling Zhang

arXiv:2511.16937 · cs.CV · November 24, 2025

Open Access

TL;DR

OmniGround is a comprehensive benchmark for spatio-temporal video grounding (STVG). It addresses the limited scope of existing benchmarks with diverse, complex real-world data and a systematic evaluation framework, supporting more rigorous assessment of model robustness.

Contribution

The paper introduces OmniGround, a diverse benchmark of 3,475 videos spanning 81 categories, together with the Forward-Backward-Refinement annotation pipeline and the DeepSTG evaluation framework, for assessing models on complex real-world scenarios.

Findings

01  Models experience a 10.4% performance drop on complex scenes.

02  The PG-TAF framework improves grounding accuracy by over 25%.

03  OmniGround enables more robust and realistic evaluation of STVG models.

Abstract

Spatio-Temporal Video Grounding (STVG) aims to localize target objects in videos based on natural language descriptions. Despite recent advances in Multimodal Large Language Models, a significant gap remains between current models and real-world demands involving diverse objects and complex queries. We attribute this to limited benchmark scope, causing models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness. To address these limitations, we introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We propose the Forward-Backward-Refinement annotation pipeline that combines multi-directional tracking with intelligent error correction for high-quality labels. We further introduce DeepSTG, a systematic evaluation framework quantifying dataset quality across four complementary dimensions beyond…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning