Agentic Spatio-Temporal Grounding via Collaborative Reasoning
Heng Zhao, Yew-Soon Ong, Joey Tianyi Zhou

TL;DR
This paper introduces ASTG, a novel open-world, training-free framework for spatio-temporal video grounding that employs collaborative reasoning agents leveraging multimodal large language models to improve efficiency and performance.
Contribution
The paper proposes a self-guided, agent-based approach to spatio-temporal grounding that decouples spatial and temporal reasoning into two collaborating agents and automates tube extraction without extensive supervision.
Findings
Outperforms existing weakly-supervised and zero-shot methods.
Achieves performance comparable to some fully-supervised approaches.
Enhances retrieval efficiency with dedicated visual memory and dialogue context.
Abstract
Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization. Weakly-supervised variants reduce annotation costs but remain constrained by the dataset-level train-and-fit paradigm and deliver inferior performance. To address these challenges, we propose the Agentic Spatio-Temporal Grounder (ASTG) framework for STVG in an open-world, training-free setting. Specifically, two specialized agents, SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent), built on modern Multimodal Large Language Models (MLLMs), work collaboratively to retrieve the target tube in an autonomous and…
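For readers new to the task, the "spatio-temporal tube" mentioned in the abstract can be pictured as one bounding box per frame over a contiguous temporal span. The sketch below is only an illustration of that output format; the names `Box` and `Tube` and the field layout are assumptions made here, not taken from the paper, and it does not represent the ASTG agents themselves.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class Tube:
    """A spatio-temporal tube: one box per frame index (illustrative only)."""
    boxes: Dict[int, Box]

    def span(self) -> Tuple[int, int]:
        """Return the (first, last) frame indices covered by the tube."""
        frames = sorted(self.boxes)
        return frames[0], frames[-1]


# A toy tube covering frames 10 through 12 of a video.
tube = Tube({
    10: Box(5.0, 5.0, 50.0, 80.0),
    11: Box(6.0, 5.0, 51.0, 80.0),
    12: Box(7.0, 6.0, 52.0, 81.0),
})
print(tube.span())  # (10, 12)
```

An STVG system takes a video and a text query and returns such a tube: the temporal span answers "when", and the per-frame boxes answer "where".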
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
