Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models
Jinhwan Seo, Yoonki Cho, Junhyug Noh, Sung-eui Yoon

TL;DR
This paper presents a framework for grounded video question answering built around a trigger moment, the single most visible frame of a target object, which anchors spatio-temporal grounding and tracking and raises the HOTA score from the previous year's winning 0.2704 to 0.4968.
Contribution
The paper proposes the trigger moment, derived from the CORTEX prompt, as an anchor frame for grounding and tracking in multimodal video reasoning models.
Findings
Achieved a HOTA score of 0.4968 on the GVQA task, surpassing the previous year's winning score of 0.2704.
Decomposed GVQA into three stages: reasoning, grounding, and tracking.
Introduced trigger moment concept for robust object anchoring.
Abstract
In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
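The anchoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract says the trigger moment comes from the CORTEX prompt, whereas here a list of hypothetical per-frame visibility scores stands in for that step, and `step` is a hypothetical placeholder for whatever tracker carries a box from one frame to the next. The names `TriggerMoment`, `pick_trigger_frame`, and `track_from_anchor` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class TriggerMoment:
    """Anchor for grounding and tracking: one frame plus the target's box in it."""
    frame_index: int
    box: Box


def pick_trigger_frame(visibility: List[float]) -> int:
    """Return the index of the frame with the highest visibility score.

    Assumption: the scores are a stand-in for the CORTEX prompt output,
    which the abstract does not describe in detail.
    """
    if not visibility:
        raise ValueError("visibility must contain at least one frame score")
    return max(range(len(visibility)), key=lambda i: visibility[i])


def track_from_anchor(
    num_frames: int,
    anchor: TriggerMoment,
    step: Callable[[Box, int, int], Box],
) -> Dict[int, Box]:
    """Propagate the anchor box forward and backward through the video.

    `step(box, src, dst)` is a hypothetical tracker call that moves a box
    from frame `src` to the adjacent frame `dst`.
    """
    if not 0 <= anchor.frame_index < num_frames:
        raise ValueError("anchor frame lies outside the video")
    track: Dict[int, Box] = {anchor.frame_index: anchor.box}

    # Forward pass: from the trigger frame to the last frame.
    box = anchor.box
    for t in range(anchor.frame_index + 1, num_frames):
        box = step(box, t - 1, t)
        track[t] = box

    # Backward pass: from the trigger frame to the first frame.
    box = anchor.box
    for t in range(anchor.frame_index - 1, -1, -1):
        box = step(box, t + 1, t)
        track[t] = box

    return track
```

Starting both passes at the most visible frame means the tracker is initialized where the target is easiest to localize, rather than at frame 0, where the object may be occluded or absent.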
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
