Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models

Jinhwan Seo; Yoonki Cho; Junhyug Noh; Sung-eui Yoon

arXiv:2511.02182·cs.CV·November 5, 2025

TL;DR

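This technical report presents a three-stage framework for grounded video question answering. It introduces a trigger moment, the single most visible frame of the target object, as an anchor for spatio-temporal grounding and tracking, and reaches a HOTA score of 0.4968 against the previous year's winning score of 0.2704.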

Contribution

The paper introduces the trigger moment, derived from the proposed CORTEX prompt, as an anchor frame for grounding and tracking in a multimodal video reasoning pipeline.

Findings

01. Achieved a HOTA score of 0.4968, surpassing the previous year's winning score of 0.2704.

02. Decomposed GVQA into three stages: video reasoning and QA, spatio-temporal grounding, and tracking.

03. Introduced the trigger moment concept for robust object anchoring.
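
To put the two reported HOTA scores in perspective, the short calculation below works out the absolute and relative gain. It uses only the two numbers quoted above; the variable names are illustrative and do not come from the paper.

```python
# Scores as reported for the GVQA task (this work vs. previous year's winner).
hota_new = 0.4968
hota_prev = 0.2704

# Absolute gain in HOTA points and gain relative to the previous score.
absolute_gain = hota_new - hota_prev
relative_gain = absolute_gain / hota_prev

print(f"absolute gain: {absolute_gain:.4f}")  # absolute gain: 0.2264
print(f"relative gain: {relative_gain:.1%}")  # relative gain: 83.7%
```

In other words, the reported score is roughly 1.84 times the previous year's winning score.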

Abstract

In this technical report, we introduce a framework for the Grounded Video Question Answering (GVQA) task of the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
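
The role of the trigger moment as a tracking anchor can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything here is an assumption made for illustration: the per-frame `visibility` scores stand in for whatever the CORTEX prompt yields from the multimodal model, the `(x1, y1, x2, y2)` box format is hypothetical, and propagating the anchor box forward and then backward from the trigger frame is one plausible way to use a single anchor, not necessarily the authors' procedure. The `step` callable is a placeholder for a real tracker.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def pick_trigger_frame(visibility: Sequence[float]) -> int:
    """Return the index of the frame with the highest visibility score.

    Assumption: visibility is given as one number per frame. In the paper the
    trigger moment comes from a prompt to a multimodal model, not from scores.
    """
    if not visibility:
        raise ValueError("visibility must contain at least one frame score")
    return max(range(len(visibility)), key=lambda i: visibility[i])


def track_from_trigger(
    num_frames: int,
    trigger: int,
    anchor_box: Box,
    step: Callable[[Box, int], Box],
) -> Dict[int, Box]:
    """Propagate anchor_box from the trigger frame in both directions.

    step(previous_box, frame_index) is a placeholder for a real tracker update.
    """
    if not 0 <= trigger < num_frames:
        raise ValueError("trigger must be a valid frame index")
    boxes: Dict[int, Box] = {trigger: anchor_box}
    # Forward pass: frames after the trigger, each seeded by its predecessor.
    for t in range(trigger + 1, num_frames):
        boxes[t] = step(boxes[t - 1], t)
    # Backward pass: frames before the trigger, each seeded by its successor.
    for t in range(trigger - 1, -1, -1):
        boxes[t] = step(boxes[t + 1], t)
    return boxes


# Toy usage: the object is most visible in frame 2, and the "tracker" simply
# shifts the box one pixel to the right per step away from the anchor.
visibility = [0.1, 0.4, 0.9, 0.6]
trigger = pick_trigger_frame(visibility)
boxes = track_from_trigger(
    num_frames=len(visibility),
    trigger=trigger,
    anchor_box=(10.0, 20.0, 50.0, 80.0),
    step=lambda box, t: (box[0] + 1, box[1], box[2] + 1, box[3]),
)
print(trigger, sorted(boxes))
```

Seeding the tracker at the most visible frame, instead of at the first frame, means the initial box comes from the frame where grounding is least likely to fail, which is the intuition the abstract gives for calling the trigger moment a robust anchor.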

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling