VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

Haibin He; Maoyuan Ye; Jing Zhang; Juhua Liu; Bo Du

arXiv:2605.04870 · cs.CV · May 7, 2026

TL;DR

This paper introduces VTAgent, a question-guided framework that explicitly anchors keyframes to improve evidence localization and significantly enhance performance on Video TextVQA benchmarks.

Contribution

The paper proposes a novel agentic keyframe anchoring method that outperforms existing approaches and establishes new state-of-the-art results in Video TextVQA.

Findings

01

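Frame-wise question answering, scored as an upper bound where a sample counts as correct if any frame yields the right answer, significantly outperforms direct video-based inference.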

02

Explicit keyframe anchoring improves accuracy by 12.12 on average.

03

The approach is effective in training-free, supervised fine-tuning, and reinforcement learning settings.

Abstract

Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, counting a sample as correct if any frame yields the right answer, which significantly outperforms direct video-based inference and reveals a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively…
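The upper-bound analysis described in the abstract can be illustrated with a minimal Python sketch. It assumes that per-frame answers have already been produced by some model, and that correctness is judged by case- and whitespace-insensitive exact match. Both are assumptions made for illustration, since the abstract does not specify the scoring metric. The names `normalize`, `any_frame_correct`, and `upper_bound_accuracy` are hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule, not the paper's)."""
    return " ".join(text.lower().split())


def any_frame_correct(frame_answers: Sequence[str], gold: str) -> bool:
    """A sample is correct if at least one frame's answer matches the gold answer."""
    target = normalize(gold)
    return any(normalize(answer) == target for answer in frame_answers)


def upper_bound_accuracy(samples: Sequence[Tuple[Sequence[str], str]]) -> float:
    """Fraction of samples for which any single frame yields the right answer.

    Each sample is a pair (per-frame answers, gold answer).
    """
    if not samples:
        return 0.0
    hits = sum(any_frame_correct(answers, gold) for answers, gold in samples)
    return hits / len(samples)


if __name__ == "__main__":
    # Toy data: the first sample is solved by its second frame, the second by none.
    toy = [
        (["stop", "exit 12"], "Exit 12"),
        (["a", "b"], "c"),
    ]
    print(upper_bound_accuracy(toy))  # prints 0.5
```

Because a sample succeeds whenever any one frame is answered correctly, this score is an optimistic ceiling rather than a deployable method. The gap between it and direct video inference is what the abstract attributes to evidence localization.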
