SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA

Haibin He; Qihuang Zhong; Juhua Liu; Bo Du; Peng Wang; Jing Zhang

arXiv:2511.20190 · cs.CV · November 26, 2025

Open Access

TL;DR

This paper introduces SFA, a training-free, Video-LLM-based framework that adaptively scans, focuses on, and amplifies key video regions to improve accuracy in Video TextVQA tasks, achieving state-of-the-art results.

Contribution

SFA is the first training-free, guidance-aware Video-LLM framework for Video TextVQA that enhances answer accuracy by focusing on relevant video cues.

Findings

01 Achieves state-of-the-art results on multiple Video TextVQA datasets.

02 Outperforms previous methods by a substantial margin.

03 Demonstrates strong generalizability across datasets.

Abstract

The video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text that appears in them. The task is challenging: models must accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information so that answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning