SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA
Haibin He, Qihuang Zhong, Juhua Liu, Bo Du, Peng Wang, Jing Zhang

TL;DR
This paper introduces SFA, a training-free, Video-LLM-based framework that adaptively scans, focuses on, and amplifies key video regions to improve accuracy in Video TextVQA tasks, achieving state-of-the-art results.
Contribution
SFA is the first training-free, guidance-aware Video-LLM framework for Video TextVQA that enhances answer accuracy by focusing on relevant video cues.
Findings
Achieves state-of-the-art results on multiple Video TextVQA datasets.
Outperforms previous methods by a substantial margin.
Demonstrates strong generalizability across datasets.
Abstract
The Video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information so that answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
