VoQA: Visual-only Question Answering
Jianing An, Luyang Jiang, Jie Luo, Wenjun Wu, Lei Huang

TL;DR
VoQA introduces a new visual-only question answering task where models interpret questions embedded in images, emphasizing the need for vision-based reasoning without textual input, and proposes fine-tuning strategies to improve performance.
Contribution
This paper introduces VoQA, a visual-only question answering task and benchmark, and develops question-alignment fine-tuning methods to improve model reasoning.
Findings
Models perform worse on VoQA than on traditional VQA.
Fine-tuning improves vision-only reasoning accuracy.
VoQA training enhances cross-task generalization.
Abstract
Visual understanding requires interpreting both natural scenes and the textual information that appears within them, motivating tasks such as Visual Question Answering (VQA). However, current VQA benchmarks overlook scenarios with visually embedded questions, whereas advanced agents should be able to see the question without separate text input, as humans do. We introduce Visual-only Question Answering (VoQA), where both the scene and the question appear within a single image, requiring models to perceive and reason purely through vision. This setting supports more realistic visual understanding and interaction in scenarios where questions or instructions are embedded directly in the visual scene. Evaluations under pure visual-only zero-shot, prompt-guided, and OCR-assisted settings show that current models exhibit a clear performance drop compared to traditional VQA. To address this, we…
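The three evaluation settings named in the abstract can be pictured as three ways of pairing the same question-bearing image with (or without) accompanying text. The sketch below is illustrative only and is not the authors' evaluation code: the setting names, the `GUIDE_PROMPT` wording, the `ModelInput` container, and the exact-match `accuracy` metric are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical instruction text; the prompt wording used in the paper is not given here.
GUIDE_PROMPT = "The question is written inside the image. Read it and answer it."


@dataclass
class ModelInput:
    image_path: str        # image that contains both the scene and the rendered question
    text: Optional[str]    # accompanying text, or None when the model gets no text at all


def build_input(setting: str, image_path: str,
                ocr_question: Optional[str] = None) -> ModelInput:
    """Pair a VoQA image with the text (if any) implied by an evaluation setting."""
    if setting == "visual_only":
        # Assumed: the model receives the image and nothing else.
        return ModelInput(image_path, None)
    if setting == "prompt_guided":
        # Assumed: a generic hint that the question lives in the image.
        return ModelInput(image_path, GUIDE_PROMPT)
    if setting == "ocr_assisted":
        # Assumed: question text recovered by an external OCR step is passed along.
        if ocr_question is None:
            raise ValueError("ocr_assisted requires the OCR-extracted question")
        return ModelInput(image_path, ocr_question)
    raise ValueError(f"unknown setting: {setting!r}")


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Case-insensitive exact-match accuracy (an assumed metric for this sketch)."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Running the same model over inputs built for each setting, and comparing the resulting accuracies with a conventional VQA run where the question is supplied as text, is one way to measure the performance drop the abstract describes.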
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
