VoQA: Visual-only Question Answering

Jianing An; Luyang Jiang; Jie Luo; Wenjun Wu; Lei Huang

arXiv:2505.14227 · cs.CV · December 2, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

VoQA introduces a new visual-only question answering task where models interpret questions embedded in images, emphasizing the need for vision-based reasoning without textual input, and proposes fine-tuning strategies to improve performance.

Contribution

This paper presents the VoQA task, a novel benchmark for vision-only question answering, and develops question-alignment fine-tuning methods to enhance model reasoning capabilities.

Findings

01 Models perform worse on VoQA than on traditional VQA.

02 Fine-tuning improves vision-only reasoning accuracy.

03 VoQA training enhances cross-task generalization.

Abstract

Visual understanding requires interpreting both natural scenes and the textual information that appears within them, motivating tasks such as Visual Question Answering (VQA). However, current VQA benchmarks overlook scenarios with visually embedded questions, whereas advanced agents should be able to see the question without separate text input, as humans do. We introduce Visual-only Question Answering (VoQA), where both the scene and the question appear within a single image, requiring models to perceive and reason purely through vision. This setting supports more realistic visual understanding and interaction in scenarios where questions or instructions are embedded directly in the visual scene. Evaluations under pure visual-only zero-shot, prompt-guided, and OCR-assisted settings show that current models exhibit a clear performance drop compared to traditional VQA. To address this, we…
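The comparison across evaluation settings described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the exact-match metric, the setting names, the helper functions, and the toy predictions are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List


def exact_match_accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of predictions equal to the reference answer.

    Comparison ignores case and surrounding whitespace. Exact match is an
    assumed metric here, not necessarily the one used by the authors.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return hits / len(answers)


def drop_vs_baseline(
    per_setting: Dict[str, float], baseline: str = "traditional_vqa"
) -> Dict[str, float]:
    """Accuracy lost in each setting relative to the baseline setting."""
    base = per_setting[baseline]
    return {
        name: base - acc
        for name, acc in per_setting.items()
        if name != baseline
    }


# Toy data, invented purely to exercise the functions above.
# These are NOT results from the paper.
answers = ["red", "two", "yes", "cat"]
toy_predictions = {
    "traditional_vqa": ["red", "two", "yes", "cat"],
    "voqa_zero_shot": ["red", "three", "no", "dog"],
    "voqa_prompt_guided": ["red", "two", "no", "dog"],
    "voqa_ocr_assisted": ["Red", "two", "yes", "dog"],
}

accuracy = {
    name: exact_match_accuracy(preds, answers)
    for name, preds in toy_predictions.items()
}
drops = drop_vs_baseline(accuracy)

for name, drop in drops.items():
    print(f"{name}: accuracy={accuracy[name]:.2f}, drop={drop:.2f}")
```

Reporting the drop relative to the text-input baseline, rather than raw accuracy alone, keeps the cost of moving the question into the image visible for each setting.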

Code & Models

Repositories

luyangj/voqa
PyTorch · Official

Datasets

AJN-AI/VoQA
dataset · 478 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques