Visually Interpretable Subtask Reasoning for Visual Question Answering

Yu Cheng; Arushi Goel; Hakan Bilen

arXiv:2505.08084·cs.CV·May 14, 2025

TL;DR

VISTAR is a novel training framework that enhances interpretability and reasoning accuracy in multimodal large language models for visual question answering by generating structured subtask rationales.

Contribution

The paper introduces VISTAR, a subtask-driven fine-tuning approach that makes MLLMs produce interpretable step-by-step reasoning themselves, without relying on external models.

Findings

01. Improves reasoning accuracy on benchmark datasets

02. Maintains interpretability with structured rationales

03. Outperforms previous methods in efficiency and accuracy

Abstract

Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning…
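To make the idea of a step-by-step subtask sequence concrete, the example question from the abstract can be decomposed by hand into three steps: select furniture, filter by color, filter by use. The sketch below is only an illustration of that decomposition over a toy, hand-labeled scene. It is not VISTAR's implementation: in the paper the MLLM itself generates the rationale, whereas here the scene, the `SceneObject` type, the `FURNITURE` set, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical, hand-labeled object; a real system would detect these."""
    name: str
    attributes: frozenset = frozenset()
    affordances: frozenset = frozenset()


# Assumed category membership, for illustration only.
FURNITURE = {"chair", "sofa", "table", "shelf"}

# Toy scene, invented for this example.
SCENE = [
    SceneObject("chair", frozenset({"red"}), frozenset({"sitting"})),
    SceneObject("table", frozenset({"red"}), frozenset({"placing"})),
    SceneObject("sofa", frozenset({"blue"}), frozenset({"sitting"})),
    SceneObject("apple", frozenset({"red"}), frozenset({"eating"})),
]


def select_category(objects, members):
    """Subtask 1: keep objects whose name belongs to the category."""
    return [o for o in objects if o.name in members]


def filter_attribute(objects, attribute):
    """Subtask 2: keep objects that carry the given attribute."""
    return [o for o in objects if attribute in o.attributes]


def filter_affordance(objects, affordance):
    """Subtask 3: keep objects that support the given use."""
    return [o for o in objects if affordance in o.affordances]


def answer_red_sitting_furniture(scene):
    """Run the three subtasks in order and record each intermediate result."""
    trace = []
    step = select_category(scene, FURNITURE)
    trace.append(("select furniture", [o.name for o in step]))
    step = filter_attribute(step, "red")
    trace.append(("filter color = red", [o.name for o in step]))
    step = filter_affordance(step, "sitting")
    trace.append(("filter use = sitting", [o.name for o in step]))
    return [o.name for o in step], trace


if __name__ == "__main__":
    answer, trace = answer_red_sitting_furniture(SCENE)
    for description, names in trace:
        print(f"{description}: {names}")
    print("answer:", answer)
```

The recorded trace plays the role of a rationale: each entry names a subtask and the objects that survive it, so a reader can see at which step a wrong answer would have been introduced.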
