Visually Interpretable Subtask Reasoning for Visual Question Answering
Yu Cheng, Arushi Goel, Hakan Bilen

TL;DR
VISTAR is a novel training framework that enhances interpretability and reasoning accuracy in multimodal large language models for visual question answering by generating structured subtask rationales.
Contribution
The paper introduces VISTAR, a subtask-driven fine-tuning approach that trains MLLMs to produce interpretable step-by-step reasoning themselves, without relying on external models.
Findings
Improves reasoning accuracy on two benchmarks
Maintains interpretability through structured Subtask-of-Thought rationales
Outperforms previous sub-task program methods in efficiency and accuracy
Abstract
Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning…
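To make the idea of a step-by-step subtask rationale concrete, here is a minimal Python sketch of how the example question "Which red furniture can be used for sitting?" might decompose into ordered subtasks. This is an illustration only, not VISTAR's actual rationale format or training procedure: the `SceneObject`, `Step`, and `run_rationale` names, the hand-written scene, and the bounding-box values are all hypothetical, and a symbolic object list stands in for what an MLLM would perceive from the image.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    # Hypothetical symbolic description of one detected object.
    name: str
    category: str
    color: str
    affordances: set = field(default_factory=set)
    box: tuple = (0, 0, 0, 0)  # (x1, y1, x2, y2), made-up coordinates


@dataclass
class Step:
    subtask: str  # human-readable description of the subtask
    kept: list    # names of the objects that survive this subtask


def run_rationale(objects, steps):
    """Apply (description, predicate) filters in order, logging each step."""
    trace = []
    current = list(objects)
    for description, predicate in steps:
        current = [o for o in current if predicate(o)]
        trace.append(Step(description, [o.name for o in current]))
    return current, trace


# Toy scene (assumed data, not taken from any benchmark).
scene = [
    SceneObject("red chair", "furniture", "red", {"sit"}, (10, 40, 60, 120)),
    SceneObject("red table", "furniture", "red", set(), (70, 50, 160, 110)),
    SceneObject("blue sofa", "furniture", "blue", {"sit"}, (170, 30, 300, 120)),
    SceneObject("red apple", "food", "red", set(), (90, 35, 100, 45)),
]

# Object recognition, then attribute filtering, then an affordance check.
steps = [
    ("select furniture", lambda o: o.category == "furniture"),
    ("filter by color: red", lambda o: o.color == "red"),
    ("filter by affordance: sitting", lambda o: "sit" in o.affordances),
]

answer, trace = run_rationale(scene, steps)
for i, step in enumerate(trace, 1):
    print(f"{i}. {step.subtask} -> {step.kept}")
print("answer:", [o.name for o in answer])
```

Each entry in `trace` records what one subtask kept, which is the kind of intermediate evidence that makes a final answer inspectable. In the toy scene the boxes of the surviving objects could serve as the visual side of the explanation.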
