Visual Set Program Synthesizer
Zehua Cheng, Wei Dai, Wenhu Zhang, Thomas Lukasiewicz, Jiahao Sun

TL;DR
This paper introduces a visual program synthesis approach for complex set-based reasoning in visual question answering, outperforming existing methods by generating explicit symbolic programs for transparent and accurate reasoning.
Contribution
It proposes a novel framework that treats visual reasoning as program synthesis, along with a new benchmark for evaluating set-based visual reasoning tasks.
Findings
Significantly outperforms state-of-the-art baselines on complex reasoning tasks.
Produces more systematic and transparent reasoning behavior.
Improves answer accuracy in visual question answering.
Abstract
A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer…
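To make the filter, compare, and aggregate pattern from the abstract concrete, here is a minimal Python sketch of a symbolic set program run by a separate executor. It is an illustration only, not the authors' implementation: `SceneObject`, `execute`, the operation names (`filter`, `argmin`, `count`), and the sugar values are all hypothetical, and the scene is assumed to have already been turned into structured objects by some upstream perception step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical structured record for one detected item on the shelf."""
    name: str
    category: str
    sugar_g: float  # made-up values, for illustration only


def execute(program, scene):
    """Run a list of (op, arg) steps over a set of scene objects.

    Assumed ops: "filter" keeps objects whose attribute equals a value,
    "argmin" keeps the object with the smallest attribute value,
    "count" returns the size of the current set.
    """
    current = list(scene)
    for op, arg in program:
        if op == "filter":
            attr, value = arg
            current = [o for o in current if getattr(o, attr) == value]
        elif op == "argmin":
            if not current:
                return []  # nothing left to compare
            current = [min(current, key=lambda o: getattr(o, arg))]
        elif op == "count":
            return len(current)
        else:
            raise ValueError(f"unknown operation: {op}")
    return [o.name for o in current]


# Hypothetical scene, standing in for the output of an object detector.
SCENE = [
    SceneObject("Cola Classic", "soda", 39.0),
    SceneObject("Fizz Zero", "soda", 0.0),
    SceneObject("Orange Pop", "soda", 41.0),
    SceneObject("Apple Juice", "juice", 24.0),
]

# "Which soda has the least sugar?" written as an explicit program.
LEAST_SUGAR_SODA = [("filter", ("category", "soda")), ("argmin", "sugar_g")]

# "How many sodas are there?" written as an explicit program.
COUNT_SODAS = [("filter", ("category", "soda")), ("count", None)]

print(execute(LEAST_SUGAR_SODA, SCENE))
print(execute(COUNT_SODAS, SCENE))
```

Because each step is an explicit operation on a set, the intermediate results can be inspected one step at a time, which is the kind of transparency the abstract attributes to program-based reasoning.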
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications
