Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis
Aishik Nagar, Shantanu Jaiswal, Cheston Tan

TL;DR
This paper benchmarks zero-shot visual reasoning in vision-language models on synthetic datasets. It finds that conveying scene information as text yields higher accuracy than visual embeddings, and that chain-of-thought prompting helps only large models. Both results point to limitations of current models on complex visual reasoning.
Contribution
It introduces synthetic datasets for disentangling visual reasoning from world knowledge and systematically compares prompting methods and input modalities in VLMs.
Findings
Textual scene descriptions outperform visual embeddings in reasoning accuracy.
Chain-of-thought prompting benefits large models but not smaller ones.
Limitations remain in VLMs and LLMs for complex visual reasoning.
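To make the two comparisons above concrete, here is a minimal sketch of how a synthetic scene might be conveyed as text and paired with an optional chain-of-thought cue. The function names, the attribute keys (`size`, `color`, `shape`), and the prompt layout are hypothetical and chosen only for illustration. The "Let's think step by step." suffix is the commonly used zero-shot chain-of-thought trigger, not necessarily the wording used in the paper.

```python
from typing import Dict, List


def describe_scene(objects: List[Dict[str, str]]) -> str:
    """Render object attribute dicts as a plain-text scene description.

    The attribute keys are assumptions made for this sketch.
    """
    lines = []
    for i, obj in enumerate(objects, start=1):
        lines.append(f"Object {i}: a {obj['size']} {obj['color']} {obj['shape']}")
    return "\n".join(lines)


def build_prompt(scene_text: str, question: str, chain_of_thought: bool = False) -> str:
    """Assemble a zero-shot prompt, optionally ending with a chain-of-thought cue."""
    # Assumed cue: the generic zero-shot chain-of-thought trigger phrase.
    suffix = "Let's think step by step." if chain_of_thought else "Answer:"
    return f"Scene:\n{scene_text}\n\nQuestion: {question}\n{suffix}"


scene = [
    {"size": "large", "color": "red", "shape": "cube"},
    {"size": "small", "color": "blue", "shape": "sphere"},
]
scene_text = describe_scene(scene)
direct_prompt = build_prompt(scene_text, "How many cubes are there?")
cot_prompt = build_prompt(scene_text, "How many cubes are there?", chain_of_thought=True)
```

In a comparison along these lines, the same question would be posed once with the textual scene description and once with the image passed through a visual encoder, so that accuracy differences can be attributed to the input modality rather than to the question.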
Abstract
Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering (VQA) benchmarks, alluding to their capabilities as visual reasoning engines. However, the benchmarks being used conflate "pure" visual reasoning with world knowledge, and also have questions that involve a limited number of reasoning steps. Thus, it remains unclear whether a VLM's apparent visual reasoning performance is due to its world knowledge, or due to actual visual reasoning capabilities. To clarify this ambiguity, we systematically benchmark and dissect the zero-shot visual reasoning capabilities of VLMs through synthetic datasets that require minimal world knowledge, and allow for analysis over a broad range of reasoning steps. We focus on two novel aspects of zero-shot visual reasoning: i) evaluating the impact of conveying scene information as either…
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Cosine Annealing · Residual Connection · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Softmax · Chain-of-thought prompting · Linear Layer
