Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić, Sergi Caelles, Michael Tschannen

TL;DR
This paper introduces a framework that enables large language models to perform zero-shot compositional visual reasoning by automatically generating in-context examples, reducing human effort and improving robustness across various visual tasks.
Contribution
The work presents a framework that combines abstract routines with automatic in-context example generation, improving LLMs' zero-shot visual reasoning without relying on human-engineered in-context examples.
Findings
Consistent performance improvements across multiple visual reasoning tasks.
Reduced reliance on human-crafted in-context examples.
Enhanced robustness of LLMs as visual reasoning controllers.
Abstract
Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that…
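The "LLM as controller" idea in the abstract, where a task is decomposed into subtasks that are solved by calling tools, can be illustrated with a toy sketch. This is not the paper's framework or API. The names `Box`, `find`, `left_of`, and `program` are hypothetical, the "image" is replaced by a hand-written list of detections so the example runs without any vision model, and the body of `program` stands in for the kind of short program an LLM might generate for the question "How many mugs are to the left of the laptop?".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """A detected object (hypothetical, simplified to a label and position)."""
    label: str
    x: float  # horizontal center, 0.0 = left edge, 1.0 = right edge


def find(scene: List[Box], label: str) -> List[Box]:
    """Stand-in for an object-detection tool; a real system would call a vision model."""
    return [box for box in scene if box.label == label]


def left_of(a: Box, b: Box) -> bool:
    """Stand-in for a spatial-relation tool."""
    return a.x < b.x


def program(scene: List[Box]) -> int:
    """Illustrative program of the kind an LLM controller might emit for
    'How many mugs are to the left of the laptop?' (assumed, not from the paper)."""
    laptops = find(scene, "laptop")
    if not laptops:
        return 0
    laptop = laptops[0]
    # Compose the two tools: detect mugs, then filter by spatial relation and count.
    return sum(1 for mug in find(scene, "mug") if left_of(mug, laptop))


# Hand-written detections standing in for an image.
scene = [Box("mug", 0.1), Box("mug", 0.3), Box("laptop", 0.5), Box("mug", 0.8)]
answer = program(scene)
print(answer)
```

The point of the sketch is the division of labor: counting and spatial comparison happen in ordinary code, while perception is delegated to tools, which is how this style of approach aims to sidestep the counting and spatial-reasoning weaknesses the abstract attributes to end-to-end models.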
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Sparse Evolutionary Training
