Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Aleksandar Stanić; Sergi Caelles; Michael Tschannen

arXiv:2401.01974 · cs.CV · May 16, 2024 · 1 citation

TL;DR

This paper introduces a framework that enables large language models to perform zero-shot compositional visual reasoning by automatically generating in-context examples, reducing human effort and improving robustness across various visual tasks.

Contribution

The work presents a novel framework that leverages abstract routines and automatic in-context example generation, enhancing LLMs' zero-shot visual reasoning capabilities without human-engineered prompts.
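
The summary above does not say how the in-context examples are generated. As a minimal sketch, assuming one plausible scheme (not confirmed by this page), candidate programs could be kept as in-context examples only when their output agrees with a handful of labeled examples. Everything named here (`CANDIDATES`, `LABELED`, `select_examples`) is hypothetical, and the candidate programs are hand-written stand-ins for LLM samples.

```python
# Hypothetical candidate programs, standing in for samples from an LLM.
CANDIDATES = {
    "count_all": "def answer(items, target):\n    return len(items)\n",
    "count_target": (
        "def answer(items, target):\n"
        "    return sum(1 for i in items if i == target)\n"
    ),
}

# A few labeled examples: ((items, target), expected_answer).
LABELED = [
    ((["cup", "cup", "pen"], "cup"), 2),
    ((["pen"], "cup"), 0),
]


def compile_program(source):
    """Execute the program text and return its `answer` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["answer"]


def select_examples(candidates, labeled):
    """Keep the names of candidates that reproduce every labeled answer."""
    kept = []
    for name, source in candidates.items():
        program = compile_program(source)
        try:
            ok = all(program(*args) == expected for args, expected in labeled)
        except Exception:
            # A crashing candidate is treated as a failed candidate.
            ok = False
        if ok:
            kept.append(name)
    return kept


selected = select_examples(CANDIDATES, LABELED)
```

In this toy setup only `count_target` survives the filter, so it would be the program reused as an in-context example in later prompts.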

Findings

01 Consistent performance improvements across multiple visual reasoning tasks.

02 Reduced reliance on human-crafted in-context examples.

03 Enhanced robustness of LLMs as visual reasoning controllers.

Abstract

Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that…
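
The controller idea described in the abstract, where a question is decomposed into subtasks that are solved by calling tools, can be illustrated with a toy sketch. This is not the paper's API: the tool names (`find`, `count`, `left_of`), the `Box` type, and the hand-written `GENERATED_PROGRAM` are all assumptions made for illustration, and a list of pre-labeled boxes stands in for a real image and real vision models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    x: float  # horizontal centre: 0 is the left edge, 1 is the right edge


def find(scene, label):
    """Toy 'detector' tool: return every box whose label matches."""
    return [b for b in scene if b.label == label]


def count(boxes):
    """Toy counting tool."""
    return len(boxes)


def left_of(a, b):
    """Toy spatial-relation tool."""
    return a.x < b.x


# The tool set exposed to the generated program (hypothetical).
TOOLS = {"find": find, "count": count, "left_of": left_of}

# A program of the kind an LLM controller might emit for the question
# "How many cups are to the left of the laptop?" (hand-written here).
GENERATED_PROGRAM = '''
def answer(scene):
    laptop = find(scene, "laptop")[0]
    cups = find(scene, "cup")
    return count([c for c in cups if left_of(c, laptop)])
'''


def run_program(source, scene):
    """Execute the program text with only the tools in scope."""
    namespace = dict(TOOLS)
    exec(source, namespace)
    return namespace["answer"](scene)


scene = [Box("cup", 0.1), Box("cup", 0.3), Box("laptop", 0.5), Box("cup", 0.8)]
result = run_program(GENERATED_PROGRAM, scene)
```

Here the counting and spatial comparison happen in ordinary code, so `result` is 2 for this scene. A real system would replace the toy tools with vision models and would need to sandbox the `exec` call.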

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Sparse Evolutionary Training