Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

Bowen Zhao; Leo Parker Dirac; Paulina Varshavskaya

arXiv:2409.17080·cs.CV·September 26, 2024

TL;DR

This paper introduces SVAT, a benchmark that tests whether vision-language models can learn new visuospatial concepts from visual demonstrations given in context. Current models fail zero-shot and sometimes still fail after finetuning, but curriculum learning with simpler data improves their in-context performance.

Contribution

The paper proposes a new benchmark, Spatial Visual Ambiguity Tasks (SVAT), to evaluate VLMs' ability to learn from visual demonstrations, and investigates finetuning and curriculum learning as ways to improve their in-context learning.

Findings

1. VLMs fail to learn new visuospatial tasks in-context zero-shot.

2. Finetuning alone does not reliably fix this: models sometimes continue to fail after finetuning.

3. Curriculum learning that adds simpler data to training improves ICL performance.

Abstract

Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of ICL examples? We propose a new benchmark we call Spatial Visual Ambiguity Tasks (SVAT) that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail to do this zero-shot, and sometimes continue to fail after finetuning. However, adding simpler data to the training by curriculum learning leads to improved ICL performance.
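To make the two ideas in the abstract concrete, here is a minimal sketch of (a) an in-context prompt built from visual demonstrations and (b) an easy-to-hard curriculum split. It is an illustration under stated assumptions, not the authors' implementation: the `Example` fields, the `difficulty` score, the message dictionary layout, and both function names are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Example:
    image_path: str  # hypothetical path to a demonstration or query image
    label: str       # hypothetical answer string, e.g. "yes" or "no"
    difficulty: int  # hypothetical difficulty score; lower means simpler


def build_icl_prompt(demos: List[Example], query: Example) -> List[Dict[str, str]]:
    """Interleave each demonstration image with its label, then append the
    unlabeled query image so the model has to infer the task from the demos."""
    parts: List[Dict[str, str]] = []
    for demo in demos:
        parts.append({"type": "image", "path": demo.image_path})
        parts.append({"type": "text", "text": f"Answer: {demo.label}"})
    parts.append({"type": "image", "path": query.image_path})
    parts.append({"type": "text", "text": "Answer:"})
    return parts


def curriculum_stages(examples: List[Example], n_stages: int) -> List[List[Example]]:
    """Sort examples from simple to hard and split them into at most
    n_stages consecutive chunks, to be used as training stages in order."""
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    ordered = sorted(examples, key=lambda e: e.difficulty)
    if not ordered:
        return []
    size = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```

In this sketch, training would iterate over the stages returned by `curriculum_stages` in order, so the model sees the simpler data first; how the paper actually defines and schedules "simpler data" is not specified in the abstract.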

Code & Models

Repositories

groundlight/vlm-visual-demonstrations (Official)

Taxonomy

Topics: Constraint Satisfaction and Optimization