Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Yushi Hu; Otilia Stretcu; Chun-Ta Lu; Krishnamurthy Viswanathan; Kenji Hata; Enming Luo; Ranjay Krishna; Ariel Fuxman

arXiv:2312.03052·cs.CV·April 8, 2024·1 cites

Open Access

TL;DR

This paper introduces Visual Program Distillation (VPD), a framework that distills the reasoning ability of large language models into a vision-language model, so that the model can solve complex visual reasoning tasks in a single forward pass.

Contribution

VPD is an instruction-tuning framework that produces a vision-language model able to solve complex visual tasks in a single forward pass by distilling the reasoning ability of large language models.

Findings

01. VPD improves reasoning abilities such as counting and spatial understanding.

02. VPD-trained PaLI-X achieves state-of-the-art results on multiple benchmarks.

03. VPD enhances model factuality, consistency, and adaptation to real-world data.

Abstract

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple…
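To make the idea of a "visual program" concrete, here is a minimal sketch of the kind of executable program an LLM might produce for the example question in the abstract. It is an illustration only, not the authors' implementation: `detect_objects`, `knowledge_lookup`, the `Box` type, and the canned outputs are all hypothetical stand-ins for specialized vision models and a knowledge source.

```python
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x_center: float  # horizontal centre in [0, 1]; larger means further right


def detect_objects(image):
    """Hypothetical stub detector: returns canned boxes, runs no real model."""
    return [Box("guitar", 0.2), Box("saxophone", 0.8)]


def knowledge_lookup(question):
    """Hypothetical stub knowledge tool backed by a tiny hard-coded table."""
    facts = {"Who invented the saxophone?": "Adolphe Sax"}
    return facts.get(question, "unknown")


def visual_program(image):
    """Compose the skills named in the abstract: space, recognition, knowledge."""
    boxes = detect_objects(image)                      # recognize instruments
    rightmost = max(boxes, key=lambda b: b.x_center)   # understand "on the right"
    return knowledge_lookup(f"Who invented the {rightmost.label}?")  # retrieve knowledge


answer = visual_program(image=None)  # the stub detector ignores the image
print(answer)  # Adolphe Sax
```

The sketch also shows the fragility the abstract describes: if the detector stub returned a wrong label, the program would pass it straight to the knowledge lookup with no way to recover, and each step would need its own model loaded at inference time.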

Taxonomy

Topics: Multimodal Machine Learning Applications · Software Engineering Research · Domain Adaptation and Few-Shot Learning