Visual Program Distillation with Template-Based Augmentation

Michal Shlapentokh-Rothman; Yu-Xiong Wang; Derek Hoiem

arXiv:2412.08564 · cs.CV · November 5, 2025


TL;DR

This paper introduces a low-cost method for training small visual language models to generate specialized visual programs by using synthetic data augmentation with template-based decoupling, reducing annotation costs and inference time.

Contribution

It presents a novel template-based augmentation approach that enables small models to generate high-quality visual programs without human-written program annotations.
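To make "decoupling programs into templates and arguments" concrete, here is a minimal sketch of template-based augmentation. It assumes a template is a question pattern paired with a program pattern that share named slots, and that new training pairs come from enumerating slot values. The `Template` class, the `augment` helper, the slot names, and the `find(...)` call inside the program string are all hypothetical illustrations, not the paper's actual format or API.

```python
from dataclasses import dataclass
from itertools import product


# Hypothetical representation: a template is a higher-level skill with named
# slots; the arguments fill those slots. Not the paper's actual format.
@dataclass(frozen=True)
class Template:
    name: str
    question_pattern: str  # question text with {slot} placeholders
    program_pattern: str   # program text with the same {slot} placeholders

    def instantiate(self, **args: str) -> tuple[str, str]:
        """Fill every slot to produce one (question, program) pair."""
        return (self.question_pattern.format(**args),
                self.program_pattern.format(**args))


def augment(template: Template, slot_values: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Create synthetic (question, program) pairs from all argument combinations."""
    slots = sorted(slot_values)
    pairs = []
    for combo in product(*(slot_values[s] for s in slots)):
        pairs.append(template.instantiate(**dict(zip(slots, combo))))
    return pairs


# 'find' appears only inside the program string; it stands in for a
# hypothetical vision API that the generated program would call.
count_template = Template(
    name="count_objects",
    question_pattern="How many {obj} are in the image?",
    program_pattern="answer = len(find(image, '{obj}'))",
)

synthetic = augment(count_template, {"obj": ["dogs", "cars", "chairs"]})
for question, program in synthetic:
    print(question, "->", program)
```

Because the template fixes the program structure, one template with N argument values yields N training pairs. A small model then mainly has to learn which template to pick and which arguments to fill in, rather than writing free-form code.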

Findings

01

Small models achieve high-quality program generation.

02

Synthetic augmentation reduces annotation costs.

03

Faster inference with small models.

Abstract

Adapting visual programming or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA) for specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.
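The abstract says the method needs question/answer data but no human-written programs. One common way to get program supervision from answers alone is to execute candidate programs and keep only those whose output matches the ground-truth answer. This page does not say whether or how the paper does this, so the sketch below is an assumption. The `run_program` and `filter_by_answer` helpers and the toy "image" (a dict of object counts) are hypothetical.

```python
# Hypothetical stand-in for an image: detected object label -> count.
Image = dict[str, int]


def run_program(source: str, image: Image) -> str:
    """Execute a candidate program that assigns to `answer`; return it as text."""
    namespace = {"image": image}
    exec(source, namespace)  # exec is unsafe for untrusted code; sketch only
    return str(namespace["answer"])


def filter_by_answer(candidates: list[str], image: Image, answer: str) -> list[str]:
    """Keep candidate programs whose executed output matches the reference answer."""
    kept = []
    for source in candidates:
        try:
            output = run_program(source, image)
        except Exception:
            continue  # programs that crash are discarded
        if output.strip().lower() == answer.strip().lower():
            kept.append(source)
    return kept


image = {"dog": 2, "cat": 1}
candidates = [
    "answer = image.get('dog', 0)",  # correct for "How many dogs?"
    "answer = image.get('cat', 0)",  # runs, but gives the wrong answer
    "answer = image['horse']",       # raises KeyError, so it is dropped
]
kept = filter_by_answer(candidates, image, "2")
print(kept)
```

Under this assumption, the surviving programs become (question, program) training targets without anyone writing a program by hand. The answer labels alone decide which candidates are kept.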


