Visual Program Distillation with Template-Based Augmentation
Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem

TL;DR
This paper introduces a low-cost method for training small language models to generate specialized visual programs. It uses synthetic data augmentation that decouples programs into templates and their arguments, reducing annotation costs and inference time.
Contribution
It presents a novel template-based augmentation approach that enables small models to generate high-quality visual programs without human-generated program annotations.
Findings
Small language models (at most 1 billion parameters) can generate high-quality specialized visual programs.
Synthetic augmentation reduces annotation costs.
Small models provide much faster inference.
Abstract
Adapting visual programming, or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA), to specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.
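The decoupling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, purely for illustration, that every single-quoted string literal in a program is an argument and that everything else forms the template. The names `decouple`, `fill`, and `augment`, the `<ARGn>` slot syntax, and the `find`/`count` calls in the example program are all hypothetical.

```python
from __future__ import annotations

import re
from itertools import product

# Assumption: every single-quoted string literal is treated as an argument.
ARG_PATTERN = re.compile(r"'([^']*)'")


def decouple(program: str) -> tuple[str, list[str]]:
    """Split a program into a template with numbered slots and its arguments."""
    args: list[str] = []

    def to_slot(match: re.Match) -> str:
        # Record the literal and leave a numbered slot in its place.
        args.append(match.group(1))
        return f"'<ARG{len(args) - 1}>'"

    template = ARG_PATTERN.sub(to_slot, program)
    return template, args


def fill(template: str, args: list[str]) -> str:
    """Substitute arguments back into the numbered slots of a template."""
    program = template
    for i, arg in enumerate(args):
        program = program.replace(f"<ARG{i}>", arg)
    return program


def augment(template: str, candidates: list[list[str]]) -> list[str]:
    """Create synthetic programs from every combination of candidate arguments."""
    return [fill(template, list(combo)) for combo in product(*candidates)]


# Hypothetical visual program; `find` and `count` are made-up API names.
program = "count(find(image, 'dog')) > count(find(image, 'cat'))"
template, args = decouple(program)
synthetic = augment(template, [["horse", "bird"], ["car"]])
```

Here one program yields a reusable template plus two synthetic variants. In a real pipeline the candidate arguments would presumably be drawn from the question/answer data rather than hard-coded.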
Taxonomy
Topics: Open Education and E-Learning · Model-Driven Software Engineering Techniques
Methods: Sparse Evolutionary Training
