A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks

Wentao Wan; Nan Kang; Zeqing Wang; Zhuojie Yang; Liang Lin; Keze Wang

arXiv:2309.09809·cs.CV·February 25, 2025

TL;DR

This paper introduces SDVP, a stepwise distillation strategy that enhances non-differentiable visual programming frameworks for visual reasoning by leveraging task-specific models, improving performance while preserving cross-task generalization.

Contribution

The paper proposes a novel stepwise distillation learning strategy for non-differentiable visual programming frameworks, enabling performance improvements without sacrificing cross-task generalization.

Findings

01. Significant performance gains on GQA and NLVRv2 benchmarks.
02. Effective knowledge transfer from small task-specific models to large VLMs.
03. Maintains performance on unseen and previous VR tasks.

Abstract

Recently, Visual Programming (VProg) has emerged as a significant framework for visual reasoning (VR) tasks due to its interpretability and cross-task generality. However, even when invoking powerful pre-trained Vision-Language Models (VLMs) as visual sub-modules, VProg performs markedly worse on specific VR tasks than well-trained task-specific networks. Although invoking task-specific models can further enhance the performance of VProg on specific VR tasks, it greatly diminishes VProg's cross-task generalization ability. Moreover, the non-differentiable nature of VProg prevents direct fine-tuning on specific VR tasks for further performance improvement. To address these issues, we propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg across various VR tasks. Specifically, our SDVP stepwise distills the capabilities of…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques