A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks
Wentao Wan, Nan Kang, Zeqing Wang, Zhuojie Yang, Liang Lin, Keze Wang

TL;DR
This paper introduces SDVP, a stepwise distillation strategy that enhances non-differentiable visual programming frameworks for visual reasoning by leveraging task-specific models, improving performance while preserving cross-task generalization.
Contribution
The paper proposes a novel stepwise distillation learning strategy for non-differentiable visual programming frameworks, enabling performance improvements without sacrificing cross-task generalization.
Findings
Significant performance gains on GQA and NLVRv2 benchmarks.
Effective knowledge transfer from small task-specific models to large VLMs.
Maintains performance on unseen and previous VR tasks.
Abstract
Recently, Visual Programming (VProg) has emerged as a significant framework for visual reasoning (VR) tasks due to its interpretability and cross-task generality. However, even when invoking powerful pre-trained Vision-Language Models (VLMs) as visual sub-modules, VProg performs markedly worse on specific VR tasks than well-trained task-specific networks. Although invoking task-specific models can further enhance the performance of VProg on specific VR tasks, it greatly diminishes the cross-task generalization ability of VProg. Besides, the non-differentiable nature of VProg prevents direct fine-tuning on specific VR tasks for further performance improvement. To address these issues, we propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg across various VR tasks. Specifically, our SDVP stepwise distills the capabilities of…
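The abstract is cut off before it describes the training objective, so the following is only a rough illustration of what distilling a task-specific model into a VLM sub-module one program step at a time might look like. It assumes a generic temperature-scaled soft-label loss (KL divergence between teacher and student answer distributions), which is a standard distillation technique and not necessarily the one used in SDVP. The function names, the temperature value, and the per-step logit lists are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def stepwise_distillation_losses(teacher_steps, student_steps, temperature=2.0):
    """Compute one soft-label loss per program step.

    Assumption: each program step calls one visual sub-module, and for that
    step we have logits from a small task-specific teacher and from the
    VLM student over the same candidate answers.
    """
    losses = []
    for teacher_logits, student_logits in zip(teacher_steps, student_steps):
        p = softmax(teacher_logits, temperature)  # teacher distribution
        q = softmax(student_logits, temperature)  # student distribution
        losses.append(kl_divergence(p, q))
    return losses

# Hypothetical two-step program: the student agrees with the teacher on
# step 0 and disagrees on step 1.
teacher = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
losses = stepwise_distillation_losses(teacher, student)
```

Because the loss is computed per sub-module call, each step can be supervised on its own without backpropagating through the program executor, which is one plausible way to work around the non-differentiability the abstract mentions.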
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
