LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
Nikolaos Gkalelis, Vasileios Mezaris

TL;DR
This paper introduces a bottom-up cascaded knowledge distillation framework for vision-language models, improving efficiency and performance by gradually transferring knowledge through intermediate-capacity teachers.
Contribution
The paper proposes a cascaded distillation approach that routes knowledge through intermediate-capacity teachers, with the aim of improving knowledge transfer and model performance on vision-language tasks.
Findings
Achieves state-of-the-art results on seven VQA benchmarks.
Demonstrates improved knowledge transfer with cascaded distillation.
Provides theoretical analysis of generalization performance.
Abstract
Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, the larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving…
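For orientation, the Teacher-to-Student transfer described in the abstract can be illustrated with the classic temperature-softened distillation loss. This is a minimal sketch of that generic formulation, not the CKD objective of this paper. The function names `softmax` and `kd_loss`, the default temperature, and the example logits are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic distillation loss: KL(teacher || student) on softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures. This is an assumed, textbook-style loss,
    not the loss used in the paper.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )
    return (temperature ** 2) * kl


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]   # hypothetical Teacher logits for one token
    student = [2.0, 1.5, 0.2]   # hypothetical Student logits for the same token
    print(kd_loss(student, teacher))
    print(kd_loss(teacher, teacher))
```

When the Student's logits match the Teacher's, the loss is zero; the further the softened distributions diverge, the larger it becomes.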
