LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

Nikolaos Gkalelis; Vasileios Mezaris

arXiv:2605.10641 · cs.CV · May 12, 2026


TL;DR

This paper introduces a bottom-up cascaded knowledge distillation framework for vision-language models, improving efficiency and performance by gradually transferring knowledge through intermediate-capacity teachers.

Contribution

It proposes a novel cascaded distillation approach with intermediate teachers, enhancing knowledge transfer and model performance in vision-language tasks.

Findings

1. Achieves state-of-the-art results on seven VQA benchmarks.
2. Demonstrates improved knowledge transfer with cascaded distillation.
3. Provides theoretical analysis of generalization performance.

Abstract

Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, the larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving…
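To make the Teacher-to-Student transfer described in the abstract concrete, here is a minimal sketch of the classic temperature-scaled distillation loss (the standard soft-target formulation, not the paper's CKD objective, which the truncated abstract does not spell out). The function names `log_softmax` and `kd_loss`, the default temperature, and the toy logits are all illustrative assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic soft-target distillation loss; assumed here for illustration
    only, not taken from the LLaVA-CKD paper.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature ** 2 * kl


# Hypothetical logits over three answer candidates.
teacher = [1.0, 2.0, 3.0]
student = [0.5, 0.5, 2.0]
loss = kd_loss(student, teacher)
```

The loss is zero when the Student reproduces the Teacher's softened distribution and grows as the two diverge. A cascaded scheme with intermediate-capacity teachers would presumably apply a loss of this kind at each stage, but how the stages are chained is not given in the excerpt above.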
