Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization

Seongjae Kang; Dong Bok Lee; Hyungjoon Jang; Sung Ju Hwang

arXiv:2505.07675·cs.LG·October 1, 2025



TL;DR

This paper introduces Dual-Head Optimization (DHO), a simple yet effective semi-supervised knowledge distillation method that leverages vision-language models' capabilities, resolving gradient conflicts and improving performance across multiple datasets.

Contribution

The paper proposes DHO, a novel dual-head approach that enhances knowledge distillation from vision-language models, enabling better feature learning with minimal extra computation.

Findings

1. DHO outperforms traditional KD baselines on 15 datasets.
2. Smaller student models trained with DHO often surpass their teacher models.
3. DHO achieves state-of-the-art results on ImageNet semi-supervised and out-of-distribution tasks.

Abstract

Semi-supervised learning (SSL) has emerged as a practical solution for addressing data scarcity challenges by leveraging unlabeled data. Recently, vision-language models (VLMs), pre-trained on massive image-text pairs, have demonstrated remarkable zero-/few-shot performance that often surpasses SSL approaches due to their exceptional generalization capabilities. This gap motivates us to question: how can we effectively harness the powerful generalization capabilities of VLMs into task-specific models? Knowledge distillation (KD) offers a natural framework for transferring VLM capabilities, but we identify that it suffers from gradient conflicts between supervised and distillation losses. To address this challenge, we propose Dual-Head Optimization (DHO), which introduces dual prediction heads for each distinct signal. We observe that DHO resolves gradient conflicts, enabling improved…
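The two ideas in the abstract, conflicting gradients under a single head and separate heads per signal, can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the three-class example, and in particular the inference rule, which linearly mixes the two heads' probabilities using a hypothetical weight `alpha` and temperature `beta`. The gradient check is done at the logit level for one example, using the standard results that the cross-entropy gradient is `p - onehot(y)` and the gradient of KL(teacher ‖ student) is `p - t`.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; negative means they oppose."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def single_head_logit_grads(logits, label, teacher_probs):
    """Logit gradients of the two losses when both act on ONE shared head."""
    probs = softmax(logits)
    # d(cross-entropy)/d(logits) = p - onehot(label)
    grad_ce = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    # d(KL(teacher || student))/d(logits) = p - teacher_probs
    grad_kd = [p - t for p, t in zip(probs, teacher_probs)]
    return grad_ce, grad_kd

def dual_head_predict(ce_logits, kd_logits, alpha=0.5, beta=1.0):
    """Assumed inference rule: mix the probabilities of the two heads.

    alpha (mixing weight) and beta (temperature on the distillation head)
    are hypothetical knobs; they can be tuned after training because they
    only affect how the two heads' outputs are combined.
    """
    p_ce = softmax(ce_logits)
    p_kd = softmax(kd_logits, temperature=beta)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_ce, p_kd)]

# Toy case: the label says class 0, the teacher prefers class 1.
grad_ce, grad_kd = single_head_logit_grads(
    [0.0, 0.0, 0.0], label=0, teacher_probs=[0.1, 0.8, 0.1]
)
conflict = cosine_similarity(grad_ce, grad_kd)  # negative: the losses pull apart

# With two heads, each head receives only its own loss; predictions are mixed.
mixed = dual_head_predict([2.0, 0.5, 0.1], [0.3, 1.8, 0.2], alpha=0.5, beta=1.0)
```

In this toy case the two gradients point in opposing directions on the shared logits, which is the kind of conflict the abstract describes. Giving each loss its own head removes that tug-of-war at the output layer, although both losses still send gradients into the shared backbone.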

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 8·Confidence 4

Strengths

1. The paper presents a clear and intuitive solution to gradient conflicts in semi-supervised KD. Using separate heads for the supervised and distillation signals is intuitive yet effective, and it does not significantly increase overhead (<10% increase in parameter count and a negligible change in FLOPs).
2. The paper contributes a dual-head interpolation method that supports post-training hyperparameter tuning. It also provides a proof that it can be used for emulation of SHO hyperparamet…

Weaknesses

The paper provides a theoretical discussion of the gradient conflicts, but the discussion relies heavily on empirical findings. The theoretical discussion of why DHO resolves the gradient conflicts is likewise based on empirical results. It would be better to provide more insight into the reasoning. The paper mainly limits its discussion and experiments to VLM semi-supervised KD on classification tasks. Exploring detection or segmentation tasks would be a good next step.

Reviewer 02·Rating 2·Confidence 5

Strengths

- Clear paper writing and in-depth analysis of the gradients
- Promising performance and consistent gains across extensive experiments.

Weaknesses

- The claim of gradient conflicts is still empirical. While the derivations and visualizations are helpful, a more formal theoretical analysis is still needed. Also, why is cosine similarity used as the metric for determining conflicts? What about the dot product, or just the signs of the gradients themselves?
- The method seems too simple: just splitting two objectives into two heads? How, then, can you ensure that the gradient that reaches the deepest layer of the shared backbone is always…

Reviewer 03·Rating 4·Confidence 4

Strengths

1. Clear and easy-to-follow writing: The paper is well organized, with a logical flow from motivation to methodology and results. The problem is clearly stated, and the proposed solution is introduced in a concise and coherent manner, making the paper easy to read and understand.
2. Comprehensive theoretical analysis: The authors provide solid theoretical reasoning for the proposed Dual-Head Optimization framework. The analysis of gradient conflicts between supervised and distillation losses is…

Weaknesses

1. The proposed Dual-Head Optimization framework appears overly simple and provides limited new insights for the research community. The idea of using different heads for different losses is not novel and has already been explored in prior works that the authors themselves mention, such as SSKD and DHKD. Although the authors claim that these works "do not target distillation from foundation models or combine predictions at inference", this distinction seems marginal and insufficient to establish substantial novelty.

Code & Models

Repositories

erjui/DHO·PyTorch·Official

Models

🤗 erjui/dho·model


Taxonomy

Methods·Knowledge Distillation