TL;DR
This paper introduces Dual-Head Optimization (DHO), a simple yet effective semi-supervised knowledge distillation method that leverages vision-language models' capabilities, resolving gradient conflicts and improving performance across multiple datasets.
Contribution
The paper proposes DHO, a novel dual-head approach that enhances knowledge distillation from vision-language models, enabling better feature learning with minimal extra computation.
Findings
DHO outperforms traditional KD baselines on 15 datasets.
Smaller student models trained with DHO often surpass their teacher models.
Achieves state-of-the-art results on ImageNet semi-supervised and out-of-distribution tasks.
Abstract
Semi-supervised learning (SSL) has emerged as a practical solution for addressing data scarcity challenges by leveraging unlabeled data. Recently, vision-language models (VLMs), pre-trained on massive image-text pairs, have demonstrated remarkable zero-/few-shot performance that often surpasses SSL approaches due to their exceptional generalization capabilities. This gap motivates us to question: how can we effectively harness the powerful generalization capabilities of VLMs into task-specific models? Knowledge distillation (KD) offers a natural framework for transferring VLM capabilities, but we identify that it suffers from gradient conflicts between supervised and distillation losses. To address this challenge, we propose Dual-Head Optimization (DHO), which introduces dual prediction heads for each distinct signal. We observe that DHO resolves gradient conflicts, enabling improved…
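The dual-head idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The linear heads, the KL-based distillation loss, the helper names (`dual_head_losses`, `interpolate_heads`), and the mixing weight `alpha` are all assumptions made for illustration, and a fixed feature vector stands in for the shared backbone.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(features, weights):
    # weights holds one row of coefficients per class (no bias, for brevity).
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def cross_entropy(probs, label):
    return -math.log(probs[label])

def kl_divergence(teacher_probs, student_probs):
    # KL(teacher || student); terms with zero teacher mass contribute nothing.
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

def dual_head_losses(features, w_sup, w_kd, teacher_probs, label=None):
    # Assumption: both heads read the same backbone features, but each head
    # receives exactly one training signal.
    p_sup = softmax(linear(features, w_sup))  # supervised head
    p_kd = softmax(linear(features, w_kd))    # distillation head
    # Unlabeled samples contribute only the distillation term.
    sup_loss = cross_entropy(p_sup, label) if label is not None else 0.0
    kd_loss = kl_divergence(teacher_probs, p_kd)
    return sup_loss, kd_loss

def interpolate_heads(features, w_sup, w_kd, alpha=0.5):
    # Hypothetical inference rule: convex combination of the two heads'
    # probabilities. alpha can be changed after training without retraining.
    p_sup = softmax(linear(features, w_sup))
    p_kd = softmax(linear(features, w_kd))
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_sup, p_kd)]

# Toy usage: 2-dimensional features, 2 classes.
features = [1.0, 2.0]
w_sup = [[1.0, 0.0], [0.0, 1.0]]
w_kd = [[0.0, 1.0], [1.0, 0.0]]
teacher_probs = [0.5, 0.5]

sup_loss, kd_loss = dual_head_losses(features, w_sup, w_kd, teacher_probs, label=1)
mixed = interpolate_heads(features, w_sup, w_kd, alpha=0.5)
```

Because each loss touches only its own head in this sketch, the two signals can disagree only in the shared features, and the mixing weight remains a free parameter at inference time.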
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper presents a clear and intuitive solution to gradient conflicts in semi-supervised KD. Using separate heads for the supervised and distillation signals is intuitive but effective. It also does not significantly increase overhead (<10% increase in parameter count and a negligible change in FLOPs). 2. The paper contributes a dual-head interpolation method that supports post-training hyperparameter tuning. It also provides a proof that it can be used for emulation of SHO hyperparamet
The paper provides a theoretical discussion of the gradient conflicts, but the discussion relies heavily on empirical findings. The theoretical discussion of why DHO resolves the gradient conflicts is also based on empirical results. It would be better to provide more insight into the reasoning. The paper mainly limits its discussion and experiments to VLM semi-supervised KD on classification tasks. Exploring detection or segmentation tasks would be a good next step.
- Clear paper writing and in-depth analysis on the gradients - Promising performance and consistent gain via extensive experiments.
- The claim of gradient conflicts is still empirical. While the derivations and visualizations are helpful, a more formal theoretical analysis is still needed. Also, why do you consider cosine similarity as the metric for determining conflicts? How about the dot product, or just the signs of the gradients themselves? - The method seems to be too simple. Just splitting two objectives into two heads? Then, how can you ensure the gradients that reach the deepest layer of the shared backbone is always
1. Clear and easy-to-follow writing: The paper is well organized, with a logical flow from motivation to methodology and results. The problem is clearly stated, and the proposed solution is introduced in a concise and coherent manner, making the paper easy to read and understand. 2. Comprehensive theoretical analysis: The authors provide solid theoretical reasoning for the proposed Dual-Head Optimization framework. The analysis of gradient conflicts between supervised and distillation losses is
1. The proposed Dual-Head Optimization framework appears overly simple and provides limited new insights for the research community. The idea of using different heads for different losses is not novel and has already been explored in prior works such as SSKD and DHKD, which the authors mention. Although the authors claim that "they do not target distillation from foundation models or combine predictions at inference", this distinction seems marginal and insufficient to establish substantial novelty.
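One of the reviews above asks why cosine similarity is used to detect gradient conflicts instead of the dot product or per-coordinate signs. The sketch below compares the three on a toy pair of gradient vectors. The vectors and the helper names (`cosine_similarity`, `sign_agreement`) are hypothetical and are not taken from the paper.

```python
import math

def dot(g1, g2):
    return sum(a * b for a, b in zip(g1, g2))

def cosine_similarity(g1, g2):
    # Dot product normalized by both norms, so the value lies in [-1, 1].
    norm1 = math.sqrt(dot(g1, g1))
    norm2 = math.sqrt(dot(g2, g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot(g1, g2) / (norm1 * norm2)

def sign_agreement(g1, g2):
    # Fraction of coordinates where both gradients have the same nonzero sign.
    matches = sum(1 for a, b in zip(g1, g2) if a * b > 0)
    return matches / len(g1)

# Hypothetical gradients of the supervised and distillation losses with
# respect to the same shared parameters.
grad_sup = [1.0, -2.0, 0.5]
grad_kd = [-10.0, 20.0, 5.0]
grad_kd_scaled = [0.1 * g for g in grad_kd]

cos_full = cosine_similarity(grad_sup, grad_kd)
cos_scaled = cosine_similarity(grad_sup, grad_kd_scaled)
dot_full = dot(grad_sup, grad_kd)
dot_scaled = dot(grad_sup, grad_kd_scaled)
agreement = sign_agreement(grad_sup, grad_kd)
```

The dot product and the cosine always share a sign, so both flag the same pairs as conflicting. The dot product also scales with the gradient magnitudes, while the cosine does not, which makes cosine values comparable across layers and training steps. Sign agreement ignores magnitudes entirely, so a few large opposing coordinates can be outvoted by many small aligned ones.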
Taxonomy
Methods: Knowledge Distillation
