Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift
Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen

TL;DR
This paper introduces HALD, a new training paradigm that combines hard and soft labels to mitigate local semantic drift caused by limited soft-label supervision, improving generalization in dataset distillation and classification tasks.
Contribution
It theoretically analyzes semantic drift under soft labels and proposes a hybrid approach, HALD, that effectively integrates hard labels to enhance training stability and accuracy.
Findings
HALD improves generalization on ImageNet-1K by 9.0% over prior methods.
Hybridizing hard and soft labels reduces semantic drift and distribution misalignment.
Extensive experiments validate the effectiveness of the proposed approach.
Abstract
Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and for recent large-scale dataset distillation methods such as SRe2L, RDED, and LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under few…
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper identifies a concrete failure mode of low-SLC distillation: local crops do not always agree with the image-level target.
- The proposed Soft-Hard-Soft schedule is simple and easy to add to existing dataset distillation pipelines.
- Empirical results in the stated setting are clearly positive. The paper also attempts to give a theoretical view, which is not very common in this area.
### The research problem depends on a specific setting
The whole story assumes that the image-level hard label is the only correct semantic anchor. Any crop level target that is different from it is treated as drift. This is one possible viewpoint, but not the only reasonable one. If we look from a local visual viewpoint, then many of the paper’s examples are no longer drift. A crop that shows only the ball in a human and ball image can reasonably receive the class ball from the teacher. In this
1. The paper points out the problem of local semantic drift in soft-label based dataset distillation with a clear motivation.
2. The paper also theoretically analyzes the importance of introducing hard labels into soft-label based dataset distillation, demonstrating its ability to improve optimization stability and generalization ability. The paper is well-reasoned and supported.
3. The performance of the proposed method compared with baseline under the same storage budget is good.
4. The articl
1. When SLC is set to a fixed value (e.g., 300), the performance of the original method (which requires augmentation and generation of corresponding soft labels in each epoch during downstream training) degrades. However, SRe2L is improved compared to the original method. This phenomenon is strange (because according to Table 5, the performance actually decreases when SLC is reduced).
2. One question is that, according to the original method, methods like RDED can directly store the teacher mo
1. Clear Problem Identification: The paper clearly identifies, names, and illustrates (Fig. 1) a practical and important problem (LVSD) that arises from the very real constraint of soft-label storage costs.
2. Strong Empirical Results: When properly isolated (in the ablations), the HALD training schedule provides massive, consistent improvements over a Soft-Only baseline. Table 5 shows HALD boosts performance across all tested generation methods (e.g., +10.5% for FADRM on ImageNet-1K, IPC=10, S
1. Missing Key Baseline: The paper's novelty rests on its "Soft-Hard-Soft" schedule. However, it fails to compare against the most obvious and simpler baseline: a static combined loss (e.g., $\mathcal{L} = \mathcal{L}_{soft} + \lambda \mathcal{L}_{hard}$), which is related to prior work cited by the authors (e.g., GIFT). This makes it difficult to assess if the complex schedule is truly necessary.
2. The paper's true contribution is best isolated in Table 5, which clearly compares Soft-Only vs.
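The contrast raised in the last review, a static combined loss versus a phase-switched schedule, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the default `lam=0.5`, and the phase boundaries `soft_end` and `hard_end` are all hypothetical, and the excerpt above does not say how HALD actually switches between soft and hard supervision, so `phased_loss` should not be read as the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def soft_loss(logits, teacher_probs):
    """Cross-entropy between a teacher soft label and the student prediction."""
    log_p = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(teacher_probs, log_p))


def hard_loss(logits, label):
    """Standard cross-entropy against the image-level hard label."""
    return -log_softmax(logits)[label]


def static_combined_loss(logits, teacher_probs, label, lam=0.5):
    """The reviewer's baseline: L = L_soft + lam * L_hard (lam is an assumed value)."""
    return soft_loss(logits, teacher_probs) + lam * hard_loss(logits, label)


def phased_loss(logits, teacher_probs, label, epoch, soft_end, hard_end):
    """Hypothetical soft -> hard -> soft schedule keyed on the epoch index.

    Epochs in [soft_end, hard_end) use the hard label; all others use the
    soft label. The boundaries are illustrative, not taken from the paper.
    """
    if soft_end <= epoch < hard_end:
        return hard_loss(logits, label)
    return soft_loss(logits, teacher_probs)


if __name__ == "__main__":
    # A crop whose teacher label leans toward class 1 while the image label is 0.
    logits, teacher, label = [2.0, 0.0], [0.3, 0.7], 0
    print("static:", static_combined_loss(logits, teacher, label))
    for epoch in range(6):
        print(epoch, phased_loss(logits, teacher, label, epoch, soft_end=2, hard_end=4))
```

The static variant applies both terms at every step, whereas the phased variant applies exactly one term per epoch; comparing the two under the same budget is the ablation the reviewer asks for.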
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Data Stream Mining Techniques · Machine Learning and Data Classification
