TL;DR
This paper introduces Dynamic Connection Masking (DCM), a novel regularization technique for neural networks that adaptively masks less important connections to improve robustness against noisy labels, outperforming existing methods.
Contribution
The paper proposes DCM, a new regularization mechanism inspired by sparsity regularization in KANs, enhancing noise robustness in neural networks and exploring KANs as classifiers against noisy labels.
Findings
DCM improves robustness of neural networks to noisy labels.
KAN classifiers outperform MLPs in noisy label scenarios.
Extensive experiments show DCM surpasses state-of-the-art methods.
Abstract
Noisy labels are inevitable in real-world scenarios. Due to the strong capacity of deep neural networks to memorize corrupted labels, these noisy labels can cause significant performance degradation. Existing research on mitigating the negative effects of noisy labels has mainly focused on robust loss functions and sample selection, with comparatively limited exploration of regularization in model architecture. Inspired by the sparsity regularization used in Kolmogorov-Arnold Networks (KANs), we propose a Dynamic Connection Masking (DCM) mechanism for both Multi-Layer Perceptron Networks (MLPs) and KANs to enhance the robustness of classifiers against noisy labels. The mechanism can adaptively mask less important edges during training by evaluating their information-carrying capacity. Through theoretical analysis, we demonstrate its efficiency in reducing gradient error. Our approach…
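The masking idea described above can be illustrated with a minimal NumPy sketch for a single fully connected layer. It is not the paper's implementation: the equations are not reproduced on this page, so the scoring rule (the batch-level standard deviation of each edge's contribution `w[i, j] * x[:, j]`), the per-output-unit selection, and the names `edge_importance`, `dcm_mask`, `masked_forward`, and the masking ratio `p` are all assumptions made for illustration.

```python
import numpy as np


def edge_importance(x, w):
    """Score each edge of a fully connected layer.

    x: (B, d_in) batch of inputs, w: (d_out, d_in) weights.
    Assumption: the score of edge (i, j) is the standard deviation over
    the batch of its contribution w[i, j] * x[:, j], which equals
    |w[i, j]| * std(x[:, j]).
    """
    return np.abs(w) * x.std(axis=0)[None, :]


def dcm_mask(scores, p):
    """Return a 0/1 mask that drops the fraction p of lowest-scoring
    incoming edges for every output unit (per-unit selection is an
    assumption of this sketch)."""
    d_in = scores.shape[1]
    k = int(p * d_in)  # number of edges to drop per output unit
    mask = np.ones_like(scores)
    if k > 0:
        drop = np.argsort(scores, axis=1)[:, :k]  # indices of the k smallest scores
        np.put_along_axis(mask, drop, 0.0, axis=1)
    return mask


def masked_forward(x, w, b, p):
    """Forward pass of the layer with low-importance edges masked out."""
    mask = dcm_mask(edge_importance(x, w), p)
    return x @ (w * mask).T + b, mask


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))   # batch of B = 8 inputs
x[:, 0] = 1.0                 # a constant feature carries no batch-level variation
w = rng.normal(size=(3, 5))
b = np.zeros(3)
out, mask = masked_forward(x, w, b, p=0.4)
print(mask)
```

Because the scores are recomputed from each batch, the mask changes as training proceeds. In this sketch the constant feature gets a score of zero and is therefore always among the masked edges. The scores are estimated from only `B` samples, which is why the reviewers ask about sensitivity to batch size.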
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper proposes a novel perspective, based on model architecture, for mitigating the label-noise signal. 2. The paper is well written and clearly presented. 3. The method design sounds reasonable.
1. Instead of the KAN network, the experiments should include additional trials on CLIP (a transformer-based network). 2. More ablation studies are required. - In Equation 2, the importance score is computed at the batch level. Is a larger value of $B$ important for accurate estimation and stable training? - The selection of the hyper-parameter $p$. 3. Why does applying the mask in both training and testing lead to worse performance than applying it only in training? Intuitively, the parameter
- The paper is clearly written and easy to follow.
- Lack of theoretical motivation: There is neither theoretical analysis nor empirical validation to justify the claimed relationship between edge importance scoring and noisy-label learning. - Lack of empirical investigation: The authors should conduct experiments under normal training settings without applying DCM, record the edge importance scores throughout training, and validate whether the edges that DCM intends to mask are connected to noisy labels. - Novelty concern: The idea of assigni
The proposed method is novel. While selective parameter updates have been explored in fields such as multi-task learning and continual learning, to the best of my knowledge, this is the first work to apply such a mechanism within the LNL domain. Moreover, leveraging the batch-level standard deviation to measure activation importance is a highly reasonable design choice that aligns well with the noise characteristics of LNL settings. Although some of the experimental improvements are modest, the
The introduction of Kolmogorov–Arnold Networks (KAN) is intriguing, but its connection to the main contribution of the paper is not clearly articulated. It remains unclear whether integrating DCM with KAN provides additional benefits or merely serves as a separate demonstration. If it is the former, an ablation study isolating the effects of KAN would substantially strengthen the argument. Additionally, the proposed approach appears highly sensitive to batch size. Since the method computes impo
- The paper is nicely written, and the clear presentation makes the method and the results easy to follow. - The paper proposes a simple and robust technique that handles noisy labels in real-world datasets. - The technique is scalable, since the proposed network regularization is applied to the FC layer of the network. - A nice, novel network regularization method, in contrast to the dominant SOTA approaches that rely on loss functions to deal with noisily labeled data.
- Dropout-style SOTA methods are generally not applied at inference time; for DCM, it is unclear whether it is applied during inference. - The edge importance scoring (Eqs. 1-2) and masking (Eqs. 3-4) formulations are easy to understand when applied to an FC layer, and they are presented with that use case. However, it is unclear how these formulations change when applied to a KAN. - There is enough evidence for synthetic noise and a couple of real-world noise benchmark
