Dual Distillation for Few-Shot Anomaly Detection
Le Dong, Qinzhong Tan, Chunlei Li, Jingliang Hu, Yilei Shi, Weisheng Dong, Xiao Xiang Zhu, Lichao Mou

TL;DR
This paper introduces D$^2$4FAD, a dual distillation framework for few-shot anomaly detection in medical imaging, achieving state-of-the-art results with limited normal reference images across diverse organs and modalities.
Contribution
The paper proposes a novel dual distillation approach with a learn-to-weight mechanism for improved few-shot anomaly detection in medical images.
Findings
Outperforms existing methods on a large multi-organ benchmark
Achieves significant improvements in anomaly detection accuracy
Demonstrates robustness across different organs and imaging modalities
Abstract
Anomaly detection is a critical task in computer vision with profound implications for medical imaging, where identifying pathologies early can directly impact patient outcomes. While recent unsupervised anomaly detection approaches show promise, they require substantial normal training data and struggle to generalize across anatomical contexts. We introduce D$^2$4FAD, a novel dual distillation framework for few-shot anomaly detection that identifies anomalies in previously unseen tasks using only a small number of normal reference images. Our approach leverages a pre-trained encoder as a teacher network to extract multi-scale features from both support and query images, while a student decoder learns to distill knowledge from the teacher on query images and self-distill on support images. We further propose a learn-to-weight mechanism that dynamically assesses the reference value of…
Peer Reviews
Decision: ICLR 2026 Poster
- Clear motivation and task definition for few-shot anomaly detection in medical settings.
- Architecture is simple, fast, and avoids large generative models.
- Strong image-level AUROC across multiple datasets and shot settings.
- The work repeatedly emphasizes “dual distillation” as a key contribution, but the process does not fully match established definitions of distillation in the literature. Since the teacher network is frozen, and the student is not learning logits or semantic knowledge but merely reconstructing features, the term distillation may be overstated. This weakens the conceptual positioning of the contribution: the method is an anomaly-detection reconstruction framework rather than a genuine knowledge-
1. The paper introduces a clear and well-motivated dual-distillation framework ($D^24FAD$) for few-shot anomaly detection, combining a teacher–student distillation mechanism with an additional student self-distillation path and a learn-to-weight module that adaptively re-weights support images conditioned on the query. While the core components (knowledge distillation, few-shot learning) are known, their integration into a unified few-shot medical anomaly detection framework is novel and concept
The main limitation of the paper lies in the formulation of the task. Although the work is presented as addressing few-shot anomaly detection, the evaluation is restricted to image-level AUROC, effectively turning the problem into a binary classification task (normal versus abnormal). While the model internally produces anomaly maps, no quantitative localization results are provided (e.g., Dice, IoU, or AUPRO). This simplification reduces the methodological complexity of the problem and limits t
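For reference, the image-level protocol this review describes can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: reducing an anomaly map to a single score by taking its maximum is an assumption, and `image_score` and `auroc` are hypothetical helper names.

```python
def image_score(anomaly_map):
    """Collapse a 2D anomaly map to one image-level score.

    Taking the maximum pixel value is an assumption for illustration;
    the paper's actual reduction is not stated in this excerpt.
    """
    return max(max(row) for row in anomaly_map)


def auroc(scores, labels):
    """Image-level AUROC as the probability that a random abnormal image
    (label 1) scores higher than a random normal image (label 0).
    Ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one abnormal image")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy anomaly maps: two abnormal images (label 1) and two normal ones (label 0).
maps = [
    [[0.1, 0.9], [0.2, 0.1]],  # abnormal, strong response
    [[0.1, 0.2], [0.1, 0.1]],  # normal
    [[0.3, 0.1], [0.2, 0.1]],  # abnormal, weak response
    [[0.4, 0.1], [0.1, 0.1]],  # normal, with a spurious response
]
labels = [1, 0, 1, 0]
scores = [image_score(m) for m in maps]
result = auroc(scores, labels)
```

Because only one number per image enters the metric, the spatial layout of the map is never evaluated, which is the gap the reviewer points to when asking for Dice, IoU, or AUPRO.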
Clear Motivation and Problem Formulation: The paper does an excellent job of motivating the need for few-shot anomaly detection in clinical practice, grounding the research in a real-world problem. The formalization of the FAD task is clear and precise. Elegant and Effective Method: The D²4FAD framework is simple yet powerful. The dual distillation concept is intuitive and well-justified. By using a frozen pre-trained encoder as the teacher, the method is parameter-efficient and avoids the need
Limited Technical Depth in "Learn-to-Weight": While the "learn-to-weight" mechanism (Eq. 4) is a good idea, its presentation is somewhat brief. It is essentially a scaled dot-product attention between the query and support features. The paper could benefit from a deeper analysis or discussion of this component. For example, are there other ways to instantiate this weighting? How does this mechanism behave in practice (e.g., does it learn to ignore outlier-like support images)? Sensitivity to th
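To make the reviewer's reading of the learn-to-weight mechanism concrete, here is a minimal sketch of scaled dot-product weighting of support images conditioned on a query. It follows the reviewer's description only; Eq. 4 itself is not reproduced in this excerpt, so the pooled per-image feature vectors, the absence of learned projections, and the names `support_weights` and `weighted_reference` are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def support_weights(query_feat, support_feats):
    """One weight per support image: softmax of the query-support dot
    products, scaled by sqrt(d) as in standard attention.
    (Assumption: each image is summarized by a single feature vector.)"""
    d = len(query_feat)
    scores = [
        sum(q * s for q, s in zip(query_feat, feat)) / math.sqrt(d)
        for feat in support_feats
    ]
    return softmax(scores)


def weighted_reference(query_feat, support_feats):
    """Weighted average of the support features, using the weights above."""
    w = support_weights(query_feat, support_feats)
    d = len(query_feat)
    return [
        sum(w[k] * support_feats[k][i] for k in range(len(support_feats)))
        for i in range(d)
    ]


# Toy example: three support images, the first aligned with the query.
query = [1.0, 0.0]
supports = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = support_weights(query, supports)
reference = weighted_reference(query, supports)
```

In this toy case the support vector pointing away from the query receives the smallest weight, which is the kind of behavior the reviewer asks the authors to analyze for outlier-like support images.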
Taxonomy
Topics: Anomaly Detection Techniques and Applications · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning
