TL;DR
This paper introduces a hierarchical knowledge distillation framework with semantic conditioning to improve the stability and accuracy of point-supervised infrared small target detection, leveraging a frozen Vision Foundation Model.
Contribution
It casts point-supervised training as a bilevel optimization problem and adds semantic-conditioned affine modulation (SCAM) and collaborative learning, with the aim of improving pseudo-label quality and training stability.
Findings
Detection accuracy improves consistently across multiple backbones.
Training is more stable because pseudo-label noise is mitigated.
Effective use of a frozen Vision Foundation Model as a semantic prior.
Abstract
Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN…
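The abstract is cut off before it describes SCAM, so the paper's actual design is not available here. As a rough illustration of what "affine modulation conditioned on semantics" usually means, the sketch below applies a generic FiLM-style per-channel scale and shift to a CNN feature map. Everything in it is an assumption made for illustration: the function name affine_modulate, the projection matrices w_gamma and w_beta, the tensor shapes, and the random vector standing in for the frozen VFM's embedding.

```python
import numpy as np


def affine_modulate(cnn_feat, sem_vec, w_gamma, w_beta):
    """Generic FiLM-style modulation (an assumption, not the paper's SCAM).

    cnn_feat: (C, H, W) feature map from a CNN detector
    sem_vec:  (D,) semantic embedding, a stand-in for frozen VFM output
    w_gamma, w_beta: (C, D) hypothetical learned projection matrices
    """
    # Per-channel scale, centred at 1 so zero projections leave features unchanged.
    gamma = 1.0 + w_gamma @ sem_vec
    # Per-channel shift.
    beta = w_beta @ sem_vec
    # Broadcast the (C,) scale and shift over the spatial dimensions.
    return gamma[:, None, None] * cnn_feat + beta[:, None, None]


rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16
feat = rng.standard_normal((C, H, W))
sem = rng.standard_normal(D)

# With zero projections the modulation reduces to the identity mapping.
out_identity = affine_modulate(feat, sem, np.zeros((C, D)), np.zeros((C, D)))

# With small random projections each channel is rescaled and shifted.
out = affine_modulate(
    feat,
    sem,
    0.1 * rng.standard_normal((C, D)),
    0.1 * rng.standard_normal((C, D)),
)
```

Centring the scale at 1 is a common choice for this kind of conditioning because the modulated branch starts out as a no-op and the semantic signal is blended in gradually during training. Whether SCAM does the same is not stated in the text above.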
