TL;DR
Res$^2$CLIP introduces a residual-to-residual alignment framework within CLIP to improve few-shot generalist anomaly detection, effectively handling fine-grained differences and preserving open-world generalization.
Contribution
The paper presents Res$^2$CLIP as the first residual-to-residual alignment approach, one that symmetrically bridges the visual and text modalities within CLIP's residual space for anomaly detection.
Findings
Effective across multiple anomaly detection datasets.
Addresses fine-grained normal feature differences.
Maintains CLIP's open-world generalization.
Abstract
Few-shot Generalist Anomaly Detection requires models to generalize to novel categories without retraining, posing significant challenges in real-world scenarios with scarce samples and rapidly changing categories. Existing CLIP-based methods face two major challenges: coarse-grained unified text prompts struggle to adapt to fine-grained foreground-background differences, causing cross-granularity mismatch; and fine-tuning on auxiliary datasets disrupts CLIP's inherent open-world generalization due to domain shift, leading to cross-category generalization degradation. To address these, we propose to shift multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases, simultaneously resolving both problems. Based on this insight, Res$^2$CLIP, the first…
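The abstract does not say how the residual representations are computed, so the following is only an illustrative sketch of why a residual space can cancel region-specific and class-specific normal appearance. It assumes a residual is a query feature minus its most similar few-shot normal reference feature; that matching rule, the toy 3-dimensional features, and the names `cosine`, `residual`, and `residual_norm` are all hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def residual(query, references):
    """Assumed rule: subtract the most similar normal reference feature from the query."""
    nearest = max(references, key=lambda ref: cosine(query, ref))
    return [q - n for q, n in zip(query, nearest)]


def residual_norm(query, references):
    """Length of the residual vector; larger means further from any normal reference."""
    return math.sqrt(sum(x * x for x in residual(query, references)))


# Toy normal references standing in for two different regions or categories.
normal_refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Normal queries from either region collapse to the same zero residual,
# even though their raw features differ.
normal_a = [1.0, 0.0, 0.0]
normal_b = [0.0, 1.0, 0.0]

# A query that deviates from its nearest reference keeps a non-zero residual.
anomalous = [0.8, 0.0, 0.6]

print(residual(normal_a, normal_refs))
print(residual(normal_b, normal_refs))
print(residual_norm(anomalous, normal_refs))
```

In this toy setting, the two normal queries have different raw features but identical residuals, which is the sense in which region- or class-specific differences drop out while the deviation of the anomalous query survives.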
