Learning to Be a Transformer to Pinpoint Anomalies
Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano

TL;DR
This paper introduces a Teacher-Student framework using high-resolution images and pre-trained features to improve anomaly detection and segmentation, especially for tiny defects, with faster processing and state-of-the-art results.
Contribution
The paper presents a novel Teacher-Student paradigm leveraging pre-trained vision Transformers and shallow MLPs to enhance high-resolution anomaly detection and segmentation.
Findings
Achieves state-of-the-art performance on MVTec AD.
Runs significantly faster than existing methods.
Excels at detecting both large and tiny anomalies.
Abstract
To efficiently deploy strong, often pre-trained feature extractors, recent Industrial Anomaly Detection and Segmentation (IADS) methods process low-resolution images, e.g., 224x224 pixels, obtained by downsampling the original input images. However, while numerous industrial applications demand the identification of both large and small defects, downsampling the input image to a low resolution may hinder a method's ability to pinpoint tiny anomalies. We propose a novel Teacher-Student paradigm to leverage strong pre-trained features while processing high-resolution input images very efficiently. The core idea is to train two shallow MLPs (the Students) on nominal images so that they mimic the mappings between the patch embeddings induced by the self-attention layers of a frozen vision Transformer (the Teacher). Indeed, learning these mappings sets forth a challenging pretext task…
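The core idea in the abstract can be illustrated with a toy sketch: a shallow MLP Student is fitted on nominal patch embeddings to predict how a frozen Teacher maps them from one layer to another, and the per-patch prediction error then serves as an anomaly score. Everything concrete here is an assumption made for illustration and is not taken from the paper: the Teacher is replaced by a fixed random map (`teacher_pair`), only one Student is trained (the paper uses two), the sizes `DIM` and `HIDDEN` are arbitrary, and anomalies are simulated by shifting the deeper embeddings so that they break the nominal relationship.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN = 8, 32  # toy sizes (assumption), far smaller than real ViT embeddings

# Stand-in for the frozen Teacher (assumption): a fixed map from a patch
# embedding at one layer to the embedding of the same patch at a later layer.
A = rng.normal(0.0, 0.5, (DIM, DIM))

def teacher_pair(n):
    """Return n nominal (shallow, deep) patch-embedding pairs."""
    shallow = rng.normal(0.0, 1.0, (n, DIM))
    deep = np.tanh(shallow @ A)
    return shallow, deep

def init_student():
    """One-hidden-layer MLP parameters (the 'shallow MLP' Student)."""
    return {
        "W1": rng.normal(0.0, 0.1, (DIM, HIDDEN)), "b1": np.zeros(HIDDEN),
        "W2": rng.normal(0.0, 0.1, (HIDDEN, DIM)), "b2": np.zeros(DIM),
    }

def student_forward(p, x):
    h = np.maximum(x @ p["W1"] + p["b1"], 0.0)  # ReLU hidden layer
    return h @ p["W2"] + p["b2"], h

def train_step(p, x, y, lr=0.02):
    """One full-batch gradient step on the mean squared L2 error per patch."""
    pred, h = student_forward(p, x)
    err = pred - y
    g_out = 2.0 * err / x.shape[0]        # d(loss)/d(pred)
    g_h = (g_out @ p["W2"].T) * (h > 0)   # backprop through ReLU
    grads = {
        "W2": h.T @ g_out, "b2": g_out.sum(axis=0),
        "W1": x.T @ g_h, "b1": g_h.sum(axis=0),
    }
    for name, g in grads.items():
        p[name] -= lr * g
    return float((err ** 2).sum(axis=1).mean())  # loss before the update

def patch_scores(p, shallow, deep):
    """Anomaly score per patch: squared error between Student and Teacher."""
    pred, _ = student_forward(p, shallow)
    return ((pred - deep) ** 2).sum(axis=1)

# Train the Student on nominal pairs only.
student = init_student()
x_train, y_train = teacher_pair(256)
losses = [train_step(student, x_train, y_train) for _ in range(300)]

# Score held-out patches; the first 8 are made anomalous by shifting the
# deep embeddings away from the nominal mapping (toy simulation).
x_test, y_test = teacher_pair(64)
is_anomalous = np.zeros(64, dtype=bool)
is_anomalous[:8] = True
y_test[is_anomalous] += 3.0
scores = patch_scores(student, x_test, y_test)
```

In this toy setup the Student only ever sees nominal pairs, so patches whose embeddings do not follow the learned mapping receive larger scores. Reshaping `scores` to the patch grid would give a coarse anomaly map; how the paper turns such errors into its final segmentation is not covered by this excerpt.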
Taxonomy
Topics: Industrial Vision Systems and Defect Detection
Methods: Attention Is All You Need · Softmax · Layer Normalization · Focus · Linear Layer · Dense Connections · Residual Connection · Multi-Head Attention · Vision Transformer
