PASTA: Vision Transformer Patch Aggregation for Weakly Supervised Target and Anomaly Segmentation
Melanie Neubauer, Elmar Rueckert, Christian Rauch

TL;DR
PASTA introduces a weakly supervised vision transformer-based method for real-time, pixel-precise target and anomaly segmentation in unstructured environments, reducing training time significantly.
Contribution
It presents a novel weakly supervised pipeline using ViT features and semantic prompts for zero-shot segmentation, outperforming domain-specific baselines.
Findings
Achieves up to 88.3% IoU for target segmentation.
Reduces training time by 75.8% compared to baselines.
Performs well on industrial and agricultural datasets.
Abstract
Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real-time processing, pixel-level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose a weakly supervised pipeline for object segmentation and classification using weak image-level supervision called 'Patch Aggregation for Segmentation of Targets and Anomalies' (PASTA). By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text-prompts via the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
