TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

Qihang Zhou; Binbin Gao; Guansong Pang; Xin Wang; Jiming Chen; Shibo He

arXiv:2510.21171 · cs.CV · March 2, 2026


Open Access · 3 Reviews

TL;DR

TokenCLIP introduces a token-wise prompt learning framework that dynamically aligns visual tokens with specialized textual subspaces using optimal transport, significantly improving zero-shot anomaly detection across diverse objects and domains.

Contribution

It proposes a novel token-wise adaptation method with dynamic alignment via optimal transport, enabling fine-grained, efficient zero-shot anomaly detection.

Findings

01

Outperforms existing zero-shot anomaly detection methods.

02

Effectively captures varied anomaly semantics across different objects.

03

Demonstrates superior performance on multiple benchmark datasets.

Abstract

Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of…
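To illustrate the kind of optimal-transport alignment the TL;DR and abstract refer to, here is a minimal sketch. It is not the authors' implementation. It assumes entropic regularization solved with Sinkhorn iterations, a cosine-distance cost, and uniform marginals on both tokens and subspaces. The function names, `eps`, `n_iters`, and the random inputs are hypothetical choices made only for this example.

```python
import numpy as np

def sinkhorn_plan(cost, row_marginal, col_marginal, eps=0.1, n_iters=200):
    """Entropic-regularized OT plan via Sinkhorn iterations (illustrative only).

    cost: (N, K) matrix of transport costs between N tokens and K subspaces.
    """
    kernel = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)       # rescale rows toward row_marginal
        v = col_marginal / (kernel.T @ u)     # rescale columns toward col_marginal
    return u[:, None] * kernel * v[None, :]

def token_subspace_plan(tokens, prototypes, eps=0.1, n_iters=200):
    """Soft assignment of visual tokens to subspace prototypes (assumed setup)."""
    tokens = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    prototypes = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cost = 1.0 - tokens @ prototypes.T        # cosine distance, in [0, 2]
    n, k = cost.shape
    a = np.full(n, 1.0 / n)                   # assumption: uniform token marginal
    b = np.full(k, 1.0 / k)                   # assumption: uniform subspace marginal
    return sinkhorn_plan(cost, a, b, eps, n_iters)

# Hypothetical inputs: 16 visual tokens and 4 subspace prototypes, dimension 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
prototypes = rng.normal(size=(4, 8))

plan = token_subspace_plan(tokens, prototypes)
# Per-token weights over subspaces: normalize each row of the plan.
weights = plan / plan.sum(axis=1, keepdims=True)
```

Each row of `weights` gives one token's soft assignment over the subspaces. Because the loop ends with the column update, the column marginal is matched exactly, while the row marginal is matched only approximately until the iterations converge.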

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The method is well designed, with a clear and coherent architecture.
2. The writing is fluent and clear.
3. The experimental coverage is fairly comprehensive, spanning multiple datasets.

Weaknesses

1. No localization visualizations are provided, only numerical results, which weakens the credibility of the experiments.
2. There are many hyperparameters; does performance require per-dataset tuning?
3. Some implementation details are missing; for example, the OT marginals and weight settings are insufficiently specified.
4. There are several minor errors, such as multiple citations in the introduction rendered as “?”.

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The proposed method demonstrates SOTA performance across multiple datasets.
2. The central idea of employing textual subspaces and formulating the alignment as an Optimal Transport problem is novel for this task.

Weaknesses

1. The claim that the learned spaces are "textual subspaces" is not sufficiently justified with theoretical or experimental evidence. CLIP-based zero-shot anomaly detection operates on the premise that CLIP has aligned visual features with semantic textual features during pre-training. Detection is achieved by comparing image features against textual embeddings that explicitly represent concepts like "normal" and "abnormal." However, in this work, the embeddings within the so-called "textual subs…

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper identifies a clear and significant limitation in current ZSAD methods and proposes a well-motivated and elegant solution.
2. The application of Optimal Transport to dynamically align visual tokens with a set of learned subspaces is a novel contribution to the field of anomaly detection. This formulation provides a principled way to achieve fine-grained, many-to-many correspondence.
3. The experimental results are strong and comprehensive. The method shows consistent and significant…

Weaknesses

1. The mechanism that drives the semantic specialization of the subspaces could be explained more clearly. The paper attributes this to the minimal cost objective of OT. However, OT's primary role is to find the most efficient matching between two fixed distributions. The specialization itself may heavily rely on the orthogonality regularization term $L_{reg}$, which explicitly forces the subspaces to be distinct. The paper would be stronger if it disentangled the contribution of OT's cost minim…
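The exact form of $L_{reg}$ is not given on this page. As a rough illustration of what an orthogonality regularizer over subspace embeddings can look like, the sketch below penalizes the squared off-diagonal entries of the Gram matrix of unit-normalized embeddings. This is an assumed, commonly used formulation rather than the paper's definition, and the function name and inputs are hypothetical.

```python
import numpy as np

def orthogonality_penalty(subspace_embeddings):
    """Sum of squared pairwise cosine similarities between distinct rows.

    Assumed form for illustration; the paper's L_reg may differ.
    subspace_embeddings: (K, D) array, one embedding per subspace.
    """
    norms = np.linalg.norm(subspace_embeddings, axis=1, keepdims=True)
    unit = subspace_embeddings / norms
    gram = unit @ unit.T                       # (K, K) cosine similarities
    off_diag = gram - np.eye(gram.shape[0])    # diagonal is 1 after normalization
    return float(np.sum(off_diag ** 2))

# Mutually orthogonal embeddings incur no penalty; parallel ones are penalized.
orthogonal = np.eye(3)
parallel = np.array([[1.0, 0.0], [2.0, 0.0]])
penalty_orthogonal = orthogonality_penalty(orthogonal)
penalty_parallel = orthogonality_penalty(parallel)
```

Under this assumed form, the penalty is zero only when all subspace embeddings are mutually orthogonal. That makes it easy to switch off in an ablation, which would separate its effect from that of the OT cost.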

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning