VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data
Xuefeng Du, Reshmi Ghosh, Robert Sim, Ahmed Salem, Vitor Carvalho, Emily Lawton, Yixuan Li, Jack W. Stokes

TL;DR
VLMGuard is a novel framework that detects malicious prompts in vision-language models by leveraging unlabeled data and an automated maliciousness score, eliminating the need for human annotations.
Contribution
It introduces a practical, annotation-free learning framework that effectively detects malicious prompts in VLMs using unlabeled data and automated maliciousness estimation.
Findings
VLMGuard outperforms existing detection methods.
The framework effectively leverages unlabeled prompts.
It does not require additional human annotations.
Abstract
Vision-language models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs presents significant risks, leading to compromised outputs and raising concerns about the reliability in VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the lack of a large amount of labeled benign and malicious data. To address the issue, we introduce VLMGuard, a novel learning framework that leverages the unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which naturally arise when VLMs are deployed in the open world, consist of both benign and malicious information. To harness the unlabeled data, we present an automated…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- New Problem Definition:
  - The paper presents a practical solution to reduce dependency on labeled data, which is particularly valuable because manually labeling malicious prompts is time-consuming and expensive.
- Technical Approach:
  - The proposed maliciousness scoring mechanism uses VLM's internal representations, which is computationally efficient as it requires only a single forward pass.
  - The scoring function $\kappa_i = \frac{1}{k} \sum \left( \lambda_j \cdot \langle f_i, v_j \ra
- The authors offer some geometric intuition and empirical validation, but the theoretical foundation could be clearer in a few areas:
  - The choice of SVD subspace analysis, while effective empirically, lacks a solid theoretical basis to confirm its effectiveness in detecting malicious patterns.
  - The current geometric explanation would be stronger with a formal analysis showing why this property holds across different types of attacks.
  - The authors use the common approach of last-token e
- This paper focuses on a critical safety issue: the misuse of VLMs through malicious or adversarial user inputs. This topic is increasingly important due to the growing popularity and widespread deployment of VLMs. - VLMGuard introduces an interesting approach by utilizing unlabeled user inputs to enhance the detection of malicious content. This method presents a promising and effective solution to the problem.
1. **The motivation behind VLMGuard is unclear.** While it is purportedly designed for VLMs, the integration of VLM concepts into the method is not evident. VLMGuard appears to function as a general binary classifier using extracted latent features applicable to any deep neural network. The lack of a clear rationale and organized presentation diminishes the method's potential significance.
2. **The presentation is wordy and lacks informativeness.** The introduction fails to provide an overarchi
- This paper explores defenses against malicious generations in VLMs and offers a useful reference for research in this area.
- The proposed method, VLMGuard, is simple yet effective, and the strong experimental results support this.
- The ablation study is well organized and clearly demonstrates each part of the proposed method, which makes the paper easy to follow.
- I am curious about why the binary classifier outperforms the direct use of the maliciousness score for detection, as illustrated in Fig. 4. The training dataset is based on an unlabeled dataset that has been annotated with maliciousness scores. Consequently, the accuracy of the binary classifier relies on the quality of these annotations, which in turn depends on the effectiveness of the maliciousness score detection. This raises the question: **is the upper bound of the binary classifier's pe
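The setup this question refers to (threshold the automated score into pseudo-labels, then train a binary classifier on them) can be sketched as follows. This is only an illustration of the mechanics, not the authors' training procedure: the synthetic features, the noisy `scores`, the median threshold, and the helper `train_logistic` are all hypothetical, and the sketch does not settle whether a classifier can exceed the score it was trained from.

```python
import numpy as np

def train_logistic(x, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (hypothetical helper)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability
        grad = p - y                            # gradient of the log-loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-ins for prompt embeddings of two classes.
benign = rng.normal(-1.0, 1.0, size=(200, 8))
malicious = rng.normal(+1.0, 1.0, size=(200, 8))
x = np.vstack([benign, malicious])
true_y = np.concatenate([np.zeros(200), np.ones(200)])

# ASSUMPTION: a noisy automated score that only partially reflects the true label.
scores = x[:, 0] + rng.normal(0.0, 1.0, size=len(x))

# Pseudo-labels come from thresholding the score; no human labels are used for training.
threshold = np.median(scores)
pseudo_y = (scores > threshold).astype(float)

w, b = train_logistic(x, pseudo_y)
clf_pred = (x @ w + b > 0).astype(float)

# true_y is used only for evaluation here.
acc_score = float(((scores > threshold) == true_y).mean())
acc_clf = float((clf_pred == true_y).mean())
print(f"score threshold accuracy: {acc_score:.3f}")
print(f"classifier accuracy:      {acc_clf:.3f}")
```

Comparing `acc_score` and `acc_clf` on held-out labeled data is one way to probe the reviewer's question empirically; the outcome depends on how noisy the score is and on how much label-relevant structure the features carry beyond it.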
Clarity and Simplicity of the Approach: The overall idea and methodology presented in this paper are highly intuitive and easy to follow. The process of feeding inputs into the model to obtain embeddings, followed by performing SVD, and finally identifying outliers, is clear and logically structured. This clarity allows for smooth comprehension of the workflow, making the contributions more accessible to both researchers and practitioners. Significance in Addressing a Critical Problem: The pape
Limited Novelty in Core Contribution: The core contribution of this work—applying SVD to detect malicious prompts—while effective, does not appear to be particularly novel. SVD has been extensively used in anomaly detection tasks across various domains, and its direct application here may lack the originality expected in top-tier conference submissions. The paper could benefit from further emphasizing any unique insights or enhancements introduced in the specific context of VLMs and malicious pr
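The workflow the reviewers describe (embed each prompt, run SVD on the embeddings, flag outliers) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the synthetic features, the function name `maliciousness_scores`, the mean-centering step, and the squared-projection weighting are all assumptions, since the scoring formula quoted in the reviews is truncated in this excerpt.

```python
import numpy as np

def maliciousness_scores(features: np.ndarray, k: int = 1) -> np.ndarray:
    """Score each row by its weighted projection onto the top-k singular directions."""
    # ASSUMPTION: center the features before the decomposition.
    centered = features - features.mean(axis=0, keepdims=True)
    # Economy SVD: s holds singular values, rows of vt are right singular vectors v_j.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:k].T  # shape (n, k), entries <f_i, v_j>
    # ASSUMPTION: squared projections weighted by singular values, averaged over k.
    return (s[:k] * proj**2).mean(axis=1)

rng = np.random.default_rng(0)
dim = 16

# ASSUMPTION: synthetic stand-ins for last-token embeddings of unlabeled prompts.
benign = rng.normal(0.0, 1.0, size=(95, dim))
shift = np.zeros(dim)
shift[0] = 12.0  # the minority is displaced along one direction
malicious = rng.normal(0.0, 1.0, size=(5, dim)) + shift
features = np.vstack([benign, malicious])  # rows 95..99 are the displaced minority

scores = maliciousness_scores(features, k=1)
top5 = np.argsort(scores)[-5:]
print(sorted(top5.tolist()))
```

In this toy setting the displaced minority dominates the top singular direction, so its rows receive the largest scores. Real embeddings are unlikely to separate this cleanly, which is the concern the reviewers raise about the method's theoretical grounding.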
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Imbalanced Data Classification Techniques
