WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, Neil Gong

TL;DR
WebSentinel is a novel two-step method that detects and localizes prompt injection attacks in webpages by analyzing segments of interest and their consistency with webpage content, significantly outperforming existing approaches.
Contribution
It introduces WebSentinel, a new approach that improves detection and localization of prompt injection attacks in web agents, addressing limitations of prior methods.
Findings
WebSentinel outperforms baseline methods on multiple datasets.
The two-step approach effectively identifies contaminated webpage segments.
WebSentinel demonstrates high accuracy in detecting prompt injection attacks.
Abstract
Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness, as their underlying assumptions often do not hold in the web-agent setting. In this work, we propose WebSentinel, a two-step approach for detecting and localizing prompt injection attacks in webpages. Given a webpage, Step I extracts \emph{segments of interest} that may be contaminated, and Step II evaluates each segment by checking its consistency with the webpage content as context. We show that WebSentinel is highly effective, substantially outperforming baseline methods across multiple datasets of both contaminated and clean webpages that we collected. Our code is available at: https://github.com/wxl-lxw/WebSentinel.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsWeb Application Security Vulnerabilities · Spam and Phishing Detection · Network Security and Intrusion Detection
