WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

Guruprasad Viswanathan Ramesh; Asmit Nayak; Basieem Siddique; Kassem Fawaz

arXiv:2604.06367 · cs.CR · April 9, 2026

TL;DR

WebSP-Eval introduces a new framework for evaluating web agents on website security and privacy tasks, highlighting current models' limitations in handling UI elements like toggles and checkboxes.

Contribution

The paper presents a comprehensive evaluation framework, including a dataset, agent management system, and automated evaluator, for assessing web agents on security and privacy tasks.

Findings

1. Current models have limited exploration capabilities for security tasks.

2. Models struggle with specific websites and task categories.

3. UI elements like toggles and checkboxes cause a failure rate of over 45%.

Abstract

Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), no existing framework assesses an agent's ability to successfully execute user-facing website security and privacy tasks, such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions. To address this gap, we introduce WebSP-Eval, an evaluation framework for measuring web agent performance on website security and privacy tasks. WebSP-Eval comprises 1) a manually crafted task dataset of 200 task instances across 28 websites; 2) a robust agentic system supporting account and initial state management across runs using a custom Google Chrome extension; and 3) an automated evaluator. We…
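The abstract names a task dataset and an automated evaluator but does not say how either is represented. As a minimal sketch only, assuming a task instance can be reduced to a set of expected setting values and that the final page state is available as a dictionary, a state-based success check and a failure-rate summary might look like the following. The names `TaskInstance`, `is_success`, and `failure_rate`, along with the example website and setting keys, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical record for one security/privacy task (not the paper's schema)."""
    website: str
    instruction: str
    expected_state: dict  # setting name -> value the setting should end up with


def is_success(task: TaskInstance, final_state: dict) -> bool:
    """Return True only if every expected setting is present with the expected value."""
    return all(
        key in final_state and final_state[key] == value
        for key, value in task.expected_state.items()
    )


def failure_rate(outcomes: list) -> float:
    """Fraction of runs that failed; outcomes is a list of booleans (True = success)."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return 1 - sum(outcomes) / len(outcomes)


# Illustrative data only: a made-up site and made-up toggle names.
task = TaskInstance(
    website="example.com",
    instruction="Turn off ad personalization and keep essential cookies on.",
    expected_state={"ad_personalization": False, "essential_cookies": True},
)

run_ok = is_success(task, {"ad_personalization": False, "essential_cookies": True})
run_bad = is_success(task, {"ad_personalization": True, "essential_cookies": True})
rate = failure_rate([run_ok, run_bad, False, True])
```

Comparing final state rather than the agent's click sequence is one plausible design here, since a toggle or checkbox can be reached by many paths and only its resulting value matters to the user.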
