Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models
Steven Au, Sujit Noronha

TL;DR
This paper introduces PPT-Bench, a benchmark that evaluates how large language models respond to epistemic attacks framed as philosophical pressure. It exposes model-specific weaknesses and tests mitigation strategies.
Contribution
It presents a novel diagnostic benchmark based on a philosophical taxonomy to evaluate epistemic failures and defenses in large language models.
Findings
Different pressure types produce distinct inconsistency patterns.
Model weaknesses vary significantly across pressure types.
Some mitigation strategies, such as prompt anchoring, are more effective in API settings.
Abstract
Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce PPT-Bench, a diagnostic benchmark for evaluating epistemic attack, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic…
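The structure described in the abstract (four pressure types, three layers per item) can be pictured as a small data model. The sketch below is illustrative only: the `ItemResult` record, the `shift_rate` function, and the demo answers are hypothetical names and values introduced here, and the tally is a simple stand-in rather than the metric the paper defines.

```python
from dataclasses import dataclass
from enum import Enum


class PressureType(Enum):
    # The four pressure types named in the abstract.
    EPISTEMIC_DESTABILIZATION = "Epistemic Destabilization"
    VALUE_NULLIFICATION = "Value Nullification"
    AUTHORITY_INVERSION = "Authority Inversion"
    IDENTITY_DISSOLUTION = "Identity Dissolution"


class Layer(Enum):
    # The three test layers named in the abstract.
    L0 = "baseline prompt"
    L1 = "single-turn pressure"
    L2 = "multi-turn Socratic escalation"


@dataclass
class ItemResult:
    # Hypothetical record: one model's answers to one item at each layer.
    item_id: str
    pressure: PressureType
    answers: dict  # maps Layer -> answer string


def shift_rate(results, layer):
    """Fraction of items whose answer at `layer` differs from the L0 answer.

    Assumption: an illustrative tally, not the paper's own measure.
    """
    if not results:
        return 0.0
    changed = sum(1 for r in results if r.answers[layer] != r.answers[Layer.L0])
    return changed / len(results)


# Made-up demo data, not results from the paper.
demo = [
    ItemResult("q1", PressureType.AUTHORITY_INVERSION,
               {Layer.L0: "A", Layer.L1: "A", Layer.L2: "B"}),
    ItemResult("q2", PressureType.VALUE_NULLIFICATION,
               {Layer.L0: "C", Layer.L1: "D", Layer.L2: "D"}),
]
print(shift_rate(demo, Layer.L1))  # 0.5
print(shift_rate(demo, Layer.L2))  # 1.0
```

Comparing each pressured layer against the L0 baseline keeps the tally per item, so results could also be grouped by `pressure` to look at one pressure type at a time.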
