Evaluating Language Model Reasoning about Confidential Information

Dylan Sam; Alexander Robey; Andy Zou; Matt Fredrikson; J. Zico Kolter

arXiv:2508.19980·cs.LG·August 28, 2025

Evaluating Language Model Reasoning about Confidential Information

Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter

PDF

1 Datasets

TL;DR

This paper introduces PasswordEval, a benchmark to assess whether language models can correctly identify authorized requests involving confidential information, revealing current models' struggles and potential safety risks in high-stakes applications.

Contribution

The paper develops a new benchmark, PasswordEval, to evaluate language models' ability to handle confidential info and demonstrates their limitations in reasoning and safety in high-stakes contexts.

Findings

01

Models struggle with password verification tasks.

02

Reasoning traces often leak confidential information.

03

Performance does not improve with increased reasoning complexity.

Abstract

As language models are increasingly deployed as autonomous agents in high-stakes settings, ensuring that they reliably follow user-defined rules has become a critical safety concern. To this end, we study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. For this analysis, we develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized (i.e., with a correct password). We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance. In fact, we find that reasoning traces frequently leak confidential information, which calls into question whether reasoning traces should be exposed to users in such applications. We also scale…

Figures8

Click any figure to enlarge with its caption.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

haritzpuerto/password_eval-contextual-integrity
dataset· 5 dl
5 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.