RedacBench: Can AI Erase Your Secrets?
Hyunjun Jeon, Kyuyoung Kim, Jinwoo Shin

TL;DR
RedacBench is a new comprehensive benchmark for evaluating policy-conditioned redaction in language models, measuring their ability to remove sensitive information while preserving original semantics across diverse domains.
Contribution
We introduce RedacBench, a large-scale benchmark with annotated data and policies for assessing redaction performance in language models, addressing limitations of existing benchmarks.
Findings
More advanced models improve security by removing sensitive information more effectively.
Preserving utility remains a significant challenge.
RedacBench enables nuanced evaluation of redaction strategies.
Abstract
Modern language models can readily extract sensitive information from unstructured text, making redaction -- the selective removal of such information -- critical for data security. However, existing benchmarks for redaction typically focus on predefined categories of data such as personally identifiable information (PII) or evaluate specific techniques like masking. To address this limitation, we introduce RedacBench, a comprehensive benchmark for evaluating policy-conditioned redaction across domains and strategies. Constructed from 514 human-authored texts spanning individual, corporate, and government sources, paired with 187 security policies, RedacBench measures a model's ability to selectively remove policy-violating information while preserving the original semantics. We quantify performance using 8,053 annotated propositions that capture all inferable information in each text.…
Peer Reviews
Decision: ICLR 2026 Poster
- The paper introduces a fine-grained, proposition-level analysis that captures semantic inferability of information, enabling more rigorous and interpretable quantification of redaction effectiveness than surface-level token or entity matching.
- Defines complementary metrics, Security Score (true negative rate for sensitive information) and Utility Score (true positive rate for non-sensitive information), allowing quantitative assessment of both privacy protection and information preservatio
- The benchmark evaluates models in a controlled, static setting; it does not test interactive or context-evolving scenarios where redaction systems must operate dynamically (e.g., during live conversations or document editing).
- Although the paper positions redaction as a privacy defense, it does not directly compare or align its evaluation metrics with other privacy frameworks such as unlearning or membership inference resistance.
- The current benchmark quantifies empirical removal, not priva
- The work moves beyond simple PII detection, formulating a more realistic task of context-sensitive redaction based on specific security policies. The inclusion of 187 multi-layered policies, spanning from granular details to high-level abstract concepts, aligns well with practical requirements.
- The Security Score and Utility Score offer a clear, quantitative, and interpretable method for measuring the inherent trade-off in redaction, which is effectively visualized in the results. The author
- The evaluation framework relies on GPT-4.1-mini as an automated judge, which introduces a risk of "recall" from pre-training data contamination. Even if information is successfully redacted, the evaluator might incorrectly assess it as "preserved" (a false positive), thus artificially deflating the Security Score. The validation only checks the false negative rate and fails to assess the more critical false positive rate, posing a substantial threat to the validity of the reported scores.
- Th
- Very important: The benchmark for redaction methods appears to be reasonably constructed, and I believe it will be useful for the privacy community.
- Important: The shift toward inference-based measures of privacy is very valuable. Using carefully validated LLM graders seems like the right approach for this.
- Important: A severe tradeoff between utility and security of current SOTA models/methods suggests that there is significant room for improvement of redaction methods on the benchmark.
- Important: I’m struggling to get a sense of what optimal performance might look like on the dataset. Do you have examples that can be perfectly redacted by a human? It would help to know what the ceiling performance is on the benchmark and how far models are from it. Otherwise it may not be clear to the community how long to work on this benchmark, when it is saturated, etc.
- Important: I couldn’t see any detail as to who the humans were in the human-in-the-loop data construction p
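The two metrics mentioned in the reviews above, Security Score (true negative rate over sensitive propositions) and Utility Score (true positive rate over non-sensitive propositions), can be illustrated with a minimal sketch. This is not the authors' implementation: the `Proposition` class, its field names, and the assumption that a judge returns a boolean "still inferable" verdict per proposition are all hypothetical, chosen only to make the definitions concrete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proposition:
    """Hypothetical record for one annotated proposition (names are assumptions)."""
    text: str
    sensitive: bool        # annotated as violating the policy
    inferable_after: bool  # judge verdict: still inferable from the redacted text


def security_score(props: List[Proposition]) -> Optional[float]:
    """Fraction of sensitive propositions that are no longer inferable."""
    sensitive = [p for p in props if p.sensitive]
    if not sensitive:
        return None  # undefined when there is nothing to redact
    return sum(not p.inferable_after for p in sensitive) / len(sensitive)


def utility_score(props: List[Proposition]) -> Optional[float]:
    """Fraction of non-sensitive propositions that remain inferable."""
    benign = [p for p in props if not p.sensitive]
    if not benign:
        return None  # undefined when every proposition is sensitive
    return sum(p.inferable_after for p in benign) / len(benign)


# Toy example with made-up propositions (not taken from the benchmark).
example = [
    Proposition("The patient's name is given.", sensitive=True, inferable_after=False),
    Proposition("The patient's diagnosis is given.", sensitive=True, inferable_after=True),
    Proposition("The visit took place at a clinic.", sensitive=False, inferable_after=True),
    Proposition("A follow-up was scheduled.", sensitive=False, inferable_after=True),
    Proposition("The clinic is open on weekdays.", sensitive=False, inferable_after=True),
    Proposition("The appointment lasted an hour.", sensitive=False, inferable_after=False),
]

print(security_score(example))  # 0.5  (1 of 2 sensitive propositions removed)
print(utility_score(example))   # 0.75 (3 of 4 non-sensitive propositions kept)
```

In this toy case, deleting everything would push the Security Score to 1.0 while dropping the Utility Score to 0.0, which is the trade-off the reviewers describe.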
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Misinformation and Its Impacts
