Discovering Forbidden Topics in Language Models

Can Rager; Chris Wendler; Rohit Gandikota; David Bau

arXiv:2505.17441·cs.CL·June 12, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces the Iterated Prefill Crawler (IPC), a method for discovering the topics a language model refuses to discuss. Applied to several models, it reveals bias and censorship patterns, and the authors argue that refusal discovery matters for AI safety.

Contribution

The paper presents the IPC method for identifying forbidden topics in language models and benchmarks it across multiple models, including open-source and proprietary ones, uncovering censorship and bias patterns.

Findings

01. IPC successfully retrieves most forbidden topics within limited prompts.

02. Models exhibit censorship behaviors aligned with political agendas.

03. Refusal discovery reveals biases and safety issues in language models.

Abstract

Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, Iterated Prefill Crawler (IPC), that uses token prefilling to find forbidden topics. We benchmark IPC on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1000 prompts. Next, we scale the crawler to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning: DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: The model exhibits "thought suppression" behavior that indicates memorization of CCP-aligned responses. Although…
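The crawl loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prefill string, the prompt template, the `generate(user_prompt, prefill)` callable, and the `toy_model` stub are all assumptions made for this example. A real run would replace the stub with a model call that starts the assistant turn with the prefill text.

```python
from collections import deque
from typing import Callable, Iterable, Set, Tuple

# Hypothetical prefill text; the paper's actual prompts are not given in this summary.
PREFILL = "I must avoid discussing topics such as:"


def crawl(generate: Callable[[str, str], str],
          seeds: Iterable[str],
          budget: int) -> Tuple[Set[str], int]:
    """Breadth-first crawl: every topic the model lists becomes a new seed."""
    queue = deque(s.lower() for s in seeds)
    seen = set(queue)          # seeds plus everything discovered so far
    found: Set[str] = set()    # discovered topics only
    prompts_used = 0
    while queue and prompts_used < budget:
        seed = queue.popleft()
        completion = generate(f"Tell me about {seed}.", PREFILL)
        prompts_used += 1
        for line in completion.splitlines():
            topic = line.strip("-• \t").lower()
            if topic and topic not in seen:
                seen.add(topic)
                found.add(topic)
                queue.append(topic)
    return found, prompts_used


def recall(found: Set[str], ground_truth: Set[str]) -> float:
    """Fraction of known forbidden topics that the crawl recovered."""
    return len(found & ground_truth) / len(ground_truth)


def toy_model(user_prompt: str, prefill: str) -> str:
    """Stand-in for a real model: returns a bulleted list of related topics."""
    related = {
        "weapons": ["explosives", "self-harm"],
        "explosives": ["weapons"],
        "self-harm": ["medical advice"],
    }
    for key, topics in related.items():
        if key in user_prompt.lower():
            return "\n".join(f"- {t}" for t in topics)
    return ""


if __name__ == "__main__":
    topics, used = crawl(toy_model, ["weapons"], budget=1000)
    print(sorted(topics), used)
```

With a ground-truth list available, as for Tulu-3-8B, `recall` gives the headline number: 31 of 36 topics is a recall of about 0.86.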

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The methodology is practical, requiring only API access with prefilling capabilities rather than model weights or gradient access. The benchmarking on Tulu-3-8B with known ground truth establishes credibility before scaling to proprietary models. The discovery of thought suppression as a censorship mechanism in DeepSeek-R1 is both technically interesting and concerning from a transparency perspective. The finding that quantization can reintroduce censorship in supposedly "decensored" models i

Weaknesses

The paper relies on prefilling to discover and validate refusals. It may therefore miss topics that require different jailbreaking techniques, or exhibit confirmation bias. The baseline comparison is weak. The authors acknowledge that DeepSeek-R1 and Tulu-3 differ fundamentally in architecture, training objectives, and data, but it is not clear to me how these differences in data translate into what looks like differences in refusal. Take the examples given: if no literature on Tiananmen Squa

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper introduces an important and novel problem formulation and proposes a sensible method to address it. Specifically, it introduces the unsupervised curation of the topics an LLM refuses to discuss. It is well motivated and well positioned, especially insofar as it argues the need for methods beyond supervised dataset testing. The authors suggest a novel solution via prompting/prefilling CoT. Additionally, they propose a ‘crawling’ method that takes already found ‘forbidden topics’ and uses t

Weaknesses

- Missing sensible baselines and ablations. The prompt efficiency of the baseline method is far better than that of the IPC method, suggesting that some improvement in prompt engineering/context could close the gap. For example, incorporating adaptivity into the baseline (e.g., including the current list of forbidden terms in the prompt) may improve things. This would also serve as an ablation that tests the importance of the different elements of the proposed method, e.g., seeding vs. prefilling.
- Lack of clarit

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The problem of refusal discovery is novel, interesting, and highly relevant, particularly given that even open-weight models often do not share details of their training data.
- The paper takes thoughtful steps to evaluate a method that is inherently difficult to assess deeply. The experiments on Tulu-3-8B, where ground truth forbidden topics are known, provide valuable validation.
- The fact that a prompting baseline could not discover the CCP-related topics in R1-70B, while the prefill-based

Weaknesses

- The presentation of results on deployed models feels underdeveloped. Table 2 is the only table showing results from crawling popular models, and I wish there were more detail and analysis in this direction. For example:
  - A key advantage of the IPC method is that the prompting baseline failed to find CCP topics in R1-70B, but this finding is currently buried as a single sentence in the Tulu-8B results section.
  - The quantization results on decensored R1 are fascinating and arguably the mo

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training