A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens
David Dobre, Mehrnaz Mofakhami, Sophie Xhonneux, Leo Schwinn, Gauthier Gidel

TL;DR
This paper introduces a novel safety mitigation method for large language models by training them to insert a red flag token when harmful content is detected, enabling explicit harmfulness recognition with minimal utility loss.
Contribution
The authors propose augmenting LLMs with a red flag token to explicitly represent harmfulness, leveraging in-context learning for safety without degrading task performance.
Findings
The model can learn to insert red flag tokens to indicate harmful content.
In-context learning enables the model to initiate reflective reasoning upon flagging.
The method is orthogonal to existing safety techniques and easier to evaluate.
Abstract
Many safety post-training methods for large language models (LLMs) are designed to modify the model's behaviour from producing unsafe answers to issuing refusals. However, such distribution shifts are often brittle and degrade performance on desirable tasks. To address these pitfalls, we propose augmenting the model's vocabulary with a special red flag token, and training the model to insert this token whenever harmful content is generated or imminent. This approach enables the model to explicitly learn the concept of harmfulness in its representations, with minimal impact on utility due to the marginal change in the generated distribution of natural language. Moreover, because the token is embedded in the model's vocabulary, we can naturally leverage the LLMs' generalization capabilities, such as in-context learning (ICL) and out-of-distribution generalization to languages that are not…
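The token-insertion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the surface form `<rf>`, the helper names `insert_red_flag` and `truncate_at_flag`, and the choice of a uniformly random insertion position are all assumptions made for illustration, and tokens are modelled as plain strings rather than vocabulary ids.

```python
import random

RF_TOKEN = "<rf>"  # assumed surface form of the red flag token


def insert_red_flag(prompt_tokens, harmful_tokens, rng):
    """Build one illustrative training sequence.

    The red flag token is placed at a random position i inside the
    harmful continuation (0 <= i <= len(harmful_tokens)), so the flag
    can appear anywhere in the response, not only at its start.
    Returns the full sequence and the index of the flag within it.
    """
    i = rng.randint(0, len(harmful_tokens))  # randint is inclusive on both ends
    flagged = harmful_tokens[:i] + [RF_TOKEN] + harmful_tokens[i:]
    return prompt_tokens + flagged, len(prompt_tokens) + i


def truncate_at_flag(generated_tokens):
    """Hypothetical inference-side use: cut the output at the first flag.

    Returns the tokens before the flag and whether a flag was found.
    """
    if RF_TOKEN in generated_tokens:
        idx = generated_tokens.index(RF_TOKEN)
        return generated_tokens[:idx], True
    return generated_tokens, False


if __name__ == "__main__":
    rng = random.Random(0)
    seq, pos = insert_red_flag(["How", "do", "I"], ["step", "one", "is"], rng)
    print(seq, pos)
    print(truncate_at_flag(seq))
```

Because the flag is an ordinary vocabulary item in this sketch, detecting it is a simple membership check on the generated tokens. What happens after detection (halting, filtering, or prompting further reasoning) is a separate design choice that the sketch leaves open.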
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper is (mostly - see [Note] in Weaknesses) well written and easy to understand
- Mitigating harmfulness in LLMs is a relevant topic
- The proposed method is well grounded and appears to achieve robust safety performance with a marginal compromise on model utility
- Despite the weaknesses listed below, the empirical evaluation of the proposed method is acceptable - but could be better
- No content provided regarding limitations or future work; see questions for examples of potential limitations
- Adding a comparison with defense methods from other categories (e.g., self-reflect and controlled text generation - example of a popular approach [1]) would make the paper's contribution more compelling
- The performance of one of the benchmarks (CAT) appears to be very comparable, if not better in some cases, to the proposed method
- Increasing the font size on some of the plots wou
Their method is much more robust to prefill attacks than Fixed Pos RF, because that method can only output the <rf> token at the beginning. They show their technique works much better than standard refusals in low-resource languages, because a model trained in another language can generate the same <rf> token no matter the language. I like using the <rf> token to encourage the Safety CoT, since it’s novel as far as I know, and it allows the model to recover from a false positive <rf> token
The novelty of their core technique is relatively low, since the method is the same as previous work, except that the model is able to output the <rf> token anywhere instead of only at the beginning. It’s hard to tell if the technique has any benefit over CAT because their results are only a single run without error bars, and their results are similar to CAT. Clarity: I find the loss in equation 1 confusing. It looks like it chooses a random i, and then the loss encourages the model to output <rf> f
A novel approach to detect harmful generations without causing a significant distribution shift. This idea itself may be used for different purposes.
Preventing a distribution shift is a reasonable idea. However, for this purpose, one can simply use filtering approaches that pre-process inputs or post-process outputs. With a filtering approach, one does not need to post-train the LLM itself, leading to no change of distribution. The current defense mechanism is actually similar to output filtering techniques. A very naive baseline approach would train a classifier for harmfulness by using the dataset prepared for the proposed approach traini
Taxonomy
Topics: Digital Rights Management and Security
