Backtracking Improves Generation Safety

Yiming Zhang; Jianfeng Chi; Hailey Nguyen; Kartikeya Upasani; Daniel M. Bikel; Jason Weston; Eric Michael Smith

arXiv:2409.14586·cs.LG·September 24, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces a backtracking method using a [RESET] token that enables language models to undo unsafe generations, significantly improving safety without sacrificing helpfulness, and resisting adversarial attacks.

Contribution

The paper proposes a novel backtracking technique with a [RESET] token for language models to recover from unsafe outputs, enhancing safety and robustness.

Findings

01

Backtracking Llama-3-8B is four times safer than baseline.

02

Models trained with backtracking maintain helpfulness.

03

Backtracking provides protection against adversarial attacks.

Abstract

Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to "undo" and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show…
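The "undo" step described above can be illustrated with a minimal sketch. It assumes that once the model emits [RESET], everything generated before it is discarded and only the text after it is shown to the user, and that at most one reset occurs per generation (the single-reset assumption is mentioned in the reviews). The function name `apply_backtracking` and the token-list representation are hypothetical, chosen for illustration; this is not the authors' implementation.

```python
RESET = "[RESET]"  # special token named in the paper; its string form here is illustrative


def apply_backtracking(tokens, reset_token=RESET):
    """Return the tokens a user would see after backtracking.

    Assumption (not taken from the source text): tokens generated before
    the first reset token are dropped, and the tokens after it form the
    final response. Without a reset token, the draft is returned unchanged.
    """
    if reset_token not in tokens:
        return list(tokens)
    reset_index = tokens.index(reset_token)  # first (assumed only) reset
    return list(tokens[reset_index + 1:])


# Hypothetical generation: a partial unsafe draft, a reset, then a safe reply.
draft = ["Sure", ",", "here", "is", "how", RESET,
         "I", "can't", "help", "with", "that", "."]
final = apply_backtracking(draft)
print(" ".join(final))
```

In a streaming setting the discarded prefix may already have been displayed, which is the user-interface discrepancy raised in the first review below.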

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 2

Strengths

- Reflection-based Defense Yields More Flexibility: The proposed method has significant flexibility compared to other methods. With proper prompting, unlike traditional filter-based adversarial attack defense, it is easy to use a safe backtracking model with updatable preference and knowledge, depending on the user's demand.
- Straightforward Intuition and Easy Maintenance: The internal mechanism of the proposed backtracking defense is comparatively very transparent. When unexpected defensive a

Weaknesses

- Discrepancy between the Streaming User Interface and Model Behavior: When undoing some of the generation outputs, the system should delete the undone harmful trigger prompts and update it with the model's remedy context afterwards. This could cause some unexpected discrepancies and difficulties in general designs of the pipeline.
- Risks in Reprogrammable Backtracking: Since we now rely on the model's own reflection to backtrack the harmful trigger prompts, it is possible, if the model itself

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The strengths of the paper are as follows:
- The proposed method is simple and effective. Moreover, the inference procedure is simple and does not add a lot of overhead during inference.
- The authors show that when backtracking is used it is exponentially harder to get unsafe generations, even when temperature sampling is used to generate and pick the worst of $k$ responses.
- Backtracking improves robustness against a few adversarial attacks without any special adversarial training or oth

Weaknesses

Here are a few weaknesses:
- There is no analysis of how the proposed backtracking method compares with or complements existing methods for reducing unsafe generations. Is the proposed method better? Are there complementary benefits?
- Appendix B2 shows that on the OpenAssistant dataset the [RESET] token is generated around 12-13% of the time with no logit bias. In all these cases are there unsafe generations? If so, when the logit bias is set to -5.0, then the [RESET] token is only

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The method is simple yet effective. The authors also included ablations on each step from baseline SFT to full backtracking SFT + DPO. The evaluation was done on multiple datasets and 2 SOTA base models; the authors also included detailed analyses of results and example generations. I appreciated that the details of the training setup and hyperparameter optimization were included in the appendix. Finally, the tradeoffs and limitations of assuming a single reset per generation were discussed.

Weaknesses

No major weaknesses, although given that there is a large difference in relative drop in violation rates between the 2 models tested, it would be useful to show how this method affects larger model sizes and other base models beyond Gemma and Llama (e.g., Qwen, Mistral).

Taxonomy

Topics: Fire Detection and Safety Systems

Methods: Direct Preference Optimization · Supervised Fine-Tuning