TL;DR
This paper introduces RLBF, a reinforcement learning framework that enables large language models to self-correct safety violations through backtracking, improving robustness against adversarial attacks.
Contribution
It proposes a novel RL-based method with critic feedback and an enhanced data generation strategy to improve safety and robustness in LLMs.
Findings
RLBF significantly reduces attack success rates across benchmarks.
The method preserves core model utility while enhancing safety.
Backtracking enables models to recover from safety violations effectively.
Abstract
Addressing the critical need for robust safety in Large Language Models (LLMs), particularly against adversarial attacks and in-distribution errors, we introduce Reinforcement Learning with Backtracking Feedback (RLBF). This framework advances upon prior methods, such as BSAFE, by primarily leveraging a Reinforcement Learning (RL) stage where models learn to dynamically correct their own generation errors. Through RL with critic feedback on the model's live outputs, LLMs are trained to identify and recover from their actual, emergent safety violations by emitting an efficient "backtrack by x tokens" signal, then continuing generation autoregressively. This RL process is crucial for instilling resilience against sophisticated adversarial strategies, including middle filling, Greedy Coordinate Gradient (GCG) attacks, and decoding parameter manipulations. To further support the acquisition…
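The "backtrack by x tokens" mechanism described in the abstract can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the `("BACKTRACK", x)` tuple encoding, the `"<eos>"` marker, and the names `decode_with_backtracking` and `scripted_model` are all hypothetical, and a scripted token stream stands in for a real LLM. It shows the control flow only: when the signal appears, the last x tokens are discarded and generation resumes from the shortened prefix.

```python
from typing import Callable, List, Tuple, Union

# Hypothetical encoding (an assumption, not taken from the paper):
# a backtrack signal is the tuple ("BACKTRACK", x); ordinary tokens are strings.
Token = Union[str, Tuple[str, int]]


def decode_with_backtracking(
    next_token: Callable[[List[str]], Token],
    max_steps: int = 50,
) -> List[str]:
    """Autoregressive loop that honors 'backtrack by x tokens' signals."""
    output: List[str] = []
    for _ in range(max_steps):
        tok = next_token(output)
        if tok == "<eos>":
            break
        if isinstance(tok, tuple) and tok[0] == "BACKTRACK":
            # Clamp x so we never remove more tokens than exist.
            x = max(0, min(tok[1], len(output)))
            # Drop the last x tokens; generation continues from the prefix.
            del output[len(output) - x:]
            continue
        output.append(tok)
    return output


def scripted_model(script: List[Token]) -> Callable[[List[str]], Token]:
    """Stand-in for an LLM: replays a fixed token stream, ignoring context."""
    stream = iter(script)

    def next_token(_context: List[str]) -> Token:
        return next(stream, "<eos>")

    return next_token


if __name__ == "__main__":
    script: List[Token] = [
        "Sure", ",", "here", "is",   # start of an unsafe completion
        ("BACKTRACK", 4),            # model flags its own violation
        "I", "can't", "help", "with", "that", "<eos>",
    ]
    print(decode_with_backtracking(scripted_model(script)))
```

In this sketch the signal is a single emitted item rather than a full regeneration, which is one way to read the abstract's description of the signal as "efficient". How the actual model encodes x and how the critic rewards backtracking are not specified in this summary.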