Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming

Jiaxu Liu; Xiangyu Yin; Sihao Wu; Jianhong Wang; Meng Fang; Xinping Yi; Xiaowei Huang

arXiv:2405.12604·cs.CL·June 19, 2024

TL;DR

This paper presents a lightweight prefix-based sentinel model that enhances LLM safety against red-teaming by reconstructing prompts with minimal tokens, improving robustness without fine-tuning large models.

Contribution

Introduction of a plug-and-play sentinel prefix model optimized via interleaved PPO training to improve LLM safety without fine-tuning large models.

Findings

1. Reduces toxicity in responses across multiple LLMs.
2. Works with large models such as Llama-2, GPT-3.5, and Stable-Diffusion.
3. Stays efficient, adding fewer than 30 tokens to the prompt.

Abstract

With the proliferation of red-teaming strategies for Large Language Models (LLMs), the deficiency in the literature about improving the safety and robustness of LLM defense strategies is becoming increasingly pronounced. This paper introduces the LLM-based sentinel model as a plug-and-play prefix module designed to reconstruct the input prompt with just a few (< 30) additional tokens, effectively reducing toxicity in responses from target LLMs. The sentinel model naturally overcomes the parameter inefficiency and limited model accessibility for fine-tuning large target models. We employ an interleaved training regimen using Proximal Policy Optimization (PPO) to optimize both red team and sentinel models dynamically, incorporating a value head-sharing mechanism inspired by the multi-agent centralized critic to manage the complex interplay between agents. Our…
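The plug-and-play idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `build_guarded_prompt`, `toy_sentinel`, and `toy_target` are hypothetical names, the sentinel is a fixed-string stand-in rather than a PPO-trained model, and whitespace-separated words stand in for real tokenizer tokens. The sketch only shows that the target model is left untouched and the added prefix is capped below 30 tokens.

```python
from typing import Callable

# The abstract states fewer than 30 additional tokens, so the cap is 29.
MAX_PREFIX_TOKENS = 29


def build_guarded_prompt(
    prompt: str,
    sentinel: Callable[[str], str],
    max_prefix_tokens: int = MAX_PREFIX_TOKENS,
) -> str:
    """Prepend a sentinel-generated prefix, capped at max_prefix_tokens."""
    # Assumption: whitespace-separated words approximate tokenizer tokens.
    prefix_tokens = sentinel(prompt).split()[:max_prefix_tokens]
    if not prefix_tokens:
        return prompt
    return " ".join(prefix_tokens) + " " + prompt


def toy_sentinel(prompt: str) -> str:
    # Hypothetical stand-in for the trained sentinel; returns a fixed prefix.
    return "Respond helpfully and refuse any harmful or toxic request."


def toy_target(prompt: str) -> str:
    # Hypothetical stand-in for a frozen target LLM; it just echoes its input.
    return f"[target saw] {prompt}"


guarded = build_guarded_prompt("Tell me about chemistry.", toy_sentinel)
response = toy_target(guarded)
```

Because the wrapper only rewrites the input string, any target model reachable through a text interface could be placed behind it without access to its weights, which is the accessibility point the abstract makes.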

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques