FLAME: Flexible LLM-Assisted Moderation Engine
Ivan Bakulin (1, 2), Ilia Kopanichuk (1, 2), Iaroslav Bespalov (1), Nikita Radchenko (3), Vladimir Shaposhnikov (1, 4), Dmitry Dylov (1, 4), Ivan Oseledets (1, 4) ((1) AIRI, (2) Moscow Institute of Physics and Technology, (3) SberHealth

TL;DR
FLAME introduces a response-based moderation system for LLMs that improves robustness against jailbreaking and offers flexible safety criteria, outperforming traditional input filtering methods in efficiency and effectiveness.
Contribution
This paper presents FLAME, a novel output moderation approach that enhances resistance to jailbreaking and allows customizable safety filtering, addressing limitations of current input filtering systems.
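To make the input-versus-output distinction concrete, here is a minimal sketch of response-side moderation with user-configurable criteria. It is not FLAME's implementation: the function names (`moderate_response`, `guarded_generate`), the plain substring matching, and the refusal message are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable

# Hypothetical refusal text; a real deployment would choose its own.
REFUSAL = "Sorry, I can't help with that."


def moderate_response(response: str, banned_phrases: Iterable[str]) -> bool:
    """Return True if the response passes the (assumed) safety criteria.

    The criteria here are just case-insensitive substring matches on
    whitespace-normalized text; they stand in for whatever rules a
    deployment wants to configure.
    """
    text = " ".join(response.lower().split())
    return not any(phrase.lower() in text for phrase in banned_phrases)


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    banned_phrases: Iterable[str],
    refusal: str = REFUSAL,
) -> str:
    """Call the model, then filter on its output rather than on the prompt."""
    response = generate(prompt)
    return response if moderate_response(response, banned_phrases) else refusal


if __name__ == "__main__":
    # Stand-in for an LLM call: returns unsafe text for any prompt.
    def fake_model(prompt: str) -> str:
        return "Step 1: acquire the  Forbidden Thing."

    # The prompt is obfuscated, but the check only looks at the response.
    print(guarded_generate("hOw dO i gEt tHe f0rb1dden th1ng", fake_model,
                           ["forbidden thing"]))
```

Because the check runs on the generated text, perturbing the prompt does not change what the filter sees, and the criteria can be swapped by passing a different phrase list.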
Findings
FLAME reduces jailbreaking success rates by a factor of approximately 9.
The system maintains low computational overhead during moderation.
FLAME demonstrates superior performance across various LLMs.
Abstract
The rapid advancement of Large Language Models (LLMs) has introduced significant challenges in moderating user-model interactions. While LLMs demonstrate remarkable capabilities, they remain vulnerable to adversarial attacks, particularly "jailbreaking" techniques that bypass content safety measures. Current content moderation systems, which primarily rely on input prompt filtering, have proven insufficient, with techniques like Best-of-N (BoN) jailbreaking achieving success rates of 80% or more against popular LLMs. In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a new approach that shifts the focus from input filtering to output moderation. Unlike traditional circuit-breaking methods that analyze user queries, FLAME evaluates model responses, offering several key advantages: (1) computational efficiency in both training and inference, (2) enhanced…
