LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"

Som Sagar; Aditya Taparia; Ransalu Senanayake

arXiv:2410.16738·cs.LG·October 23, 2024

Open Access

TL;DR

This paper extends the "Failures are fated, but can be faded" framework for exploring and mitigating failure modes in large generative models, specifically diffusion models, using reinforcement learning, LLM-based rewards, and limited human feedback to improve model robustness.

Contribution

It introduces a novel combination of reinforcement learning algorithms, screening tests, and LLM-based rewards to better characterize and restructure failure landscapes in diffusion models.

Findings

01 Effective identification of failure modes in diffusion models

02 Demonstrated ability to restructure failure landscapes

03 Analyzed strengths and weaknesses of each algorithm

Abstract

In large deep neural networks that seem to perform surprisingly well on many tasks, we also observe a few failures related to accuracy, social biases, and alignment with human values, among others. Therefore, before deploying these models, it is crucial to characterize this failure landscape for engineers to debug or audit models. Nevertheless, it is infeasible to exhaustively test for all possible combinations of factors that could lead to a model's failure. In this paper, we improve the "Failures are fated, but can be faded" framework (arXiv:2406.07145)--a post-hoc method to explore and construct the failure landscape in pre-trained generative models--with a variety of deep reinforcement learning algorithms, screening tests, and LLM-based rewards and state generation. With the aid of limited human feedback, we then demonstrate how to restructure the failure landscape to be more…
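To make the idea of exploring a failure landscape concrete, here is a minimal sketch under stated assumptions. It uses a tabular epsilon-greedy bandit, which is far simpler than the deep reinforcement learning algorithms the abstract refers to, and it is not the authors' method. The factor names, the `stub_failure_reward` function (a stand-in for an LLM-based reward), and the simulated failure probabilities are all hypothetical and chosen only for illustration.

```python
import itertools
import random

# Hypothetical prompt factors; the paper's actual state space is not given here.
FACTORS = {
    "subject": ["doctor", "nurse", "engineer"],
    "setting": ["office", "street"],
    "style": ["photo", "sketch"],
}
# Each arm is one combination of factor values, i.e. one point in the landscape.
ARMS = list(itertools.product(*FACTORS.values()))


def stub_failure_reward(arm, rng):
    """Stand-in for an LLM-based judge: 1.0 if a (simulated) failure occurs."""
    # Assumption for illustration only: one combination fails often, others rarely.
    p_fail = 0.9 if arm == ("nurse", "office", "photo") else 0.1
    return 1.0 if rng.random() < p_fail else 0.0


def explore_failures(steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy search that estimates a failure rate per combination."""
    rng = random.Random(seed)
    counts = {arm: 0 for arm in ARMS}
    values = {arm: 0.0 for arm in ARMS}
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.choice(ARMS)  # explore a random combination
        else:
            arm = max(ARMS, key=lambda a: values[a])  # exploit the worst-known one
        reward = stub_failure_reward(arm, rng)
        counts[arm] += 1
        # Incremental mean update of the estimated failure rate.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts


values, counts = explore_failures()
worst = max(values, key=values.get)
print("Most failure-prone combination found:", worst, round(values[worst], 2))
```

The `values` table plays the role of an estimated failure landscape: combinations with high estimated failure rates are the ones an engineer would inspect first. In a real setting the stub reward would be replaced by a call to the generative model followed by an automated or human judgment of its output.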

Taxonomy

Topics: Nuclear reactor physics and engineering

Methods: Diffusion