Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee

TL;DR
This paper introduces Human Feedback Inversion (HFI), a framework that uses human feedback to guide the removal of harmful or copyrighted content in text-to-image diffusion models, improving ethical safety.
Contribution
The paper presents a novel HFI framework that leverages human feedback to better align model outputs with human judgments and effectively mitigate problematic content.
Findings
Significantly reduces objectionable content generation
Preserves image quality while removing harmful concepts
Provides a strong baseline for concept removal in diffusion models
Abstract
This paper addresses the societal concerns arising from large-scale text-to-image diffusion models for generating potentially harmful or copyrighted content. Existing models rely heavily on internet-crawled data, wherein problematic concepts persist due to incomplete filtration processes. While previous approaches somewhat alleviate the issue, they often rely on text-specified concepts, introducing challenges in accurately capturing nuanced concepts and aligning model knowledge with human understandings. In response, we propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens guiding the mitigation or removal of problematic images. The proposed framework can be built upon existing techniques for the same purpose, enhancing their alignment with human judgment. By doing so, we simplify the training objective…
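The idea of condensing human feedback into a single learned token can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each generated image is already summarized by a small feature vector, that feedback is a binary "problematic or not" label, and that the "token" is one vector fitted by plain logistic regression. The names `fit_feedback_token`, `features`, `labels`, and `direction` are hypothetical and exist only for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_feedback_token(features, labels, steps=300, lr=0.5):
    """Fit one vector whose dot product with image features predicts feedback.

    Assumption: plain logistic regression trained by gradient descent,
    used here as a stand-in for learning a token embedding.
    """
    dim = len(features[0])
    n = len(features)
    token = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(features, labels):
            err = sigmoid(dot(token, x)) - y
            for j in range(dim):
                grad[j] += err * x[j] / n
        token = [t - lr * g for t, g in zip(token, grad)]
    return token


# Toy data: a hidden direction decides which images annotators flag.
random.seed(0)
direction = [1.0, -1.0, 0.5, 0.0]  # hypothetical "problematic" direction
features = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
labels = [1.0 if dot(x, direction) > 0 else 0.0 for x in features]

token = fit_feedback_token(features, labels)

# How well does the learned token reproduce the simulated feedback?
predictions = [1.0 if dot(token, x) > 0 else 0.0 for x in features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
cosine = dot(token, direction) / (
    math.sqrt(dot(token, token)) * math.sqrt(dot(direction, direction))
)
print(f"accuracy={accuracy:.2f}, cosine={cosine:.2f}")
```

In this sketch the learned vector plays the role of the feedback-derived token: it stands in for a concept that annotators can recognize in images but that is hard to name in text. How HFI actually learns its tokens, and how they are passed to an existing removal method, is described in the paper itself.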
Taxonomy
Topics: Nuclear reactor physics and engineering
Methods: Diffusion
