TL;DR
SafeEditor introduces a unified multi-round safety editing framework using a multi-modal large language model, significantly improving safety alignment in text-to-image models while maintaining utility and reducing over-refusal.
Contribution
The paper presents a novel post-hoc safety editing paradigm and SafeEditor, a multi-modal LLM that enables efficient, multi-round safety editing for any T2I model, advancing safety without sacrificing utility.
Findings
SafeEditor reduces over-refusal compared to prior methods.
It achieves a better safety-utility balance in T2I models.
Experimental results demonstrate superior safety editing performance.
Abstract
With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate…
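The post-hoc, multi-round loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate` stands in for any text-to-image model, `review` stands in for the safety editor, and the `EditStep` fields (thought, judgement, refined prompt) are borrowed from the dataset components the reviews mention. The function names, the `max_rounds` budget, the `None` fallback when the budget runs out, and the string-based toy "images" are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class EditStep:
    """One round of review: a rationale, a safety judgement, and a refined prompt."""
    thought: str
    is_safe: bool
    refined_prompt: str


def multi_round_safety_edit(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical T2I model: prompt -> image
    review: Callable[[str, str], EditStep],  # hypothetical editor: (prompt, image) -> step
    max_rounds: int = 3,                     # assumed editing budget
) -> Tuple[Optional[str], List[EditStep]]:
    """Generate, review, and regenerate until the image is judged safe.

    Returns the first image judged safe, or None if the budget is exhausted.
    """
    history: List[EditStep] = []
    current_prompt = prompt
    # One initial pass plus up to max_rounds edited regenerations.
    for _ in range(max_rounds + 1):
        image = generate(current_prompt)
        step = review(current_prompt, image)
        history.append(step)
        if step.is_safe:
            return image, history
        # Edit instead of refusing: retry with the refined prompt.
        current_prompt = step.refined_prompt
    return None, history


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return f"image<{prompt}>"


def toy_review(prompt: str, image: str) -> EditStep:
    if "gore" in image:
        return EditStep("graphic content detected", False, prompt.replace("gore", "paint"))
    return EditStep("no unsafe content found", True, prompt)


def always_unsafe(prompt: str, image: str) -> EditStep:
    return EditStep("still unsafe", False, prompt)


image, history = multi_round_safety_edit("a battle scene with gore", toy_generate, toy_review)
blocked, blocked_history = multi_round_safety_edit("x", toy_generate, always_unsafe, max_rounds=2)
```

Because the loop only calls `generate` and `review`, it does not depend on the internals of the image model, which is the sense in which the abstract calls the framework model-agnostic. Each extra round adds one generation and one review call, which is the latency cost the reviewers raise.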
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Important topic.
- Introduces the human cognitive process into identifying and refining unsafe content.
- This framework relies heavily on GPT-4o-generated supervision signals without validating their reliability. All components of the training data, MR-SafeEdit, are generated by GPT-4o (thought, judgement, refined prompt) and SD 3.5 (re-generated images). The paper therefore implicitly assumes that GPT-4o is accurate and consistent in both semantic understanding and safety classification of images. However, there is no quantitative or human-validated evidence provided to support this assu
Pros:
1. They propose a multi-round, post-hoc editing process where an unsafe generated image is iteratively refined until it meets safety standards, rather than being outright rejected.
2. To train a model for this task, they constructed a large-scale, multi-round, image-text interleaved dataset.
3. The experiments are comprehensive and the results are compelling.
Cons:
1. The technical contribution is unclear. The paper appears to propose only a pipeline and to adopt existing editing methods directly; it does not propose any new editing method for T2I safety.
2. The multi-round, iterative nature of SafeEditor (generate -> evaluate & edit -> potentially repeat) could introduce significant latency. For a real-time user-facing application, this could be a major bottleneck. The paper lacks any discussion or measurement of inference speed, which is
1. The proposed "post-hoc safety editing" is a highly innovative and practical paradigm. It mimics the human cognitive process of identifying and refining unsafe content, directly addressing the "one-size-fits-all" and over-refusal issues of existing filtering or prompt modification methods, which is critical for improving user experience.
2. The construction of the MR-SafeEdit dataset is a major contribution of this work. This large-scale dataset, comprising 27,253 multi-round editing instance
1. While the multi-round iterative editing paradigm is effective, it may introduce significant inference latency and computational overhead compared to single-pass filtering methods. The paper fails to provide an analysis of inference time or computational cost, which is crucial for the method's practical deployment. A discussion on the efficiency-performance trade-off is recommended.
2. The synthesis pipeline for the MR-SafeEdit dataset relies on GPT-4o. This dependency on a powerful, closed-s
- The paper clearly explains the problem setup and is well written and easy to follow.
- I appreciate the effort invested in building the MR-SafeEdit benchmark, which could represent a valuable contribution to the community. Its construction progressively increases the safety of output images, making it potentially useful for advanced training strategies.
My main concern with the manuscript lies in the motivation and design of the proposed approach. 1. Safety methods for text-to-image generation can generally be divided into two categories: methods for hosted models and methods for open-source models. Hosted-model approaches assume that users have access only through an API and include strategies such as prompt filtering and image analysis. When users have direct access to the model, these techniques become ineffective, as they can be easily dis
