Multi-Reward as Condition for Instruction-based Image Editing
Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, Sijie Zhu

TL;DR
This paper introduces a multi-reward training framework for instruction-based image editing that leverages a new high-quality reward dataset and a novel integration method, significantly improving editing performance over existing models.
Contribution
The paper presents a multi-reward conditioned training approach, a new reward dataset, RewardEdit-20K, and a comprehensive evaluation benchmark, Real-Edit, for instruction-based image editing.
Findings
Models trained with multi-reward conditioning outperform models trained without reward conditioning.
The reward dataset improves instruction following and detail preservation.
The proposed framework enhances editing quality on real-world images.
Abstract
High-quality training triplets (instruction, original image, edited image) are essential for instruction-based image editing. Predominant training datasets (e.g., InsPix2Pix) are created using text-to-image generative models (e.g., Stable Diffusion, DALL-E) which are not trained for image editing. Accordingly, these datasets suffer from inaccurate instruction following, poor detail preserving, and generation artifacts. In this paper, we propose to address the training data quality issue with multi-perspective reward data instead of refining the ground-truth image quality. 1) we first design a quantitative metric system based on best-in-class LVLM (Large Vision Language Model), i.e., GPT-4o in our case, to evaluate the generation quality from 3 perspectives, namely, instruction following, detail preserving, and generation quality. For each perspective, we collected quantitative score in…
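The three-perspective scoring described in the abstract can be pictured with a small sketch. Everything here is an illustrative assumption: the `RewardRecord` class, the `reward_condition` function, the field names, the text format, and the `max_score` scale are hypothetical and are not taken from the paper, whose abstract is cut off above before the score range is stated.

```python
from dataclasses import dataclass

# The three perspectives named in the abstract.
PERSPECTIVES = ("instruction_following", "detail_preserving", "generation_quality")


@dataclass(frozen=True)
class RewardRecord:
    """Hypothetical container for the reward data attached to one training triplet."""
    instruction: str
    scores: dict        # perspective name -> numeric score
    max_score: int = 5  # assumed scale, not taken from the source

    def __post_init__(self):
        # Every perspective must have a score inside the assumed range.
        missing = [p for p in PERSPECTIVES if p not in self.scores]
        if missing:
            raise ValueError(f"missing scores for: {missing}")
        for name in PERSPECTIVES:
            value = self.scores[name]
            if not 0 <= value <= self.max_score:
                raise ValueError(f"{name} score {value} is outside 0..{self.max_score}")


def reward_condition(record: RewardRecord) -> str:
    """Render the scores as a text condition (hypothetical format)."""
    parts = [
        f"{name.replace('_', ' ')}: {record.scores[name]}/{record.max_score}"
        for name in PERSPECTIVES
    ]
    return "; ".join(parts)


if __name__ == "__main__":
    example = RewardRecord(
        instruction="make the sky pink",
        scores={"instruction_following": 4, "detail_preserving": 2, "generation_quality": 5},
    )
    print(reward_condition(example))
```

In a setup like this, the rendered string (or an embedding of it) would accompany each training triplet, so that a model could be asked for high scores on all three perspectives at inference time. How the paper actually injects the rewards into the editing model is not described in the text above.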
Peer Reviews
Decision: ICLR 2025 Poster
1. A novel reward-based instruction editing framework that introduces VLLM evaluation scores as well as textual reward feedback to improve the capability of instruction editing models. 2. The creation of Real-Edit provides a standardized approach to evaluating instruction-based editing methods in different scenarios. 3. The method shows superior performance in both quantitative metrics and qualitative results, indicating robust editing capabilities.
1. Both the quantitative assessment in Table 1 and the training strategy are based on the same GPT-4o scoring strategy, which makes the evaluation overly dependent on the priors of the VLLM and makes it difficult to objectively validate the strengths of this method. One solution to this dilemma is to adopt other assessment metrics (e.g., CLIP scores) on the three dimensions for comparison with other methods. 2. While experiments have shown that the introduction o
1. This paper introduces a multi-view reward mechanism. Instead of directly improving the quality of ground-truth images, the authors use GPT-4o to evaluate the training data from three key perspectives: instruction adherence, detail preservation, and generation quality. 2. The RewardEdit-20K dataset and the Real-Edit evaluation benchmark.
1. The multi-view reward mechanism used in this paper relies entirely on GPT-4o's evaluation. Although GPT-4o demonstrates strong capabilities in understanding and generating natural language, it may not fully capture the subtle nuances of human perception regarding image editing quality. 2. This reward is expensive to obtain because it relies on GPT-4o.
- This paper discusses the issues present in the current dataset and utilizes MLLM for data cleaning. Through experiments and comparisons, it demonstrates the effectiveness of the cleaned dataset and the new training methods. - Good writing and detailed experiments make this paper compelling.
- W.1: The paper lacks a comparison with RL in T2I methods like DPO-Diffusion[1]. If I understand correctly, I believe that the Multi-Reward Framework is conceptually similar to methods like DPO-Diffusion. Therefore, I think it is reasonable and necessary to articulate the comparisons and distinctions between these approaches, especially the method difference. - W.2: The article lacks some novelty. Of course, high-quality and abundant data can effectively enhance model performance, so I am uncer
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Concatenated Skip Connection · Max Pooling · Diffusion · Convolution · U-Net · Sparse Evolutionary Training
