HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, Cihang Xie

TL;DR
HQ-Edit is a large, high-quality dataset for instruction-based image editing created using advanced foundation models, significantly improving model performance and surpassing human-annotated data in quality.
Contribution
The paper presents a scalable pipeline for creating high-quality image editing datasets using GPT-4V and DALL-E 3, and introduces new evaluation metrics for assessing edit quality.
Findings
The HQ-Edit dataset contains around 200,000 high-quality edits.
Finetuning InstructPix2Pix on HQ-Edit achieves state-of-the-art results.
The proposed metrics, Alignment and Coherence, use GPT-4V to score the quality of image edit pairs.
Abstract
This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. Unlike prior approaches that rely on attribute guidance or human feedback to build datasets, we devise a scalable data collection pipeline leveraging advanced foundation models, namely GPT-4V and DALL-E 3. To ensure its high quality, diverse examples are first collected online, expanded, and then used to create high-quality diptychs featuring input and output images with detailed text prompts, followed by precise alignment ensured through post-processing. In addition, we propose two evaluation metrics, Alignment and Coherence, to quantitatively assess the quality of image edit pairs using GPT-4V. HQ-Edit's high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models. For…
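The diptych step in the abstract can be illustrated with a minimal sketch. It assumes a side-by-side layout with the input image on the left and the output image on the right, and it covers only the naive split into two halves; the alignment post-processing the abstract mentions is not described in this summary and is not reproduced here. The function name `split_diptych` is hypothetical.

```python
from typing import Tuple

import numpy as np


def split_diptych(diptych: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Split an H x W x C side-by-side diptych into (left, right) halves.

    Assumption: the left half is the input image and the right half is the
    edited output. This is an illustrative layout, not the paper's code.
    """
    if diptych.ndim != 3:
        raise ValueError("expected an H x W x C array")
    width = diptych.shape[1]
    half = width // 2
    # For an odd width, the middle column is dropped so both halves match.
    left = diptych[:, :half]
    right = diptych[:, width - half:]
    return left, right


# Toy diptych: black left panel, white right panel.
toy = np.zeros((4, 8, 3), dtype=np.uint8)
toy[:, 4:] = 255
source_img, edited_img = split_diptych(toy)
```

In practice the two halves of a generated diptych are rarely pixel-aligned, which is presumably why the pipeline adds a separate post-processing stage after this kind of split.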
Peer Reviews
Decision·ICLR 2025 Poster
1. Compared with previous work, the proposed dataset is of high quality in terms of image resolution, content/edit-type diversity, and prompt-image alignment. 2. The baseline method, InstructPix2Pix, finetuned on the proposed HQ-Edit achieves state-of-the-art performance. 3. The proposed data curation pipeline is scalable because it leverages a pretrained generative model (e.g. DALL-E 3) and a vision-language model (e.g. GPT-4 / GPT-4V). 4. The paper is well-organized and easy to follow.
1. Compared with previous work, e.g. MagicBrush, the source images in HQ-Edit are generated by DALL-E 3, which may introduce a distribution gap between AI-generated and photorealistic content. 2. An analysis of the necessity of diptych generation is missing. 3. The two proposed metrics, Alignment and Coherence, are the primary metrics in the main evaluation. Given the validated limitation of CLIP directional similarity, other commonly used metrics are missing for quantitative evaluation, which may le
1. The proposed HQ-Edit can serve as training data for instruction-based image editing, which can promote the development of this area. 2. The performance of the finetuned InstructPix2Pix demonstrates the effectiveness of HQ-Edit. 3. The proposed evaluation metrics are superior to the CLIP score.
1. It seems that HQ-Edit only contains non-rigid pair data. 2. There are many types of operations in instruction-based image editing (e.g., object addition, object removal, non-rigid operation, local transformation, global transformation). Figures 8 and 9 only show the transformation part. The authors should show the results of all these operations to make a comprehensive comparison. 3. Regarding metrics, since there is already a large amount of Human Evaluation Scores, why not us
1. High-Quality Dataset: The paper introduces HQ-Edit, a dataset with approximately 200,000 high-quality image edits, which is a significant contribution to the field of instruction-based image editing. 2. Advanced Foundation Models: Leveraging state-of-the-art models like GPT-4V and DALL-E 3 ensures that the dataset benefits from the latest advancements in AI, leading to high-resolution and detailed images. 3. Broad Coverage of Editing Operations: HQ-Edit covers a wide range of editing tasks, f
1. Synthetic Data Limitations: Although the synthetic images are useful for training, the trained model may not perform well on real images. 2. Constrained Persuasiveness of Evaluation Metrics: The paper only reports comparisons on the two evaluation metrics it proposes, Alignment and Coherence, without comparisons on more widely used metrics.
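For readers unfamiliar with the CLIP directional similarity that the reviews contrast with Alignment and Coherence, here is a minimal sketch of how that score is commonly defined: the cosine similarity between the change in image embeddings and the change in caption embeddings. The embeddings are assumed to come from a CLIP image encoder and text encoder, which are not included here; the function names and the toy vectors are illustrative only and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("direction vector has zero length")
    return dot / (norm_u * norm_v)


def directional_similarity(
    img_src: Sequence[float],
    img_edit: Sequence[float],
    txt_src: Sequence[float],
    txt_edit: Sequence[float],
) -> float:
    """Cosine between the image-space edit direction and the text-space one.

    Assumption: all four inputs are embeddings in a shared space, e.g. CLIP
    features of the source image, edited image, source caption, and edited
    caption.
    """
    d_img = [b - a for a, b in zip(img_src, img_edit)]
    d_txt = [b - a for a, b in zip(txt_src, txt_edit)]
    return cosine(d_img, d_txt)


# Toy 2-D embeddings: the image moves in the same direction as the caption.
same = directional_similarity([1.0, 0.0], [1.0, 1.0], [0.5, 0.0], [0.5, 2.0])
# Here the image moves opposite to the caption change.
opposite = directional_similarity([1.0, 0.0], [1.0, 1.0], [0.5, 0.0], [0.5, -2.0])
```

A score near 1 means the image changed in the direction the captions describe, while a score near -1 means it changed in the opposite direction. The score depends on captions for both images rather than on the edit instruction itself.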
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
