GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
Yuhan Wang, Siwei Yang, Bingchen Zhao, Letian Zhang, Qing Liu, Yuyin Zhou, Cihang Xie

TL;DR
This paper introduces GPT-IMAGE-EDIT-1.5M, a large-scale, high-quality dataset for instruction-guided image editing, created by refining and unifying existing datasets using GPT-4o, to facilitate open research and improve open-source models.
Contribution
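The paper presents a new, publicly available dataset of more than 1.5 million image-editing triplets (instruction, source image, edited image), built by using GPT-4o to regenerate output images and selectively rewrite instructions from OmniEdit, HQ-Edit, and UltraEdit, enabling better training of open-source image editing models.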
Findings
Fine-tuned models achieve state-of-the-art performance on multiple benchmarks.
The dataset improves instruction following and perceptual quality.
Open-source models narrow the gap with proprietary systems.
Abstract
Recent advancements in large multimodal models like GPT-4o have set a new standard for high-fidelity, instruction-guided image editing. However, the proprietary nature of these models and their training data creates a significant barrier for open-source research. To bridge this gap, we introduce GPT-IMAGE-EDIT-1.5M, a publicly available, large-scale image-editing corpus containing more than 1.5 million high-quality triplets (instruction, source image, edited image). We systematically construct this dataset by leveraging the versatile capabilities of GPT-4o to unify and refine three popular image-editing datasets: OmniEdit, HQ-Edit, and UltraEdit. Specifically, our methodology involves 1) regenerating output images to enhance visual quality and instruction alignment, and 2) selectively rewriting prompts to improve semantic clarity. To validate the efficacy of our dataset, we fine-tune…
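The two-step refinement described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `EditTriplet` class, the `refine_triplet` function, and the three callables (`regenerate`, `should_rewrite`, `rewrite`) are hypothetical names, the images are represented by string IDs rather than pixels, and the rule for deciding when an instruction is rewritten is an assumption, since the abstract says only that rewriting is selective.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class EditTriplet:
    """One example: (instruction, source image, edited image)."""
    instruction: str
    source_image: str  # hypothetical: a path or ID standing in for pixel data
    edited_image: str


def refine_triplet(
    triplet: EditTriplet,
    regenerate: Callable[[str, str], str],
    should_rewrite: Callable[[str, str, str], bool],
    rewrite: Callable[[str, str, str], str],
) -> EditTriplet:
    """Apply the two refinement steps to a single triplet."""
    # Step 1: regenerate the output image from the source and instruction.
    edited = regenerate(triplet.source_image, triplet.instruction)
    # Step 2: rewrite the instruction only when the (assumed) check asks for it.
    instruction = triplet.instruction
    if should_rewrite(instruction, triplet.source_image, edited):
        instruction = rewrite(instruction, triplet.source_image, edited)
    return replace(triplet, instruction=instruction, edited_image=edited)


# Stand-in callables for demonstration only; a real pipeline would call
# an image-generation model and a multimodal language model here.
def fake_regenerate(source: str, instruction: str) -> str:
    return f"regen({source})"


def fake_should_rewrite(instruction: str, source: str, edited: str) -> bool:
    return len(instruction.split()) < 3  # assumed rule: very short prompts


def fake_rewrite(instruction: str, source: str, edited: str) -> str:
    return f"{instruction} (clarified)"


old = EditTriplet("add hat", "cat.png", "cat_old.png")
new = refine_triplet(old, fake_regenerate, fake_should_rewrite, fake_rewrite)
```

Passing the model calls in as plain callables keeps the sketch independent of any particular API; the source image is never modified, only the edited image and, optionally, the instruction.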
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The primary strength is the dataset itself. It is large (1.5M samples) and meticulously curated. The pipeline, which uses GPT-Image-1 for regeneration and GPT-4o for rewriting, directly addresses the known weaknesses (poor alignment, simple instructions) of prior datasets. More importantly, the authors commit to releasing the dataset, models, and code, which is a significant service to the community. 2. The decision to intentionally preserve difficult identity preservation (IP) cases is a sign
1. The paper's contribution is primarily empirical and data-centric. It does not propose a new model architecture, conditioning mechanism, or fusion strategy. Instead, it offers a thorough benchmark of existing components (SD3-IP2P, Flux, T5, Qwen-VL). While this analysis is valuable, the paper is an "analysis of what works" rather than a "proposal of a new method."
1. The paper presents the first publicly released, million-scale image editing dataset with unified formatting and high alignment between instructions and outputs, significantly advancing research in image editing. 2. The paper provides an in-depth investigation of channel-wise vs. token-wise conditioning, clearly demonstrating the superiority of token-wise conditioning in context-aware editing tasks. 3. Models trained on the proposed dataset achieve state-of-the-art (SOTA) performance across mu
1. After using GPT-Image-1 to generate edited images, did the authors apply any filtering or quality screening to these newly generated images? For instance, were samples with generation failures, severe distortions, or complete misalignment with the instruction removed? If filtering was performed, please detail the criteria and procedure used. 2. Why was knowledge distillation conducted exclusively from GPT-Image-1, rather than aggregating outputs from multiple top-tier closed-source or open-so
* The end-to-end data curation and training flow is easy to follow, with a helpful schematic that makes the method accessible to non-experts. * Conditioning types (channel- vs. token-wise) and text encoders (frozen T5 vs. Qwen2.5-VL) are compared on GEdit-EN and ImgEdit (Table 5). * Fine-tuning on the curated triplets reaches 7.66 on GEdit-EN-full, an absolute +0.17 over the GPT-Image-1 baseline.
**1. Minimal originality.** Beyond stitching DALL-E, GPT-Image-1, and GPT-4o into a data pipeline, the paper introduces no mechanism to automatically detect or correct failed edits or caption-rewrite hallucinations. This omission is consequential: despite distilling on ~1.5M triplets, the fine-tuned model shows only a marginal +0.17 on GEdit-EN-full and no compelling improvements on other benchmarks (see Tables 2, 3, and 4), consistent with noisy, unvetted supervision. A concrete, learnable m
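The channel-wise and token-wise conditioning that the reviews above compare can be illustrated with a small shape calculation. This sketch assumes the common usage of the two terms (channel-wise: source-image latents concatenated with the noisy latents along the channel axis; token-wise: source-image tokens appended to the noisy-image token sequence); the paper's exact implementation is not given in this excerpt, and the function names and example sizes are hypothetical.

```python
from typing import Tuple


def channel_wise_shape(
    noisy: Tuple[int, int, int], source: Tuple[int, int, int]
) -> Tuple[int, int, int]:
    """Shape after concatenating (C, H, W) latents along the channel axis."""
    c1, h1, w1 = noisy
    c2, h2, w2 = source
    if (h1, w1) != (h2, w2):
        raise ValueError("channel-wise conditioning needs matching H and W")
    return (c1 + c2, h1, w1)


def token_wise_shape(
    noisy_tokens: Tuple[int, int], source_tokens: Tuple[int, int]
) -> Tuple[int, int]:
    """Shape after concatenating (N, D) token sequences along the sequence axis."""
    n1, d1 = noisy_tokens
    n2, d2 = source_tokens
    if d1 != d2:
        raise ValueError("token-wise conditioning needs matching token width")
    return (n1 + n2, d1)


# Hypothetical sizes: 16-channel 64x64 latents, or 1024 tokens of width 64.
channel_result = channel_wise_shape((16, 64, 64), (16, 64, 64))
token_result = token_wise_shape((1024, 64), (1024, 64))
```

Under these assumptions, channel-wise conditioning widens the model's input layer but leaves the sequence length unchanged, while token-wise conditioning lengthens the sequence and lets attention relate edited regions to source regions directly.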
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Cell Image Analysis Techniques
