Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li

TL;DR
Uni-Edit introduces a unified, scalable approach to train multimodal models for image understanding, generation, and editing using a single task, dataset, and training stage, overcoming previous multi-task conflicts.
Contribution
The paper proposes Uni-Edit as the first general task for tuning Unified Multimodal Models (UMMs), together with an automated data synthesis pipeline that transforms VQA data into complex editing instructions.
Findings
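Tuning on Uni-Edit improves understanding, generation, and editing at once, without the multi-stage pipelines, data mixing, or balancing tricks that mixed multi-task training requires.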
The Uni-Edit-148k dataset enables effective training for complex editing tasks.
Experiments show comprehensive performance gains on BAGEL and Janus-Pro.
Abstract
Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such a strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely…
