ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions
Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang

TL;DR
ByteMorph introduces a large-scale dataset and baseline model for instruction-guided image editing involving complex non-rigid motions, addressing a significant gap in dynamic scene editing capabilities.
Contribution
The paper presents ByteMorph, a framework comprising a new dataset (ByteMorph-6M) and a baseline model (ByteMorpher) designed for non-rigid motion editing, expanding the scope of instruction-based image editing beyond static scenes and rigid transformations.
Findings
ByteMorph-6M contains over 6 million high-resolution image editing pairs.
ByteMorpher outperforms existing methods on the ByteMorph-Bench benchmark.
Comprehensive evaluation reveals strengths and limitations of current instruction-guided editing techniques.
Abstract
Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions, poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark ByteMorph-Bench. Both capture a wide variety of non-rigid motion types…
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper tackles a well-defined and high-impact limitation of current image editing models. Non-rigid motion is a fundamental aspect of the visual world, and enabling instruction-based control over it is very important.
2. The experimental results are comprehensive, which makes the paper very solid.
My main concern is the dataset construction. The training set (ByteMorph-6M) relies entirely on a synthetic pipeline (ChatGPT-4o and Seaweed), so all 6.4M samples come from a single video generation model. The dataset's quality is therefore limited by the capabilities of that specific video model, and training on ByteMorph-6M appears to amount to distilling Seaweed. Specifically, if the Seaweed model has systematic limitations in generating certain types of m
1. The dataset is well-motivated. ByteMorph-6M effectively addresses the gap in motion-focused editing data with comprehensive coverage of non-rigid transformations. The release of datasets, benchmarks, and code provides significant value to the community.
2. The automated construction using video generation and VLMs ensures scalability while maintaining quality and semantic coherence.
1. The technical contribution is limited, as the methodology relies purely on fine-tuning a pre-existing DiT model (FLUX.1-dev) on ByteMorph-6M without introducing architectural innovations or tailored optimizations.
2. The experimental analysis does not sufficiently address the trade-offs of fine-tuning on ByteMorph-6M. While it is intuitive that specialization improves motion-specific performance, the paper omits evaluation of the model's original capabilities on standard instruction-based ben
- Elevates motion (non-rigid, articulation, camera pose) as a first-class editing dimension; introduces $ \mathrm{CLIP\text{-}D}_{\text{img}} $ to evaluate edits as *changes* rather than absolute content similarity.
- Large-scale training set + curated hard benchmark; broad comparisons (open-source & industrial), repeated sampling, and both human and VLM judgments.
- Clear problem framing and taxonomy; tables position prior editing datasets/methods effectively; training details (backbone, losses
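The point about scoring edits as changes rather than absolute similarity can be illustrated with a small sketch. The exact definition of $ \mathrm{CLIP\text{-}D}_{\text{img}} $ is not given on this page, so the form below is an assumption: the cosine similarity between the embedding shift produced by the edit and the shift toward a reference target. The function names and the toy 2-D vectors (standing in for CLIP image embeddings) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def directional_similarity(src, edited, ref):
    """ASSUMED form of a direction-based image metric (not taken from the paper):
    compare the change (edited - src) with the change (ref - src)."""
    d_edit = [e - s for e, s in zip(edited, src)]
    d_ref = [r - s for r, s in zip(ref, src)]
    return cosine(d_edit, d_ref)

# Toy stand-ins for image embeddings (hypothetical values).
src = [1.0, 0.0]
ref = [1.0, 2.0]
good_edit = [1.0, 1.0]   # moves in the same direction as the reference
no_edit = [1.0, 0.0]     # output is a copy of the source

absolute_copy_score = cosine(no_edit, ref)                       # still positive
directional_copy_score = directional_similarity(src, no_edit, ref)  # zero: no change
directional_good_score = directional_similarity(src, good_edit, ref)
```

Under this assumed form, returning the source unchanged earns a positive absolute similarity to the reference but a directional score of zero, which is the behaviour the reviewer's description suggests.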
- Training relies on synthesized videos; even with filtering, motion/texture statistics may diverge from real-world photos. Add more purely real frame-pair data or zero/low-shot tests on real-image edit benchmarks to quantify the gap.
- Provide a larger rank-correlation study (Spearman/Kendall) per category (camera/human/object/interaction) and analyze systematic failure modes (e.g., composite camera + articulation edits).
- Since training uses latent concatenation (source, target), include a co
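The per-category rank-correlation study requested above could be computed along these lines. This is a minimal standard-library sketch of Spearman's rank correlation (Pearson correlation of average ranks); the category names follow the review, but the score lists are made-up placeholders, not results from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))

# Placeholder (human score, automatic score) lists per category; illustrative only.
scores = {
    "camera": ([1, 2, 3, 4], [10, 20, 30, 40]),
    "human": ([1, 2, 3, 4], [4, 3, 2, 1]),
}
rho_by_category = {name: spearman(h, a) for name, (h, a) in scores.items()}
```

Reporting one coefficient per category, rather than a single pooled value, is what would expose category-specific disagreement between human and automatic judgments.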
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques · Multimodal Machine Learning Applications
Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Diffusion · Focus
