ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

Di Chang; Mingdeng Cao; Yichun Shi; Bo Liu; Shengqu Cai; Shijie Zhou; Weilin Huang; Gordon Wetzstein; Mohammad Soleymani; Peng Wang

arXiv:2506.03107 · cs.CV · June 12, 2025

TL;DR

ByteMorph introduces a large-scale dataset and baseline model for instruction-guided image editing involving complex non-rigid motions, addressing a significant gap in dynamic scene editing capabilities.

Contribution

The paper presents ByteMorph, a new dataset and model specifically designed for non-rigid motion editing, expanding the scope of instruction-based image editing beyond static and rigid transformations.

Findings

01

ByteMorph-6M contains over 6 million high-quality image pairs.

02

ByteMorpher outperforms existing methods on the ByteMorph-Bench benchmark.

03

Comprehensive evaluation reveals strengths and limitations of current instruction-guided editing techniques.

Abstract

Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions, poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark ByteMorph-Bench. Both capture a wide variety of non-rigid motion types…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper tackles a well-defined, high-impact limitation of current image editing models. Non-rigid motion is a fundamental aspect of the visual world, and enabling instruction-based control over it is important.
2. The experimental results are comprehensive, which makes the paper solid.

Weaknesses

My main concern is the dataset construction. The training set (ByteMorph-6M) relies entirely on a synthetic pipeline (ChatGPT-4o and Seaweed); that is, all 6.4M samples come from a single video generation model. The dataset's quality is therefore bounded by the capabilities of that specific video model, and training on ByteMorph-6M appears to amount to distilling Seaweed. Specifically, if the Seaweed model has systematic limitations in generating certain types of m

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The dataset is well-motivated. ByteMorph-6M effectively addresses the gap in motion-focused editing data with comprehensive coverage of non-rigid transformations. The release of datasets, benchmarks, and code provides significant value to the community.
2. The automated construction using video generation and VLMs ensures scalability while maintaining quality and semantic coherence.

Weaknesses

1. The technical contribution is limited, as the methodology relies purely on fine-tuning a pre-existing DiT model (FLUX.1-dev) on ByteMorph-6M without introducing architectural innovations or tailored optimizations.
2. The experimental analysis does not sufficiently address the trade-offs of fine-tuning on ByteMorph-6M. While it is intuitive that specialization improves motion-specific performance, the paper omits evaluation of the model's original capabilities on standard instruction-based ben

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Elevates motion (non-rigid, articulation, camera pose) as a first-class editing dimension; introduces $\mathrm{CLIP\text{-}D}_{\text{img}}$ to evaluate edits as *changes* rather than absolute content similarity.
- Large-scale training set + curated hard benchmark; broad comparisons (open-source & industrial), repeated sampling, and both human and VLM judgments.
- Clear problem framing and taxonomy; tables position prior editing datasets/methods effectively; training details (backbone, losses
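
The idea of scoring an edit as a *change* rather than by absolute similarity can be sketched as a directional cosine similarity. This is only one plausible reading of the reviewer's description, not the paper's definition of $\mathrm{CLIP\text{-}D}_{\text{img}}$: the function name `directional_similarity` is hypothetical, and the inputs are assumed to be precomputed image embeddings (for example from a CLIP image encoder) for the source, the edited output, and a reference target.

```python
import math


def directional_similarity(src, edited, reference, eps=1e-12):
    """Cosine similarity between two edit directions in embedding space.

    Assumption: src, edited, and reference are equal-length sequences of
    floats (precomputed image embeddings). The score compares the direction
    (edited - src) with the direction (reference - src), so an output that
    merely copies the source earns no credit.
    """
    d_pred = [e - s for e, s in zip(edited, src)]
    d_ref = [r - s for r, s in zip(reference, src)]
    norm_pred = math.sqrt(sum(x * x for x in d_pred))
    norm_ref = math.sqrt(sum(x * x for x in d_ref))
    # A zero-length direction means "no change"; define the score as 0.0.
    if norm_pred < eps or norm_ref < eps:
        return 0.0
    dot = sum(p * r for p, r in zip(d_pred, d_ref))
    return dot / (norm_pred * norm_ref)
```

Under this sketch, an output identical to the source scores 0.0 even though its absolute similarity to the source is maximal, which is the behavior a change-based metric is meant to capture.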

Weaknesses

- Training relies on synthesized videos; even with filtering, motion/texture statistics may diverge from real-world photos. Add more purely real frame-pair data or zero/low-shot tests on real-image edit benchmarks to quantify the gap.
- Provide a larger rank-correlation study (Spearman/Kendall) per category (camera/human/object/interaction) and analyze systematic failure modes (e.g., composite camera + articulation edits).
- Since training uses latent concatenation (source, target), include a co
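
The per-category rank-correlation study requested above could look roughly like the following. This is a minimal sketch, not the authors' evaluation code: the record layout `(category, human_score, vlm_score)` and the function names are assumptions made for illustration. Spearman's coefficient is computed as the Pearson correlation of average ranks, which handles tied scores.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(vx * vy)


def per_category_spearman(records):
    """Group (category, human_score, vlm_score) records and score each group."""
    grouped = {}
    for category, human, vlm in records:
        humans, vlms = grouped.setdefault(category, ([], []))
        humans.append(human)
        vlms.append(vlm)
    return {c: spearman(h, v) for c, (h, v) in grouped.items()}
```

Reporting one coefficient per category (camera, human, object, interaction) rather than a single pooled value would show whether the VLM judge tracks human preference uniformly or only on some motion types.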

Code & Models

Repositories

bytedance-seed/bm-code
PyTorch · Official

Models

🤗 ByteDance-Seed/BM-Model
model · ♡ 3

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques · Multimodal Machine Learning Applications

Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Diffusion · Focus