Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control

Zeqian Long; Mingzhe Zheng; Kunyu Feng; Xinhua Zhang; Hongyu Liu; Harry Yang; Linfeng Zhang; Qifeng Chen; Yue Ma

arXiv:2508.08134·cs.CV·February 24, 2026

3 Reviews

TL;DR

Follow-Your-Shape introduces a novel shape-aware image editing framework that enables precise, controllable shape modifications while preserving non-target regions, outperforming existing flow-based models especially in large-scale shape transformations.

Contribution

The paper presents a training-free, mask-free method utilizing Trajectory Divergence Maps and Scheduled KV Injection for improved shape editing, along with a new benchmark for evaluation.

Findings

01. Achieves superior shape editing fidelity and control.

02. Effectively handles large-scale shape transformations.

03. Outperforms existing flow-based models in visual quality.

Abstract

While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios -- particularly those involving large-scale shape transformations. When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality. We propose Follow-Your-Shape, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Motivated by the divergence between inversion and editing trajectories, we compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. The TDM enables precise localization of editable regions and guides a Scheduled KV Injection…
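The TDM idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-token divergence is the L2 norm of the velocity difference, averaged over timesteps and min-max normalized to [0, 1]. The function names `trajectory_divergence_map` and `editable_mask`, the 0.5 threshold, and the toy velocities are all hypothetical, and the paper's actual aggregation and normalization may differ.

```python
import math

def trajectory_divergence_map(v_inv, v_edit):
    """Per-token divergence between two velocity trajectories.

    v_inv, v_edit: nested lists shaped [timesteps][tokens][channels].
    Assumption: L2 norm per token, mean over timesteps, min-max normalized.
    """
    num_steps = len(v_inv)
    num_tokens = len(v_inv[0])
    scores = [0.0] * num_tokens
    for step_inv, step_edit in zip(v_inv, v_edit):
        for i, (a, b) in enumerate(zip(step_inv, step_edit)):
            # L2 distance between the two velocity vectors of token i
            scores[i] += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scores = [s / num_steps for s in scores]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # No token diverges more than any other: nothing to localize
        return [0.0] * num_tokens
    return [(s - lo) / (hi - lo) for s in scores]

def editable_mask(tdm, threshold=0.5):
    """Mark tokens whose normalized divergence reaches the (hypothetical) threshold."""
    return [score >= threshold for score in tdm]

# Toy data: 2 timesteps, 3 tokens, 2 channels. Only token 2 diverges strongly.
v_inv = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
]
v_edit = [
    [[0.0, 0.0], [1.0, 1.0], [5.0, 4.0]],
    [[0.0, 0.0], [1.0, 2.0], [5.0, 4.0]],
]
tdm = trajectory_divergence_map(v_inv, v_edit)
mask = editable_mask(tdm)
```

In this toy case the third token is the only one flagged as editable. In the framework described above, such a map would then guide where source features are preserved and where the edit is allowed to take effect.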

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

- The general insight of the paper to use magnitude of trajectory difference for localizing edits is both clever and intuitive.
- The methods were generally well-motivated and clearly explained.
- The paper is overall well-written and polished.

Weaknesses

- Figure 2 is quite confusing. For example, is that row in the top right a legend for left and right sides of the figure? If so, it could be labelled more clearly. Also, for the bottom figure, the outline colors of the frames (particularly blue and orange) are very hard to notice.
- The quantitative evaluation only includes results on the paper's custom evaluation dataset, but does not report results of a third-party editing dataset, such as PIE-Bench. Although PIE-Bench does not isolate shape c

Reviewer 02·Rating 6·Confidence 4

Strengths

- Trajectory-based region control. TDM offers a conceptually grounded way to localize editable regions by exploiting velocity differences in rectified-flow trajectories, avoiding reliance on noisy attention maps or external segmentation.
- Training-free and mask-free pipeline. The method operates purely with a pre-trained FLUX model and does not require hand-drawn or model-generated masks, which reduces annotation overhead and simplifies deployment.
- Consistent empirical gains. Across ReShap

Weaknesses

- Global PSNR and LPIPS cannot disentangle background preservation from foreground edits, so the core claim of “preserving non-edited regions” is only indirectly tested. Some region-restricted metrics would provide stronger evidence.
- The paper does not empirically compare TDM to simpler region-selection strategies (e.g. DiffEdit-style prediction differences, cross-attention masks) when used within the same staged KV injection scheme, leaving the unique benefit of TDM somewhat under-quantified
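The reviewer's point about region-restricted metrics can be made concrete with a small sketch. It is only an illustration of the general idea, not a metric from the paper: it assumes a per-pixel boolean background mask is available (for example from benchmark annotations, which is an assumption here), and `masked_psnr` is a hypothetical helper name.

```python
import math

def masked_psnr(source, edited, background, max_value=255.0):
    """PSNR computed only over pixels where `background` is True.

    source, edited: 2D lists of grayscale pixel values.
    background: 2D list of booleans marking non-target pixels (assumed given).
    """
    sq_err = 0.0
    count = 0
    for src_row, edit_row, bg_row in zip(source, edited, background):
        for s, e, keep in zip(src_row, edit_row, bg_row):
            if keep:
                sq_err += (s - e) ** 2
                count += 1
    if count == 0:
        raise ValueError("background mask selects no pixels")
    mse = sq_err / count
    if mse == 0.0:
        return math.inf  # identical background
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 2x2 image: the bottom-left pixel is the edited object, the rest is background.
source = [[10, 10], [10, 10]]
edited = [[10, 12], [200, 10]]
background = [[True, True], [False, True]]
everything = [[True, True], [True, True]]

bg_psnr = masked_psnr(source, edited, background)
global_psnr = masked_psnr(source, edited, everything)
```

Here the global score is dominated by the intended foreground change, while the background-only score reflects how well the untouched pixels were kept, which is the distinction the reviewer asks the evaluation to draw.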

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The paper identifies a significant and challenging problem. Large-scale shape editing is a major failure case for most SOTA methods.
2. Using divergence between source and target flow trajectories to infer editable regions is original and well-motivated. It moves beyond cross-attention saliency or explicit user masks toward a model-intrinsic notion of semantic locality.
3. The proposed approach can be applied to existing pre-trained flow models without finetuning or additional training data.

Weaknesses

1. The proposed method employs ControlNet guidance (depth/canny maps) during the final editing stage to preserve structure and edges. However, none of the baselines (FlowEdit, RF-Solver, KV-Edit, MasaCtrl, DiT4Edit, etc.) use ControlNet or equivalent structural conditioning. As ControlNet introduces a strong external geometric prior, the resulting improvements in PSNR, LPIPS, and boundary fidelity cannot be attributed solely to the proposed TDM mechanism. As a result, the unfair baseline compari
