A training-free framework for high-fidelity appearance transfer via diffusion transformers
Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi

TL;DR
This paper introduces a training-free framework that uses diffusion transformers and a novel attention-sharing mechanism to achieve high-fidelity appearance transfer while preserving scene structure.
Contribution
It presents what the authors describe as the first training-free method for controlling diffusion transformers in appearance transfer, disentangling structure from appearance.
Findings
Outperforms specialized methods in appearance transfer tasks.
Operates effectively at 1024px resolution.
Achieves state-of-the-art results in structural preservation and appearance fidelity.
Abstract
Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference, guided by geometric priors. Our unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material…
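The abstract does not spell out how the attention-sharing mechanism works, so the following is only a minimal sketch of the general idea behind shared attention, not the authors' implementation. It assumes the target (source-structure) tokens keep their own queries and attend jointly over their own keys/values and those of the reference image. The function name `shared_attention` and the scalar `ref_bias` (a crude stand-in for any guidance that strengthens or weakens the reference contribution) are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref, ref_bias=0.0):
    """Target queries attend over target AND reference tokens.

    q_tgt, k_tgt, v_tgt: arrays of shape (n_tgt, d)
    k_ref, v_ref:        arrays of shape (n_ref, d)
    ref_bias: hypothetical scalar added to the reference logits;
              large negative values switch the reference off.
    """
    d = q_tgt.shape[-1]
    n_tgt = k_tgt.shape[0]
    # Concatenate keys/values so one softmax covers both token sets.
    k = np.concatenate([k_tgt, k_ref], axis=0)
    v = np.concatenate([v_tgt, v_ref], axis=0)
    logits = q_tgt @ k.T / np.sqrt(d)
    logits[:, n_tgt:] += ref_bias  # bias only the reference columns
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

# Toy usage with random features (illustrative shapes only).
rng = np.random.default_rng(0)
n_tgt, n_ref, d = 6, 4, 8
q = rng.normal(size=(n_tgt, d))
k_t = rng.normal(size=(n_tgt, d))
v_t = rng.normal(size=(n_tgt, d))
k_r = rng.normal(size=(n_ref, d))
v_r = rng.normal(size=(n_ref, d))

out, weights = shared_attention(q, k_t, v_t, k_r, v_r)
print(out.shape, weights.shape)  # (6, 8) (6, 10)
```

With a very negative `ref_bias` the reference columns receive zero weight and the result reduces to ordinary self-attention over the target tokens, which is one way to see that the reference only adds appearance information on top of the existing structure pathway in this toy setup.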
