Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space
Xiaoce Wang, Sifan Zhou, Kaifei Wang, Leli Xu, Xuerui Qiu, Tao He, Ming Li

TL;DR
This paper identifies low-frequency drift in VAE latent space as a key cause of semantic drift in multi-turn diffusion transformer image editing and proposes a plug-and-play alignment method to mitigate it.
Contribution
The authors introduce VAE-LFA, a training-free, plug-and-play low-frequency alignment technique that reduces semantic drift in multi-turn diffusion transformer editing.
Findings
VAE-LFA significantly improves semantic consistency in multi-turn editing.
The method is effective for both white-box and black-box diffusion models.
VAE-LFA preserves high-frequency details while suppressing low-frequency drift.
Abstract
Recent advances in diffusion transformers (DiTs) have enabled promising single-turn image editing capabilities. However, multi-turn editing often leads to progressive semantic drift and quality degradation. In this work, we study this problem from a latent-space frequency perspective by decomposing the editing process into two functional components: VAE and DiT. Through systematic analysis in the VAE latent space, we uncover that the DiT introduces dominant low-frequency drift that accumulates as semantic misalignment across editing rounds, while the VAE contributes comparatively stable reconstruction bias. Based on this insight, we propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method that performs alignment in VAE latent space. VAE-LFA decomposes latent discrepancies across editing rounds via low-pass filtering, and aligns low-frequency statistics to an…
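The abstract is cut off before it names the alignment target, so the following is only a minimal sketch of the general idea (split a latent into low and high frequency bands, re-align the low band, keep the high band untouched), not the authors' implementation. Everything specific here is an assumption made for illustration: the function names `low_pass` and `align_low_frequency`, the use of a hypothetical `reference` latent from an earlier round, an ideal FFT low-pass filter with a `cutoff` parameter, and per-channel mean and standard deviation as the "low-frequency statistics".

```python
import numpy as np


def low_pass(latent, cutoff=0.25):
    """Ideal FFT low-pass over the spatial axes of a (C, H, W) latent.

    `cutoff` is an assumed parameter: the fraction of the Nyquist
    frequency that is kept.
    """
    _, h, w = latent.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies, cycles/sample
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies, cycles/sample
    mask = np.sqrt(fy ** 2 + fx ** 2) <= cutoff * 0.5
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    return np.real(np.fft.ifft2(spectrum * mask, axes=(-2, -1)))


def align_low_frequency(edited, reference, cutoff=0.25, eps=1e-6):
    """Match the low-band channel statistics of `edited` to `reference`.

    Assumption: "statistics" means per-channel mean and standard
    deviation. The high-frequency residual of `edited` is added back
    unchanged.
    """
    low_edit = low_pass(edited, cutoff)
    high_edit = edited - low_edit  # detail band, left as-is
    low_ref = low_pass(reference, cutoff)

    axes = (1, 2)
    mu_e = low_edit.mean(axis=axes, keepdims=True)
    sd_e = low_edit.std(axis=axes, keepdims=True)
    mu_r = low_ref.mean(axis=axes, keepdims=True)
    sd_r = low_ref.std(axis=axes, keepdims=True)

    # Normalize the drifting low band, then rescale to the reference.
    low_aligned = (low_edit - mu_e) / (sd_e + eps) * sd_r + mu_r
    return low_aligned + high_edit
```

In a multi-turn loop, a function like this would be applied to the latent produced by each editing round before it is decoded or passed to the next round. Which latent serves as the reference, and which statistics are matched, are choices the truncated abstract does not settle.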
