Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space

Xiaoce Wang; Sifan Zhou; Kaifei Wang; Leli Xu; Xuerui Qiu; Tao He; Ming Li

arXiv:2605.08250 · cs.CV · May 12, 2026


TL;DR

This paper identifies low-frequency drift in VAE latent space as a key cause of semantic drift in multi-turn diffusion transformer image editing and proposes a plug-and-play alignment method to mitigate it.

Contribution

The authors introduce VAE-LFA, a training-free, plug-and-play low-frequency alignment technique that reduces semantic drift in multi-turn diffusion transformer editing.

Findings

01

VAE-LFA significantly improves semantic consistency in multi-turn editing.

02

The method is effective for both white-box and black-box diffusion models.

03

VAE-LFA preserves high-frequency details while suppressing low-frequency drift.

Abstract

Recent advances in diffusion transformers (DiTs) have enabled promising single-turn image editing capabilities. However, multi-turn editing often leads to progressive semantic drift and quality degradation. In this work, we study this problem from a latent-space frequency perspective by decomposing the editing process into two functional components: VAE and DiT. Through systematic analysis in the VAE latent space, we uncover that the DiT introduces dominant low-frequency drift that accumulates as semantic misalignment across editing rounds, while the VAE contributes comparatively stable reconstruction bias. Based on this insight, we propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method that performs alignment in VAE latent space. VAE-LFA decomposes latent discrepancies across editing rounds via low-pass filtering, and aligns low-frequency statistics to an…
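The abstract is truncated before it states what the low-frequency statistics are aligned to, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a Gaussian low-pass filter in the Fourier domain, per-channel mean and standard deviation as the "statistics", and a hypothetical reference latent (for example, one from an earlier editing round). The function names, the `sigma` cutoff, and the `(C, H, W)` latent layout are all assumptions made for this example.

```python
import numpy as np


def gaussian_low_pass(latent, sigma=0.15):
    """Low-pass filter a (C, H, W) latent with a Gaussian mask in the Fourier domain.

    sigma is a hypothetical cutoff in cycles per sample; it is not taken from the paper.
    """
    _, h, w = latent.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies
    mask = np.exp(-(fx**2 + fy**2) / (2.0 * sigma**2))  # equals 1 at DC
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    return np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real


def align_low_frequency(edited, reference, sigma=0.15, eps=1e-6):
    """Match the low-frequency per-channel mean/std of `edited` to `reference`.

    Assumption: "statistics" means per-channel mean and standard deviation.
    The high-frequency residual of `edited` is added back unchanged.
    """
    low_e = gaussian_low_pass(edited, sigma)
    low_r = gaussian_low_pass(reference, sigma)
    high_e = edited - low_e  # high-frequency detail to preserve

    mu_e = low_e.mean(axis=(-2, -1), keepdims=True)
    sd_e = low_e.std(axis=(-2, -1), keepdims=True)
    mu_r = low_r.mean(axis=(-2, -1), keepdims=True)
    sd_r = low_r.std(axis=(-2, -1), keepdims=True)

    low_aligned = (low_e - mu_e) / (sd_e + eps) * sd_r + mu_r
    return low_aligned + high_e


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(4, 16, 16))  # stand-in for a VAE latent
    drifted = reference + 0.5                 # constant offset as a toy low-frequency drift
    aligned = align_low_frequency(drifted, reference)
    print("max error before:", np.abs(drifted - reference).max())
    print("max error after: ", np.abs(aligned - reference).max())
```

In this toy setup the drift is a constant per-channel offset, which lives entirely in the low-frequency band, so the alignment step removes it while leaving the high-frequency residual untouched. Real multi-turn drift would not be this clean, and a practical version would operate on latents produced by the editor's actual VAE encoder.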
