Latent Swap Joint Diffusion for 2D Long-Form Latent Generation
Yusheng Dai, Chenxi Wang, Chang Li, Chen Wang, Jun Du, Kewei Li, Ruoyu Wang, Jiefeng Ma, Lei Sun, Jianqing Gao

TL;DR
This paper presents Swap Forward (SaFa), a novel latent swap joint diffusion method that enhances long-form 2D and audio generation by improving spectrum coherence and cross-view consistency, outperforming existing methods in quality and speed.
Contribution
The paper introduces Self-Loop Latent Swap and Reference-Guided Latent Swap techniques to address spectrum aliasing and global consistency in joint diffusion, advancing long-form multi-view generation.
Findings
SaFa outperforms existing joint diffusion methods in audio quality.
SaFa achieves comparable panorama generation quality while running 2-20x faster.
SaFa demonstrates better high-frequency preservation and view consistency.
Abstract
This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method to generate seamless and coherent long spectra and panoramas through latent swap joint diffusion across multiple views. We first investigate the spectrum aliasing problem in spectrum-based audio generation caused by existing joint diffusion methods. Through a comparative analysis of the VAE latent representations of Mel-spectra and RGB images, we identify that the failure arises from excessive suppression of high-frequency components during the spectrum denoising process due to the averaging operator. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap applied to the overlapping region of adjacent views. Leveraging stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high-frequency components and avoids spectrum distortion.…
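The contrast the abstract draws between the averaging operator and a frame-level swap can be illustrated with a small sketch. This is not the authors' implementation: the function names, the list-of-frames layout, and the choice to exchange every second frame of the overlap are assumptions made only for illustration, since the abstract does not say which frames are swapped. The sketch shows that averaging blends the two views' values in the overlap, while a swap only moves existing frames between views and leaves their values untouched.

```python
def average_overlap(left, right, overlap):
    """Baseline: replace the overlapping frames of two adjacent views by their mean."""
    assert 0 < overlap <= min(len(left), len(right))
    merged = [
        [(a + b) / 2 for a, b in zip(frame_l, frame_r)]
        for frame_l, frame_r in zip(left[-overlap:], right[:overlap])
    ]
    return left[:-overlap] + merged, merged + right[overlap:]


def swap_overlap(left, right, overlap):
    """Hypothetical frame-level bidirectional swap on the overlap.

    Assumption: every second overlapping frame (odd index) is exchanged
    between the two views; the source does not specify the pattern.
    """
    assert 0 < overlap <= min(len(left), len(right))
    new_left = [list(frame) for frame in left]
    new_right = [list(frame) for frame in right]
    offset = len(left) - overlap  # index in `left` where the overlap starts
    for i in range(overlap):
        if i % 2 == 1:
            # Exchange the frame in both directions; no values are blended.
            new_left[offset + i], new_right[i] = new_right[i], new_left[offset + i]
    return new_left, new_right


# Two toy views of four frames each, one latent value per frame.
left_view = [[1.0], [2.0], [3.0], [4.0]]
right_view = [[10.0], [20.0], [30.0], [40.0]]

avg_left, avg_right = average_overlap(left_view, right_view, overlap=2)
swp_left, swp_right = swap_overlap(left_view, right_view, overlap=2)
```

In this toy setup the averaged overlap contains new blended values, whereas the swapped views contain only frames that already existed in one of the two inputs. In an actual joint diffusion loop, an operator of this kind would be applied to the latents of adjacent views at each denoising step.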
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Convolution · Max Pooling · Concatenated Skip Connection · U-Net · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
