Latent Swap Joint Diffusion for 2D Long-Form Latent Generation

Yusheng Dai; Chenxi Wang; Chang Li; Chen Wang; Jun Du; Kewei Li; Ruoyu Wang; Jiefeng Ma; Lei Sun; Jianqing Gao

arXiv:2502.05130·cs.SD·July 30, 2025

TL;DR

This paper presents Swap Forward (SaFa), a novel latent swap joint diffusion method that enhances long-form 2D and audio generation by improving spectrum coherence and cross-view consistency, outperforming existing methods in quality and speed.

Contribution

The paper introduces Self-Loop Latent Swap and Reference-Guided Latent Swap techniques to address spectrum aliasing and global consistency in joint diffusion, advancing long-form multi-view generation.

Findings

01 SaFa outperforms existing joint diffusion methods in audio quality.

02 SaFa achieves comparable panorama generation quality while running 2-20x faster.

03 SaFa demonstrates better high-frequency preservation and cross-view consistency.

Abstract

This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method for generating seamless and coherent long spectra and panoramas through latent swap joint diffusion across multiple views. We first investigate the spectrum aliasing problem that existing joint diffusion methods cause in spectrum-based audio generation. Through a comparative analysis of the VAE latent representations of Mel-spectra and RGB images, we identify that the failure arises from excessive suppression of high-frequency components during spectrum denoising, caused by the averaging operator. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap applied to the overlapping region of adjacent views. Leveraging the stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high-frequency components and avoids spectrum distortion.…
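The contrast between the averaging operator and a frame-level swap can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `average_overlap` and `swap_overlap`, the alternating odd/even frame mask, and the `(channels, freq, time)` latent layout are all assumptions made for illustration, and the abstract does not specify the exact mask or the step schedule.

```python
import numpy as np


def average_overlap(left, right, w):
    """Baseline: average the last w frames of `left` with the first w of `right`.

    Both views end up with the same averaged overlap (hypothetical helper).
    """
    merged = 0.5 * (left[..., -w:] + right[..., :w])
    new_left, new_right = left.copy(), right.copy()
    new_left[..., -w:] = merged
    new_right[..., :w] = merged
    return new_left, new_right


def swap_overlap(left, right, w):
    """Frame-level bidirectional swap (assumed alternating mask).

    Odd-indexed frames of the overlap are exchanged between the two views;
    even-indexed frames stay where they are. No values are blended.
    """
    odd = np.arange(w) % 2 == 1          # shape (w,), broadcasts over time axis
    lo = left[..., -w:]
    ro = right[..., :w]
    new_left, new_right = left.copy(), right.copy()
    new_left[..., -w:] = np.where(odd, ro, lo)
    new_right[..., :w] = np.where(odd, lo, ro)
    return new_left, new_right


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two adjacent latent views, layout (channels, freq, time) is an assumption.
    left, right = rng.standard_normal((2, 4, 8, 16))
    w = 6
    avg_left, _ = average_overlap(left, right, w)
    swp_left, _ = swap_overlap(left, right, w)
    # Averaging two independent signals shrinks their variance, while the
    # swap only relocates frames, so per-frame statistics are untouched.
    print("overlap variance after averaging:", avg_left[..., -w:].var())
    print("overlap variance after swapping: ", swp_left[..., -w:].var())
```

In this toy setting the averaged overlap loses variance because two differing signals are blended, which is a rough analogue of the high-frequency suppression the abstract attributes to averaging. The swap moves whole frames between views instead, so each frame keeps its original content.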

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Convolution · Max Pooling · Concatenated Skip Connection · U-Net · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion