TL;DR
LatentUMM introduces a dual latent alignment framework that explicitly aligns the transformations mapping into and out of the shared latent space of a unified multimodal model, improving cross-modal consistency.
Contribution
LatentUMM proposes dual latent alignment, combined with latent dynamics stabilization, to improve semantic consistency in unified multimodal models.
Findings
Improves cross-modal semantic consistency across architectures
Enhances robustness via stochastic latent rollouts
Achieves better alignment between generation and re-encoding processes
Abstract
Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to…
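The drift between generation and re-encoding described in the abstract can be illustrated with a toy round-trip check. This is a minimal sketch under loud assumptions and is not the paper's method: the "generator" and "encoder" are hypothetical random linear maps, and the names `round_trip_drift`, `W_dec`, `W_enc_aligned`, and `W_enc_misaligned` are invented for illustration. It only shows that a shared latent space does not by itself guarantee a consistent round trip; the map into the space and the map out of it must also agree.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def round_trip_drift(z, decode, encode):
    """Return 1 - cos(z, encode(decode(z))); 0 means no drift."""
    return 1.0 - cosine(z, encode(decode(z)))


rng = np.random.default_rng(0)
dim_latent, dim_output = 8, 32

# Hypothetical linear "generator": latent -> output modality.
W_dec = rng.normal(size=(dim_output, dim_latent))

# Aligned encoder (assumption): the pseudo-inverse of the generator,
# so re-encoding a generated sample recovers the original latent.
W_enc_aligned = np.linalg.pinv(W_dec)

# Misaligned encoder (assumption): an unrelated random map that shares
# the same latent dimensionality but was never tied to the generator.
W_enc_misaligned = rng.normal(size=(dim_latent, dim_output))


def decode(v):
    return W_dec @ v


def encode_aligned(x):
    return W_enc_aligned @ x


def encode_misaligned(x):
    return W_enc_misaligned @ x


z = rng.normal(size=dim_latent)
drift_aligned = round_trip_drift(z, decode, encode_aligned)
drift_misaligned = round_trip_drift(z, decode, encode_misaligned)

print(f"aligned drift:    {drift_aligned:.3e}")
print(f"misaligned drift: {drift_misaligned:.3e}")
```

In this toy setting the aligned pair yields a drift that is numerically zero, while the misaligned pair does not, even though both encoders write into a latent space of the same shape.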
