WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

Jian Yang; Dacheng Yin; Xiaoxuan He; Yong Li; Fengyun Rao; Jing Lyu; Wei Zhai; Yang Cao; Zheng-Jun Zha

arXiv:2512.02536·cs.CV·December 3, 2025

Open Access

TL;DR

This paper introduces Noisy Query Tokens and a VAE branch to improve the integration of Vision-Language Models with Diffusion Models, enabling better task generalization and continual learning in multimodal applications.

Contribution

We propose Noisy Query Tokens and a VAE branch with linear projection to improve task generalization and continual learning when bridging Vision-Language Models with Diffusion Models.

Findings

01 Mitigates generalization collapse in multimodal models
02 Enables stable continual learning across diverse tasks
03 Improves fine-grained image detail recovery

Abstract

Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
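The abstract does not specify how the two components are implemented, so the following is only a rough sketch of one possible reading: query tokens are drawn from Gaussian noise rather than read from a fixed learned table, they pass through the VLM alongside the prompt, and VAE latents are mapped by a single linear layer into the same conditioning sequence. Every name, shape, and the identity stand-in for the VLM are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def sample_noisy_queries(num_queries, dim, rng):
    """Draw query tokens from a standard Gaussian (assumed reading of 'noisy')."""
    return rng.standard_normal((num_queries, dim))

def vae_branch(latents, weight, bias):
    """Flatten VAE latents (C, H, W) into H*W tokens and project them linearly."""
    c, h, w = latents.shape
    tokens = latents.reshape(c, h * w).T      # (H*W, C)
    return tokens @ weight + bias             # (H*W, dim)

def build_condition(prompt_emb, queries, vae_tokens, vlm):
    """Run prompt + queries through the VLM, keep the query outputs,
    and append the projected VAE tokens as conditioning for the diffusion model."""
    hidden = vlm(np.concatenate([prompt_emb, queries], axis=0))
    query_out = hidden[-queries.shape[0]:]
    return np.concatenate([query_out, vae_tokens], axis=0)

rng = np.random.default_rng(0)
dim = 16
prompt_emb = rng.standard_normal((5, dim))     # hypothetical prompt embeddings
queries = sample_noisy_queries(8, dim, rng)
latents = rng.standard_normal((4, 2, 2))       # hypothetical VAE latents
weight = rng.standard_normal((4, dim)) * 0.1   # linear projection (hypothetical)
bias = np.zeros(dim)

vae_tokens = vae_branch(latents, weight, bias)
# The identity function stands in for a real VLM forward pass.
condition = build_condition(prompt_emb, queries, vae_tokens, vlm=lambda x: x)
print(condition.shape)
```

In an actual system the VLM, the projection, and the diffusion model would be trained end to end; this sketch only shows how the tensors could be assembled.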

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling