Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
Jihai Zhang, Tianle Li, Linjie Li, Zhengyuan Yang, Yu Cheng

TL;DR
This paper systematically investigates whether understanding and generation tasks enhance each other when trained together in a unified vision-language model (VLM). It reports that mixed training data and better alignment improve generalization across the two task types.
Contribution
It provides the first comprehensive analysis of cross-task generalization in unified VLMs, highlighting the importance of data alignment and knowledge transfer between understanding and generation.
Findings
Unified VLMs trained with mixed data benefit both understanding and generation.
Better alignment between input and output spaces improves generalization.
Generation knowledge transfers to understanding tasks within the base language model.
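To make "mixed data" concrete, the snippet below is a minimal sketch of building one training stream from understanding and generation examples at a chosen ratio. It is not the authors' pipeline: the function name `build_mixed_stream`, the `gen_ratio` parameter, and the string placeholders standing in for examples are all assumptions made for illustration.

```python
import random


def build_mixed_stream(und_examples, gen_examples, n, gen_ratio=0.5, seed=0):
    """Return n task-tagged examples, a fraction gen_ratio of them generation.

    Hypothetical helper: examples are sampled with replacement from each
    pool, tagged with their task name, then shuffled into a single stream.
    """
    rng = random.Random(seed)
    n_gen = round(n * gen_ratio)  # number of generation examples
    n_und = n - n_gen             # remainder goes to understanding
    stream = [("understanding", rng.choice(und_examples)) for _ in range(n_und)]
    stream += [("generation", rng.choice(gen_examples)) for _ in range(n_gen)]
    rng.shuffle(stream)           # interleave the two tasks
    return stream


# Placeholder data; real examples would be image-text pairs.
stream = build_mixed_stream(["u1", "u2"], ["g1"], n=10, gen_ratio=0.3)
```

Setting `gen_ratio` to 0 or 1 yields a single-task stream, which gives a task-specific baseline drawn from the same sampling code as the mixed setting.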
Abstract
Recent advancements in unified vision-language models (VLMs), which integrate both visual understanding and generation capabilities, have attracted significant attention. The underlying hypothesis is that a unified architecture with mixed training on both understanding and generation tasks can enable mutual enhancement between understanding and generation. However, this hypothesis remains underexplored in prior works on unified VLMs. To address this gap, this paper systematically investigates the generalization across understanding and generation tasks in unified VLMs. Specifically, we design a dataset closely aligned with real-world scenarios to facilitate extensive experiments and quantitative evaluations. We evaluate multiple unified VLM architectures to validate our findings. Our key findings are as follows. First, unified VLMs trained with mixed data exhibit mutual benefits in…
Peer Reviews
Decision: Submitted to ICLR 2026
- Comparison in a fair setting: the authors build controlled datasets that contain both understanding and generation signals and then train unified vs. task-specific models on the same budget.
- Identifying visual-space alignment as the key driver, they break the alignment and show the gains drop.
- The constructed case where generation has a concept and understanding doesn’t, and the unified model still learns it, is a compelling demonstration of cross-task transfer.

- Prior works such as MetaMorph [1] have provided some evidence of mutual benefit transfer between understanding and generation tasks, limiting the novelty of the findings here.
- This work primarily uses smaller datasets built with rule-based text and attribute-style supervision (SmartWatch, templated CelebA) where concepts are annotated, which is not a realistic setting for VLMs.
- The training tasks are closely related on both the understanding and generation sides: The understanding side is basic

- The paper investigates in depth an important and timely topic in multimodal model design. It remains unclear whether specialized or unified VLMs are preferable, and understanding the trade-offs between these approaches is valuable.
- The experimental setup is systematic and controlled, using synthetic datasets that help isolate the factors influencing cross-task generalization.

- The paper is unclear about how the proposed unified VLMs actually work. Figure 1 doesn’t really help in understanding the architecture, and several components are insufficiently explained. In particular, it’s not clear what the 'generation vision adapter' does: if the LLM outputs image tokens directly, why do these need to be adapted before being fed back into the model? The generation process for the SigLIP-SigLIP and LLaVA settings is also confusing. For models with a VQ decoder, image gener

1. Good motivation with principled analysis: the paper provides controlled, reproducible experiments to validate the features about unified VLMs.
2. Good experimental design: two datasets with full controllability and detailed ablations (alignment, scaling, bias).
3. Clear practical relevance: In addition to empirical experiments, the paper also provides actionable guidelines (maintaining aligned latent spaces, balancing task ratios).

1. Limited scope of architectures: The paper excludes diffusion-based or hybrid models (e.g., Transfusion, Emu3). In particular, the generation performance of SigLIP-SigLIP and VQ-SigLIP is missing. In addition, Harmon [a] adopts the MAR encoder (different from SigLIP or VQ), which is neither discussed nor included in the experiments. This somewhat narrows the generality of the conclusions.
2. Synthetic bias: Although the synthetic datasets provide control, they are relatively simple compared to real-world m
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Balanced Selection
