UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

Fengjiao Chen; Minhao Jing; Weitao Lu; Yan Feng; Xiaoyu Li; Xuezhi Cao

arXiv:2512.23512 · cs.CL · January 1, 2026

TL;DR

This paper investigates whether large-scale vision-language models can enhance understanding through generation, finding that semantic-level generation improves understanding and reveals better data scaling, while pixel-level objectives may hinder performance.

Contribution

The study introduces UniHetero, a concise unified model, and uses it to show that semantic-level generation at large scale improves both understanding and data utilization, and that autoregression on input embeddings effectively captures visual details.

Findings

01. Semantic generation improves understanding at large scale.

02. Pixel-level objectives can degrade understanding performance.

03. Autoregression on input embeddings captures visual details effectively.
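
As a rough illustration of what "autoregression on input embeddings" could look like, the sketch below scores a model's prediction at each position against the next visual input embedding. This page does not state the loss the paper uses, so the cosine-distance objective, the function names `cosine_similarity` and `next_embedding_loss`, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def next_embedding_loss(predicted, inputs):
    """Hypothetical semantic-level autoregressive loss.

    predicted[t] is the model's guess for inputs[t + 1]; the last
    prediction has no target and is ignored. The loss is the mean
    cosine distance (an assumed choice, not taken from the paper).
    """
    if len(inputs) < 2:
        raise ValueError("need at least two input embeddings")
    targets = inputs[1:]                 # shift by one position
    preds = predicted[:len(targets)]     # drop the unmatched final prediction
    distances = [1.0 - cosine_similarity(p, t) for p, t in zip(preds, targets)]
    return sum(distances) / len(distances)


# Toy visual input embeddings (made up for the example).
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Predictions pointing in the same direction as the next embedding.
aligned = [[0.0, 2.0], [3.0, 3.0], [0.5, 0.5]]

# Predictions orthogonal to the next embedding.
orthogonal = [[1.0, 0.0], [1.0, -1.0], [0.5, 0.5]]

loss_aligned = next_embedding_loss(aligned, inputs)
loss_orthogonal = next_embedding_loss(orthogonal, inputs)
```

Here the targets are the input embeddings themselves rather than pixels, which is the sense in which the objective stays at the semantic level; a pixel-level objective such as a diffusion loss would instead be computed on the decoded image.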

Abstract

Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding remains under-explored at large data scale. In this work, we analyze the unified structure with a concise model, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if you generate semantics, not pixels. A common assumption in unified vision-language models is that adding generation will naturally strengthen understanding. However, this is not always true at scale. At 200M+ pretraining samples, generation helps understanding only when it operates at the semantic level, i.e., when the model learns to autoregress high-level visual representations inside the LLM. Once pixel-level objectives (e.g., diffusion losses) directly interfere with…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis