Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models
Yujun Tong, Dongliang Chang, Zijin Yin, Xintong Liu, Yuanchen Fang, Zhanyu Ma

TL;DR
This paper introduces Generation-to-Understanding (G2U) synergy in large multimodal models, where visual generation acts as an intermediate reasoning step to enhance understanding, validated across multiple benchmarks.
Contribution
It proposes a novel G2U framework in which controlled visual generation improves perception without retraining, addressing the one-directional asymmetry between understanding and generation in multimodal AI.
Findings
G2U consistently improves understanding across twelve benchmarks.
Generative fidelity influences perceptual gains.
Distinct edit prompts affect transfer efficiency.
Abstract
The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Although recent works such as BAGEL and BLIP3o have achieved remarkable progress, in practice this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve…
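
The generate-then-understand loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `g2u_answer`, the `generate` and `understand` callables, the `G2UResult` type, and the wording of the edit prompts are all hypothetical stand-ins for whatever interface a unified multimodal model exposes.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# Hypothetical edit prompts, one per generative act named in the abstract.
# The exact wording is an assumption made for illustration.
EDIT_PROMPTS = {
    "detail_enhancement": "Enhance the fine details relevant to the question.",
    "context_expansion": "Expand the surrounding context of the scene.",
    "structural_visualisation": "Visualise the underlying structure of the scene.",
}


@dataclass
class G2UResult:
    answer: str          # final answer produced by the understanding pass
    visual_thought: Any  # self-generated image fed back into the model
    act: str             # which generative act was used


def g2u_answer(
    image: Any,
    question: str,
    generate: Callable[[Any, str], Any],
    understand: Callable[[List[Any], str], str],
    act: str = "detail_enhancement",
) -> G2UResult:
    """Answer a question by first generating a visual thought.

    `generate` and `understand` are assumed to be two entry points of the
    same frozen model (no retraining, no external tools).
    """
    if act not in EDIT_PROMPTS:
        raise ValueError(f"unknown generative act: {act!r}")

    # Step 1: controlled generative act produces a visual thought.
    visual_thought = generate(image, EDIT_PROMPTS[act])

    # Step 2: the original image and the visual thought are fed back
    # into the model together to refine perception.
    answer = understand([image, visual_thought], question)
    return G2UResult(answer=answer, visual_thought=visual_thought, act=act)
```

Passing the two model entry points as callables keeps the sketch independent of any particular model API; in practice both would wrap the same unified model.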
