Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models

Yujun Tong; Dongliang Chang; Zijin Yin; Xintong Liu; Yuanchen Fang; Zhanyu Ma

arXiv:2605.15792 · cs.CV · May 18, 2026

TL;DR

This paper introduces Generation-to-Understanding (G2U) synergy in large multimodal models, where visual generation acts as an explicit intermediate reasoning step that improves understanding. The approach is evaluated on twelve benchmarks.

Contribution

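The paper proposes a G2U framework in which a model performs controlled visual generation and feeds the results back to itself to refine perception, without retraining or external tools. This addresses the asymmetry in unified multimodal models, where understanding routinely guides generation but generation rarely supports understanding.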

Findings

01. G2U consistently improves understanding across twelve benchmarks.

02. Generative fidelity influences perceptual gains.

03. Distinct edit prompts affect transfer efficiency.

Abstract

The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Recent works such as BAGEL and BLIP3o have achieved remarkable progress. In practice, however, this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve…
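The generate-then-understand loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `EDIT_PROMPTS`, `G2UResult`, `g2u_answer`, `generate` and `understand` are hypothetical, the prompt wording is invented for illustration, and the two callables are assumed to be backed by the same unified model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Placeholder image type; a real system would use tensors or PIL images.
Image = bytes

# Hypothetical edit prompts, one per generative act named in the abstract.
# The wording is an assumption made for this sketch.
EDIT_PROMPTS = {
    "detail_enhancement": "Enhance the fine details relevant to the question.",
    "context_expansion": "Expand the surrounding context of the scene.",
    "structural_visualisation": "Visualise the structure relevant to the question.",
}


@dataclass
class G2UResult:
    answer: str
    visual_thoughts: List[Image]


def g2u_answer(
    image: Image,
    question: str,
    generate: Callable[[Image, str], Image],
    understand: Callable[[Sequence[Image], str], str],
    acts: Sequence[str] = ("detail_enhancement",),
) -> G2UResult:
    """Answer a question using self-generated visual thoughts.

    `generate` and `understand` are assumed to wrap the generation and
    understanding modes of one unified model; no retraining is involved.
    """
    # Step 1: controlled generative acts produce intermediate visual thoughts.
    thoughts = [generate(image, EDIT_PROMPTS[act]) for act in acts]
    # Step 2: feed the original image plus the thoughts back for understanding.
    answer = understand([image, *thoughts], question)
    return G2UResult(answer=answer, visual_thoughts=thoughts)
```

In this sketch the choice of `acts` is left to the caller; how the actual framework selects and controls generative acts is not specified in the text shown here.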
