Semantic Generative Tuning for Unified Multimodal Models

Songsong Yu; Yuxin Chen; Ying Shan; Yanwei Li

arXiv:2605.18714 · cs.CV · May 19, 2026

TL;DR

This paper introduces Semantic Generative Tuning (SGT), which uses image segmentation as a generative proxy task to align visual understanding and generation in unified multimodal models, improving both perception and generative layout fidelity.

Contribution

The work presents the first systematic investigation of generative post-training for unified multimodal models, formulating hierarchical visual tasks, segmentation in particular, as generative proxies that bridge visual understanding and generation.

Findings

01. Segmentation tasks serve as effective proxies for high-level semantic understanding.

02. SGT improves both perception and generative layout fidelity.

03. Extensive evaluations show consistent performance gains across benchmarks.

Abstract

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric…
