TL;DR
This paper introduces Semantic Generative Tuning (SGT), a generative post-training approach that uses image segmentation as a generative proxy to align visual understanding and visual generation in unified multimodal models (UMMs), improving both perception and generative layout fidelity.
Contribution
The work presents the first systematic investigation into generative post-training for UMMs, formulating hierarchical visual tasks, especially segmentation, as generative proxies that bridge the gap between understanding and generation.
Findings
High-level semantic tasks, particularly image segmentation, are the most effective generative proxies, whereas low-level tasks distract models with texture details.
SGT improves both perception and generative layout fidelity.
Extensive evaluations show consistent performance gains across benchmarks.
Abstract
Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric…
