TL;DR
UniGame introduces a self-adversarial post-training method for unified multimodal models, significantly enhancing their understanding, generation, and robustness by actively challenging their own representations.
Contribution
It presents a lightweight, architecture-agnostic framework that improves UMMs through adversarial self-play. The framework adds less than 1% extra parameters and is compatible with existing methods.
Findings
Improves consistency by +4.6% on GenEval
Enhances understanding by +3.6% and generation by +0.02
Boosts robustness by +4.8% and +6.2% on NaturalBench and AdVQA
Abstract
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also…
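The "model as its own adversary" idea in the abstract can be read as a min-max game: a perturber nudges a shared representation to raise the understanding loss, and the model is then trained on the perturbed input. The following is only a toy sketch of that pattern, not UniGame's implementation. A two-dimensional logistic classifier stands in for the understanding branch, and a bounded sign-gradient step (FGSM-style) stands in for the learned perturber. All names (`perturb`, `train`, `TOY_DATA`, `eps`) are hypothetical.

```python
import math
import random

# Toy "token embeddings" with labels in {-1, +1}; purely illustrative data.
TOY_DATA = [([1.0, 2.0], 1), ([2.0, 1.0], 1),
            ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(v):
    return (v > 0) - (v < 0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    # Logistic loss of a linear "understanding head" on embedding x.
    return math.log(1.0 + math.exp(-y * dot(w, x)))

def perturb(w, x, y, eps):
    # Adversary step (assumption: FGSM-style stand-in for a learned perturber).
    # The loss gradient w.r.t. x points along -y * w, so stepping by
    # eps * sign(gradient) raises the loss within an L-infinity ball of radius eps.
    return [xi - eps * y * sign(wi) for wi, xi in zip(w, x)]

def train(data, eps=0.1, lr=0.5, steps=200, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(steps):
        x, y = rng.choice(data)
        x_adv = perturb(w, x, y, eps)               # max step: challenge the model
        g = -y * sigmoid(-y * dot(w, x_adv))        # d loss / d (w . x_adv)
        w = [wi - lr * g * xi for wi, xi in zip(w, x_adv)]  # min step: harden it
    return w

if __name__ == "__main__":
    w = train(TOY_DATA)
    for x, y in TOY_DATA:
        x_adv = perturb(w, x, y, 0.1)
        print(f"clean loss {loss(w, x, y):.4f}  perturbed loss {loss(w, x_adv, y):.4f}")
```

In this sketch the perturbation budget `eps` controls how hard the adversary can push. The abstract describes the perturber as acting at the shared token interface, which this toy collapses to a single input vector.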
