Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models
Abinav Rao, Sujan Rachuri

TL;DR
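A systematic study of whether DPO can align both understanding and generation in unified multimodal models, using Janus-Pro at 1B and 7B parameters. Generation quality resists alignment in every tested condition; gradient analysis points to near-orthogonal understanding and generation gradients, a large magnitude imbalance, and VQ tokenization as a structural bottleneck.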
Contribution
First comprehensive analysis of DPO's effectiveness on unified multimodal models, revealing fundamental limitations and structural challenges in aligning understanding and generation capabilities.
Findings
DPO does not improve generation quality in any tested condition: no method improves generation CLIPScore at 7B, and all methods degrade generation at 1B.
Understanding and generation gradients are near-orthogonal (cos ~ 0), with a ~11-14x magnitude imbalance between them.
VQ tokenization is identified as a key structural bottleneck.
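The two gradient diagnostics named in these findings, cosine similarity and magnitude ratio, can be illustrated with a minimal sketch. This is not the authors' code: the function name `gradient_diagnostics`, the flattened-list representation of the gradients, the direction of the ratio (generation over understanding), and the toy values are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) norm of a gradient vector."""
    return math.sqrt(dot(a, a))


def gradient_diagnostics(g_und, g_gen):
    """Return (cosine similarity, norm ratio ||g_gen|| / ||g_und||).

    g_und and g_gen are hypothetical flattened gradients of the
    understanding and generation losses with respect to the same
    shared parameters.
    """
    n_und, n_gen = norm(g_und), norm(g_gen)
    if n_und == 0.0 or n_gen == 0.0:
        raise ValueError("cosine similarity is undefined for a zero gradient")
    return dot(g_und, g_gen) / (n_und * n_gen), n_gen / n_und


# Toy values only, not data from the paper: two perpendicular vectors
# whose norms differ by a factor of 12.
g_und = [1.0, 0.0, 0.0]
g_gen = [0.0, 12.0, 0.0]
cos, ratio = gradient_diagnostics(g_und, g_gen)
```

A cosine near zero means the two updates neither reinforce nor directly oppose each other, while a large norm ratio means one task's gradient dominates any unweighted sum of the two.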
Abstract
Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation…
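For readers unfamiliar with the objective being studied, the standard DPO loss for a single preference pair can be sketched as follows. This is the generic formulation, not the paper's training code: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions, and each log-probability is taken to be the sequence-level log-likelihood of a response under the policy or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) preference pair.

    Each argument is a sequence log-probability (a sum of per-token
    log-probs). beta=0.1 is an illustrative default, not a value
    taken from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen log-prob relative to the reference lowers the loss.
improved = dpo_loss(-3.0, -5.0, -5.0, -5.0)
```

Because each sequence log-probability is a sum over tokens, responses with many tokens contribute many per-token terms to the margin.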
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
