TL;DR
PairUni introduces a novel training framework for unified vision-language models that reorganizes data into understanding-generation pairs, improving performance and generalization across diverse architectures and tasks.
Contribution
It proposes a new data organization and a pair-aware optimization method, PairGRPO, to enhance learning in UVLMs by leveraging cross-task semantic correspondences.
Findings
Consistent performance improvements across multiple UVLM architectures.
Enhanced generalization to image editing tasks without task-specific data.
Effective handling of heterogeneous data and supervision in UVLMs.
Abstract
Unified Vision-Language Models (UVLMs) perform both understanding and generation within a single architecture. Because these models rely on heterogeneous data and supervision, balancing generation and understanding in reinforcement learning (RL) is challenging. To address this challenge, we propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. Specifically, we construct a unified paired dataset by synthesizing aligned instances via cross-modal semantic completion and by retrieving semantically related samples. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present PairGRPO, a pair-aware variant of Group Relative Policy Optimization (GRPO). It assigns a similarity score to each pair to modulate the advantage,…
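The abstract is cut off before it says how the similarity score modulates the advantage, so the following is only a minimal sketch under stated assumptions. It computes the usual GRPO-style group-relative advantage (each reward normalized by its group's mean and standard deviation) and then, as an assumption made purely for illustration, multiplies that advantage by a per-pair similarity score in [0, 1]. The function names `group_relative_advantages` and `pair_weighted_advantages`, the `eps` term, and the multiplicative form are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (GRPO-style).

    rewards: rewards of all responses sampled for one prompt.
    eps: small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def pair_weighted_advantages(rewards, similarity, eps=1e-6):
    """Scale group-relative advantages by a pair similarity score.

    ASSUMPTION: multiplicative scaling is illustrative only; the source
    does not state the exact modulation used by PairGRPO.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity is assumed to lie in [0, 1]")
    return [similarity * a for a in group_relative_advantages(rewards, eps)]


if __name__ == "__main__":
    group_rewards = [1.0, 0.0, 1.0, 0.0]
    # A closely aligned pair keeps most of its advantage signal...
    print(pair_weighted_advantages(group_rewards, similarity=0.9))
    # ...while a loosely related (e.g. retrieved) pair contributes less.
    print(pair_weighted_advantages(group_rewards, similarity=0.3))
```

Under this assumed form, a pair with a low similarity score contributes a proportionally smaller policy-gradient signal, which is one plausible reading of "modulate the advantage".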
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- This work addresses an important problem of task interference in unified models, stemming from data and objective heterogeneity, and proposes a plausible paired-data strategy to mitigate it.
- The framework demonstrates performance gains across a comprehensive suite of understanding and generation benchmarks, supported by several analyses.
- The approach validates its generalizability to some extent by demonstrating effectiveness beyond autoregressive transformers, showing positive results on
- My main concern lies in the problem's setup and motivation. The methodology appears applicable primarily to a somewhat niche setting where understanding and generation tasks are handled by a shared architecture. The problem of task heterogeneity could be problematic in understanding-only or generation-only VLMs.
- The motivating link between gradient cosine similarity and performance (Figure 1) seems ambiguous rather than stark, raising doubts about whether task interference is the true bottle
1. The motivation of this paper is strong. Balancing understanding and generation in unified vision-language models is challenging and worth studying.
2. The proposed method is novel. The paper presents an approach composed of a data pairing pipeline and a pair-aware GRPO method for unified optimization that minimizes interference between heterogeneous tasks.
3. The proposed method is effective. Through extensive experiments on WISE, GenEval, MMMU, MMStar, etc., the paper has shown that t
1. The overall presentation needs improvement and polish. In particular, the figures are not well drawn and do not help readers understand the method well enough.
2. Unified VLMs seem far worse than understanding-only or generation-only models. The proposed approach makes decent progress, but the gap is still significant.
1. The paper effectively identifies and articulates a critical and timely problem in the development of UVLMs: the optimization conflict between understanding and generation tasks during unified RL. The empirical analysis presented in Figure 1, showing the correlation between gradient cosine similarity and benchmark performance, provides a strong, data-driven motivation for the proposed approach.
2. The proposed PairUni framework is elegant and logically sound. Tackling the problem at both the da
1. Limited Scope of Evaluation Benchmarks: The evaluation primarily focuses on standard VQA/reasoning and text-to-image generation tasks. However, a key capability of modern UVLMs is instruction-following image editing, which requires a tight integration of understanding (the instruction) and generation (the edit).
2. Insufficient Comparison with State-of-the-Art Baselines: The set of compared models, while including the relevant Janus-Pro, could be expanded to include more recent and powerful
