PairUni: Pairwise Training for Unified Multimodal Language Models

Jiani Zheng; Zhiyang Teng; Kunpeng Qiu; Xiangtai Li; Anran Wang; Yu Tian; Ye Tian; Haochen Wang; Zhuochen Wang

arXiv:2510.25682·cs.CL·February 10, 2026

3 Reviews

TL;DR

PairUni introduces a novel training framework for unified vision-language models that reorganizes data into understanding-generation pairs, improving performance and generalization across diverse architectures and tasks.

Contribution

It proposes a new data organization and a pair-aware optimization method, PairGRPO, to enhance learning in UVLMs by leveraging cross-task semantic correspondences.

Findings

01. Consistent performance improvements across multiple UVLM architectures.

02. Enhanced generalization to image editing tasks without task-specific data.

03. Effective handling of heterogeneous data and supervision in UVLMs.

Abstract

Unified Vision-Language Models (UVLMs) perform both understanding and generation within a single architecture. Since these models rely on heterogeneous data and supervision, balancing both generation and understanding in reinforcement learning (RL) is challenging. To address this challenge, we propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. Specifically, we construct a unified paired dataset by synthesizing aligned instances via cross-modal semantic completion and retrieving semantically related samples. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present PairGRPO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage,…
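The abstract is cut off before it gives the PairGRPO formula, so the following is only a minimal sketch of the stated idea: compute group-relative advantages as in GRPO, then modulate them with a per-pair similarity score. The function names, the multiplicative form of the weighting, and the assumption that similarity lies in [0, 1] are all illustrative choices, not details confirmed by the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def pair_weighted_advantages(rewards, similarity, eps=1e-8):
    """Scale group-relative advantages by a pair similarity score.

    ASSUMPTION: multiplicative scaling with similarity in [0, 1] is a
    hypothetical stand-in; the source only says the score "modulates"
    the advantage.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity is assumed to lie in [0, 1]")
    return [similarity * a for a in group_relative_advantages(rewards, eps)]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.5, 0.5]  # made-up rewards for one group of rollouts
    print(pair_weighted_advantages(rewards, similarity=0.9))
    print(pair_weighted_advantages(rewards, similarity=0.3))
```

Under this sketch, rollouts from a closely matched understanding-generation pair contribute larger policy updates than rollouts from a loosely related pair, which is one plausible reading of how a similarity score could modulate the advantage.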

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 3

Strengths

- This work addresses an important problem of task interference in unified models, stemming from data and objective heterogeneity, and proposes a plausible paired-data strategy to mitigate it.
- The framework demonstrates performance gains across a comprehensive suite of understanding and generation benchmarks, supported by several analyses.
- The approach validates its generalizability to some extent by demonstrating effectiveness beyond autoregressive transformers, showing positive results on

Weaknesses

- My main concern lies in the problem's setup and motivation. The methodology appears applicable primarily to a somewhat niche setting where understanding and generation tasks are handled by a shared architecture. The problem of task heterogeneity could be problematic in understanding-only or generation-only VLMs.
- The motivating link between gradient cosine similarity and performance (Figure 1) seems ambiguous rather than stark, raising doubts about whether task interference is the true bottle

Reviewer 02·Rating 6·Confidence 3

Strengths

1. The motivation of this paper is strong. Balancing understanding and generation in unified vision-language models is challenging and worth studying.
2. The proposed method is novel. The paper presents an approach composed of a data pairing pipeline and a PairGRPO method for unified optimization that minimizes interference between heterogeneous tasks.
3. The proposed method is effective. Through extensive experiments on WISE, GenEval, MMMU, MMStar, etc., the paper has shown that t

Weaknesses

1. The overall presentation needs improvement and polish. In particular, the figures are not well drawn and do not help readers understand the method.
2. Unified VLMs seem far worse than understanding-only or generation-only models. The proposed approach makes decent progress, but the gap is still significant.

Reviewer 03·Rating 4·Confidence 3

Strengths

1. The paper effectively identifies and articulates a critical and timely problem in the development of UVLMs: the optimization conflict between understanding and generation tasks during unified RL. The empirical analysis presented in Figure 1, showing the correlation between gradient cosine similarity and benchmark performance, provides a strong, data-driven motivation for the proposed approach.
2. The proposed PairUni framework is elegant and logically sound. Tackling the problem at both the da

Weaknesses

1. Limited Scope of Evaluation Benchmarks: The evaluation primarily focuses on standard VQA/reasoning and text-to-image generation tasks. However, a key capability of modern UVLMs is instruction-following image editing, which requires a tight integration of understanding (the instruction) and generation (the edit).
2. Insufficient Comparison with State-of-the-Art Baselines: The set of compared models, while including the relevant Janus-Pro, could be expanded to include more recent and powerful
