UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

Teng Li; Quanfeng Lu; Lirui Zhao; Hao Li; Xizhou Zhu; Yu Qiao; Jun Zhang; Wenqi Shao

arXiv:2506.17202·cs.CV·June 23, 2025

TL;DR

UniFork introduces a Y-shaped architecture for unified multimodal understanding and generation: shallow layers are shared across tasks, while deeper layers split into task-specific branches so that understanding and generation can each follow their own modality alignment pattern.

Contribution

The paper proposes UniFork, a novel Y-shaped model that addresses modality alignment conflicts in unified multimodal models by sharing the shallow layers across tasks and using task-specific branches for the deep layers.
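As a rough illustration of the Y-shaped layout described above, the following Python sketch routes an input through a shared trunk and then through one of two task-specific branches. Everything in it is hypothetical: `make_layer`, `YShapedModel`, and the toy affine layers stand in for Transformer blocks and are not taken from the UniFork code, and the number of shared and branch layers is an arbitrary choice.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for a Transformer block: an elementwise affine map."""
    def layer(x: Vector) -> Vector:
        return [scale * v + shift for v in x]
    return layer


class YShapedModel:
    """Shared shallow trunk followed by two task-specific deep branches."""

    def __init__(
        self,
        shared: Sequence[Layer],
        understanding_branch: Sequence[Layer],
        generation_branch: Sequence[Layer],
    ) -> None:
        self.shared: List[Layer] = list(shared)
        self.branches: Dict[str, List[Layer]] = {
            "understanding": list(understanding_branch),
            "generation": list(generation_branch),
        }

    def forward(self, x: Vector, task: str) -> Vector:
        if task not in self.branches:
            raise ValueError(f"unknown task: {task!r}")
        # Both tasks pass through the same shallow layers.
        for layer in self.shared:
            x = layer(x)
        # Only the branch that matches the task sees the deep layers.
        for layer in self.branches[task]:
            x = layer(x)
        return x


if __name__ == "__main__":
    model = YShapedModel(
        shared=[make_layer(2.0, 0.0)],
        understanding_branch=[make_layer(1.0, 1.0)],
        generation_branch=[make_layer(1.0, -1.0)],
    )
    print(model.forward([1.0, 2.0], "understanding"))  # [3.0, 5.0]
    print(model.forward([1.0, 2.0], "generation"))     # [1.0, 3.0]
```

The point of the sketch is only the routing: the trunk parameters receive signal from both tasks, while each branch is free to diverge in its deep layers.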

Findings

01. UniFork outperforms fully shared Transformer models.

02. It achieves comparable or better performance than task-specific models.

03. Analysis reveals different modality alignment patterns for understanding and generation.

Abstract

Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from a progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension. In contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform…
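To make the idea of "modality alignment across network depth" concrete, here is a minimal Python sketch that scores each layer by how similar its text and image features are. The metric is an assumption: this excerpt does not say how the paper measures alignment, so the sketch substitutes cosine similarity between mean-pooled text and image features. The function names (`cosine`, `mean_pool`, `layerwise_alignment`, `peak_layer`) and the toy inputs are hypothetical and hand-made; they are not results from the paper.

```python
import math
from typing import List, Sequence

Tokens = Sequence[Sequence[float]]  # one layer's features: tokens x hidden dim


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pool(tokens: Tokens) -> List[float]:
    """Average token features into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def layerwise_alignment(
    text_layers: Sequence[Tokens], image_layers: Sequence[Tokens]
) -> List[float]:
    """Assumed alignment score per layer: cosine of mean-pooled features."""
    return [
        cosine(mean_pool(t), mean_pool(i))
        for t, i in zip(text_layers, image_layers)
    ]


def peak_layer(scores: Sequence[float]) -> int:
    """Index of the layer with the highest alignment score."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy three-layer inputs with a single token per modality.
    text = [[[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]]]
    rising = [[[0.0, 1.0]], [[1.0, 1.0]], [[1.0, 0.0]]]
    rise_then_fall = [[[0.0, 1.0]], [[1.0, 0.0]], [[1.0, 1.0]]]
    print(peak_layer(layerwise_alignment(text, rising)))          # 2
    print(peak_layer(layerwise_alignment(text, rise_then_fall)))  # 1
```

In this toy setup, a peak at the last layer corresponds to the steadily increasing trend the abstract attributes to understanding, and a peak at an intermediate layer corresponds to the rise-then-fall trend it attributes to generation.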

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

- The paper is mostly well written and structured. The core idea (the conflict in alignment patterns and the Y-shaped solution) is easy to follow.
- The alignment pattern analysis is insightful and seems new.
- On the chosen evaluation benchmarks, UniFork shows stronger performance than the ablated variants and prior models.

Weaknesses

- Limited understanding benchmarks. The understanding benchmarks tested in this paper are quite outdated (e.g., VQAv2 and GQA). For visual perception, consider including benchmarks like MMBench, BLINK, CVBench, MM-VET, MMVP.
- Limited baseline comparison. Although Janus Pro and Bagel are cited, they are not compared against UniFork.
- The paper fixes the split point between shared and task-specific layers, but doesn’t show what happens when you change how many layers are shared (e.g., early sp

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. This paper innovatively analyzes unified generative large models from a modality alignment perspective. Based on this analysis, the proposed Y-shape structure is experimentally tested and demonstrates good performance.
2. The paper's presentation is good; the figures clearly convey the conclusions and experimental results of the paper.

Weaknesses

1. The scale of the experiments is insufficient. The experimental model in the paper has only 0.5B to 0.76B parameters, which is too small compared to other existing unified understanding-generation models. I personally believe that it is necessary to further expand the model scale and verify the effectiveness of the method with a larger number of parameters.
2. The other methods compared in the paper are somewhat outdated. For example, widely accepted and published unified large model papers such as sh

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. Insightful analysis of modality alignment patterns: The paper provides a systematic investigation into the alignment dynamics between text and image representations across different architectures and tasks (understanding vs. generation). By empirically identifying distinct alignment patterns and connecting them to architectural design choices, the work offers valuable conceptual insights for building more principled and efficient unified multimodal models.
2. Well-documented training and impl

Weaknesses

1. Outdated and weak baselines: The compared baselines (e.g., MobileVLM, Emu, LaVIT, LDM, LWM) are mostly early-generation unified or multimodal models whose performance and architecture are now considerably behind state-of-the-art models such as Emu3, Bagel, or Janus-Pro. Even when compared to these relatively weak baselines, UniFork shows only marginal or inconsistent improvements across several benchmarks. This weakens the empirical strength of the claimed performance gains and raises questio

Code & Models

Repositories

tliby/unifork · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Methods: Dropout · Dense Connections · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Transformer