UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

Hao Tang; Chenwei Xie; Xiaoyi Bao; Tingyu Weng; Pandeng Li; Yun Zheng; Liwei Wang

arXiv:2507.23278·cs.CV·February 10, 2026



TL;DR

UniLIP is a unified framework that enhances CLIP with reconstruction, reasoning, and editing capabilities, enabling it to excel in multimodal understanding, generation, and editing with high fidelity and efficiency.

Contribution

The paper introduces a novel two-stage training scheme and dual-condition architecture that adapt CLIP for high-quality reconstruction, reasoning, and editing, surpassing larger models.

Findings

1. Achieves state-of-the-art scores on the GenEval, WISE, and ImgEdit benchmarks.
2. Outperforms larger models such as BAGEL and UniWorld-V1 with fewer parameters.
3. Demonstrates strong instruction following and editing fidelity.

Abstract

In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks reconstruction abilities required to be a unified visual encoder. However, previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to…
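The self-distillation idea in the abstract can be illustrated with a minimal sketch. It assumes the objective is a pixel reconstruction term plus a feature-matching term that keeps the fine-tuned encoder close to a frozen copy of the original CLIP encoder. The function names, the use of mean squared error for both terms, and the `distill_weight` parameter are hypothetical choices for illustration, not details taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_distill_objective(recon_pixels, target_pixels,
                           student_feats, teacher_feats,
                           distill_weight=1.0):
    """Hypothetical combined loss: reconstruction + feature self-distillation.

    recon_pixels / target_pixels: decoded image and ground-truth image (flattened).
    student_feats: features from the encoder being fine-tuned.
    teacher_feats: features from a frozen copy of the original encoder (assumed).
    Returns (total, reconstruction_term, distillation_term).
    """
    recon = mse(recon_pixels, target_pixels)      # pushes toward pixel fidelity
    distill = mse(student_feats, teacher_feats)   # penalizes drift from the original features
    return recon + distill_weight * distill, recon, distill


# Toy usage with made-up numbers.
total, recon, distill = self_distill_objective(
    recon_pixels=[0.0, 1.0], target_pixels=[0.0, 0.0],
    student_feats=[1.0, 1.0], teacher_feats=[1.0, 3.0],
    distill_weight=0.5,
)
print(total, recon, distill)
```

In this sketch a larger `distill_weight` favors preserving the original features over reconstruction quality. How the paper actually balances the two terms across its two training stages is not stated in the text above.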

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The two-stage training plus self-distillation scheme for CLIP is well-motivated and appears to strike a good balance between semantic preservation and detail fidelity.
2. UniLIP demonstrates strong empirical performance across popular benchmarks in image understanding, editing, and generation.

Weaknesses

1. UniLIP's training objectives are heavily reconstruction-centric and largely pixel-level, while self-distillation predominantly constrains the representation not to deviate from the original CLIP distribution. This setup may limit the model's ability to discover a better feature distribution for both understanding and generation. Why is self-distillation sufficient for preserving CLIP's understanding-centered semantics? Would a proxy task such as image classification work better than distillation?

Reviewer 02 · Rating 4 · Confidence 5

Strengths

The proposed pipeline extends several previous works and is simple yet effective. Finetuning CLIP through self-distillation makes sense and proves effective, whereas direct optimization for reconstruction harms understanding. The dual-condition design, which combines multimodal hidden states and query embeddings, makes the downstream diffusion transformer perform better than using either alone. Comparisons with state-of-the-art models are conducted, and the analyses are convincing.
1. Rigorous Ablation of the Two-Stage Train

Weaknesses

1. Insufficient Literature Review on Query Embeddings: The dual-condition architecture relies on query embeddings to connect the MLLM and the diffusion transformer, following precedents like MetaQuery (Pan et al., 2025) and BLIP3-o (Chen et al., 2025a). However, the earliest proposal was made by DreamLLM (Dong et al., ICLR 2024 Spotlight).
2. Lack of Explicit Architectural Comparison to Dual-Encoder Baselines: UniLIP successfully combines high-level semantics and low-level pixel details by adapt

Reviewer 03 · Rating 6 · Confidence 3

Strengths

This paper presents a training approach for CLIP that effectively balances semantic understanding and pixel-level detail, achieving SOTA performance in various multimodal tasks. The presentation is clear and well-structured, with comprehensive experiments and ablation studies demonstrating the effectiveness of the proposed method.

Weaknesses

The description of related work on unified multimodal models could be more detailed to better position the contributions of this work. The method requires finetuning the pixel decoder in both stages; the reason for this design choice is not well explained.


Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies