OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Letian Zhang; Sucheng Ren; Yanqing Liu; Xianhang Li; Zeyu Wang; Yuyin Zhou; Huaxiu Yao; Zeyu Zheng; Weili Nie; Guilin Liu; Zhiding Yu; Cihang Xie

arXiv:2601.15369 · eess.IV · March 16, 2026

TL;DR

OpenVision 3 is a unified visual encoder that supports both image understanding and image generation by jointly optimizing reconstruction and semantic objectives in a shared latent space. It surpasses a standard CLIP-based encoder on generation quality (gFID: 1.87 vs. 2.54) while remaining comparable on multimodal understanding.

Contribution

The paper presents a unified visual encoder that serves both image understanding and image generation with a single model: a ViT encoder operates on VAE-compressed image latents and is trained jointly with reconstruction, contrastive, and image-captioning objectives.

Findings

01 Surpasses standard CLIP-based encoders in image generation quality (gFID: 1.87 vs. 2.54).

02 Performs comparably to standard CLIP encoders on multimodal understanding tasks.

03 Empirically demonstrates mutual benefits between generation and understanding in a unified architecture.

Abstract

This paper presents a family of advanced vision encoders, named OpenVision 3, that learn a single, unified visual representation serving both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the…
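The joint objective described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything specific here is an assumption rather than the authors' implementation: the abstract does not state the loss forms or weights, so the sketch uses mean squared error for reconstruction, a CLIP-style symmetric InfoNCE for the contrastive term, token-level negative log-likelihood for captioning, and hypothetical weights `w_rec`, `w_con`, and `w_cap`. All function names are illustrative.

```python
import math

def mse(recon, target):
    """Reconstruction term (assumed MSE) between two flat, equal-length lists."""
    return sum((r - t) ** 2 for r, t in zip(recon, target)) / len(recon)

def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def _diag_nll(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Assumed CLIP-style symmetric InfoNCE over paired image/text embeddings."""
    img = [_normalize(v) for v in img_embs]
    txt = [_normalize(v) for v in txt_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diag_nll(logits) + _diag_nll(logits_t))

def caption_loss(token_log_probs):
    """Mean negative log-likelihood of the reference caption tokens."""
    return -sum(token_log_probs) / len(token_log_probs)

def joint_loss(recon, target, img_embs, txt_embs, token_log_probs,
               w_rec=1.0, w_con=1.0, w_cap=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_rec * mse(recon, target)
            + w_con * contrastive_loss(img_embs, txt_embs)
            + w_cap * caption_loss(token_log_probs))
```

In this sketch, `recon` stands in for the decoder output and `img_embs` for pooled encoder features; in a real training loop both would be produced from the same encoder representation, which is what ties the generative and semantic signals together.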

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning