Modeling Cross-vision Synergy for Unified Large Vision Model

Shengqiong Wu; Lanhu Wu; Mingyang Bao; Wenhao Xu; Hanwang Zhang; Shuicheng Yan; Hao Fei; Tat-Seng Chua

arXiv:2603.03564 · cs.CV · March 5, 2026


Open Access

TL;DR

PolyV is a unified large vision model that combines a sparse Mixture-of-Experts architecture with synergy-aware training to enable cross-vision synergy, improving reasoning across images, videos, and 3D data relative to existing models.

Contribution

The paper introduces PolyV, a novel unified LVM with a sparse Mixture-of-Experts architecture and a synergy-aware training paradigm for cross-vision reasoning.

Findings

01 PolyV outperforms existing models by over 10% on 10 benchmarks.

02 The architecture enables modality-specific expertise with bidirectional interaction.

03 Synergy-aware training enhances reasoning across visual modalities.

Abstract

Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and…
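To make the architectural idea concrete, the following is a minimal sketch of generic top-k sparse Mixture-of-Experts routing, the family of techniques the abstract refers to. It is not PolyV's actual router: the abstract does not specify the routing rule, the number of experts, or the expert design, so the names `SparseMoELayer` and `top_k_route`, the linear experts, and the choice of `k=2` are all illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]


def linear(weights, x):
    """Matrix-vector product; each row of `weights` yields one output value."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class SparseMoELayer:
    """Hypothetical layer: a linear router over linear experts (assumption, not from the paper)."""

    def __init__(self, dim, num_experts, k, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.k = k
        self.router = rand_matrix(num_experts, dim)
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def forward(self, token):
        # Score every expert, but evaluate only the k selected ones (the "sparse" part).
        logits = linear(self.router, token)
        chosen, weights = top_k_route(logits, self.k)
        out = [0.0] * len(token)
        for idx, w in zip(chosen, weights):
            expert_out = linear(self.experts[idx], token)
            out = [o + w * e for o, e in zip(out, expert_out)]
        return out, chosen, weights


# Usage: route one 4-dimensional token through 2 of 3 experts.
layer = SparseMoELayer(dim=4, num_experts=3, k=2)
out, chosen, weights = layer.forward([0.5, -1.0, 0.25, 2.0])
```

In a modality-aware variant, the router input would presumably also carry information about whether a token comes from an image, a video, or 3D data, so that experts can specialize by modality; how PolyV does this is not stated in the abstract.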


Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications