Visual Bridge: Universal Visual Perception Representations Generating

Yilin Gao; Shuguang Dou; Junzhou Li; Zhiheng Yu; Yin Li; Dongsheng Jiang; Shugong Xu

arXiv:2511.07877·cs.CV·November 12, 2025

TL;DR

This paper introduces a universal visual perception framework based on flow matching that can generate diverse representations across multiple tasks, improving generalization and scalability in computer vision.

Contribution

Motivated by the cross-domain generalization of large language models, it proposes a flow-matching approach that uses a universal velocity field to unify multiple vision tasks within a single model.

Findings

1. Achieves competitive performance in classification, detection, segmentation, depth estimation, and image-text retrieval.

2. Outperforms prior generalist and some specialist models in zero-shot and fine-tuned settings.

3. Demonstrates robustness, scalability, and strong generalization capabilities.

Abstract

Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our…
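
The abstract frames perception as transporting image patch tokens to task-specific representations along a learned velocity field. As a rough illustration of that idea only, the sketch below shows generic flow matching on toy vectors: a straight-line path between a source and a target, the constant velocity that path implies, a per-sample regression loss, and Euler integration from source to target. The straight-line path, the 4-dimensional toy vectors, the `oracle_velocity` stand-in, and every function name are assumptions made for this example; the paper's actual network, task embedding, and training objective are not described in the text above.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_integrate(velocity_fn, x0, steps=10):
    """Transport x0 along the velocity field from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Hypothetical toy data: one 4-dim "patch token" and one 4-dim "task representation".
source_token = [0.5, -1.0, 2.0, 0.0]
task_target = [1.0, 0.0, -1.0, 3.0]


def oracle_velocity(x, t):
    """Exact velocity for this single pair; a trained network would replace it."""
    return target_velocity(source_token, task_target)


loss = flow_matching_loss(oracle_velocity, source_token, task_target, t=random.random())
result = euler_integrate(oracle_velocity, source_token, steps=10)
```

With the oracle field the loss is zero and the integrated `result` lands on `task_target` up to floating-point error. In a trained system, a network conditioned on the task would presumably take the place of `oracle_velocity`, and the loss would be averaged over many sampled pairs and times.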

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning