Vision Bridge Transformer at Scale

Zhenxiong Tan; Zeqing Wang; Xingyi Yang; Songhua Liu; Xinchao Wang

arXiv:2511.23199 · cs.CV · December 1, 2025


TL;DR

This paper presents the Vision Bridge Transformer (ViBT), a large-scale model that directly learns data-to-data translation for image and video tasks, demonstrating scalable and efficient conditional generation.

Contribution

It introduces a scalable Transformer-based Bridge Model, trained with a variance-stabilized velocity-matching objective intended to make training robust at scale for image and video translation.

Findings

01. Effective at 20B and 1.3B parameters for image and video translation

02. Supports instruction-based image editing and complex video translation

03. Outperforms traditional diffusion models in data-to-data tasks

Abstract

We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
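To make the data-to-data idea concrete, the following is a minimal sketch of a generic Brownian bridge between a source sample and a target sample, together with the textbook drift (velocity) that pulls an intermediate state toward the target. It is an illustration under stated assumptions, not the paper's method: the function names, the `sigma` parameter, and the NumPy setting are all hypothetical, and the abstract does not spell out the variance-stabilized form of the objective, so only the plain, unstabilized target is shown.

```python
import numpy as np


def sample_bridge_state(x0, x1, t, sigma=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Assumption: standard bridge marginal,
    x_t = (1 - t) * x0 + t * x1 + sigma * sqrt(t * (1 - t)) * eps.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x0.shape)
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1.0 - t))  # zero at both endpoints
    return mean + std * eps


def bridge_velocity_target(x_t, x1, t):
    """Textbook bridge drift toward the target: (x1 - x_t) / (1 - t).

    This is the plain target, not the variance-stabilized one from the paper.
    """
    if t >= 1.0:
        raise ValueError("t must be strictly less than 1")
    return (x1 - x_t) / (1.0 - t)


# Usage: a toy "source image" and "target image" as small arrays.
rng = np.random.default_rng(0)
source = np.zeros((2, 3))
target = np.ones((2, 3))
x_t = sample_bridge_state(source, target, t=0.3, sigma=0.5, rng=rng)
v = bridge_velocity_target(x_t, target, t=0.3)
```

A network in this setting would take `x_t` and `t` as input and be regressed toward `v`. Note that the `1 / (1 - t)` factor grows without bound as `t` approaches 1, so the raw target becomes large near the end of the trajectory.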

Code & Models

Models

Yuanshi/ViBT (Hugging Face model · 213 downloads · 19 likes)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Ferroelectric and Negative Capacitance Devices · Multimodal Machine Learning Applications