Vision Bridge Transformer at Scale
Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang

TL;DR
This paper presents the Vision Bridge Transformer (ViBT), a large-scale model that directly learns data-to-data translation for image and video tasks, demonstrating scalable and efficient conditional generation.
Contribution
It introduces a scalable Transformer-based Bridge Model, trained with a variance-stabilized velocity-matching objective that keeps training robust at large scale for image and video translation.
Findings
Effective at 20B and 1.3B parameters for image/video translation
Supports instruction-based image editing and complex video translation
Outperforms traditional diffusion models in data-to-data tasks
Abstract
We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
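To make the data-to-data idea concrete, here is a minimal sketch of a standard Brownian bridge between a source sample and a target sample, together with a plain velocity-matching loss. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `sigma` parameter, and the use of flat Python lists in place of image or video latents are all hypothetical, and the variance-stabilized form of the objective is not specified in the abstract, so it is not reproduced here.

```python
import math
import random


def sample_bridge_state(x0, x1, t, sigma=1.0, rng=random):
    """Sample x_t on a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Assumption: the textbook bridge marginal
    x_t = (1 - t) * x0 + t * x1 + sigma * sqrt(t * (1 - t)) * eps.
    """
    scale = sigma * math.sqrt(t * (1.0 - t))
    return [(1.0 - t) * a + t * b + scale * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, x1)]


def velocity_target(x_t, x1, t):
    """Velocity that carries x_t to the target x1 over the remaining time 1 - t."""
    return [(b - c) / (1.0 - t) for b, c in zip(x1, x_t)]


def velocity_matching_loss(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


# Toy usage: the "source" and "target" stand in for paired input/output data.
source = [0.0, 0.0]
target = [2.0, 4.0]
t = 0.5
x_t = sample_bridge_state(source, target, t, sigma=0.1)
u = velocity_target(x_t, target, t)
# A real model would predict the velocity from (x_t, t); zeros are a placeholder.
loss = velocity_matching_loss([0.0, 0.0], u)
```

In this plain form the target equals `(x1 - x0) - sigma * sqrt(t / (1 - t)) * eps`, so its variance grows without bound as `t` approaches 1. That is one plausible reason a stabilized objective helps at scale, though the abstract does not describe the exact correction the authors use.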
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Ferroelectric and Negative Capacitance Devices · Multimodal Machine Learning Applications
