SOLO: A Single Transformer for Scalable Vision-Language Modeling
Yangyi Chen, Xingyao Wang, Hao Peng, Heng Ji

TL;DR
SOLO introduces a unified single transformer architecture for scalable vision-language modeling, addressing key limitations of heterogeneous models, and provides an open-source training recipe for developing large-scale LVLMs with competitive performance.
Contribution
This paper presents the first open-source training recipe for SOLO, a single-transformer LVLM, enabling scalable and stable training of billion-scale models with moderate resources.
Findings
SOLO achieves performance comparable to LLaVA-v1.5-7B.
It excels in visual mathematical reasoning.
The training recipe facilitates stable training of large models.
Abstract
We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous architectures that connect pre-trained visual encoders with large language models (LLMs) to facilitate visual recognition and complex reasoning. Although these models achieve remarkable performance with relatively lightweight training, we identify four primary scalability limitations: (1) The visual capacity is constrained by pre-trained visual encoders, which are typically an order of magnitude smaller than LLMs. (2) The heterogeneous architecture complicates the use of established hardware and software infrastructure. (3) Study of scaling laws on such architectures must consider three separate components (visual encoder, connector, and LLM), which complicates the analysis. (4) The use of existing visual encoders typically requires…
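The abstract excerpt does not say how a single transformer consumes images. One common approach, assumed here purely for illustration, is to cut the image into fixed-size patches, map each patch to the model width with a single linear layer, and place the result in the same sequence as the text token embeddings. The sketch below shows only that input construction. The patch size, model width, vocabulary size, and the names `patchify` and `build_sequence` are hypothetical and not taken from the paper.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Reorder to (rows, cols, patch, patch, C), then flatten each patch.
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def build_sequence(image, token_ids, w_patch, tok_table, patch):
    """Project patches and look up text tokens into one shared sequence."""
    vis = patchify(image, patch) @ w_patch      # (num_patches, d_model)
    txt = tok_table[token_ids]                  # (num_tokens, d_model)
    # Assumption: image embeddings simply precede the text embeddings.
    return np.concatenate([vis, txt], axis=0)

rng = np.random.default_rng(0)
d_model, patch, vocab = 64, 16, 1000            # hypothetical sizes
image = rng.random((64, 48, 3))                 # 4 x 3 grid of 16-pixel patches
token_ids = np.array([5, 17, 42])
w_patch = rng.normal(size=(patch * patch * 3, d_model))
tok_table = rng.normal(size=(vocab, d_model))

seq = build_sequence(image, token_ids, w_patch, tok_table, patch)
print(seq.shape)  # (15, 64): 12 patch embeddings followed by 3 token embeddings
```

In this sketch there is no separate visual encoder or connector to size and train: the only image-specific parameters are those of the one projection matrix, and everything downstream would be a single transformer stack.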
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Multi-Head Attention · Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam
