SOLO: A Single Transformer for Scalable Vision-Language Modeling

Yangyi Chen; Xingyao Wang; Hao Peng; Heng Ji

arXiv:2407.06438 · cs.CV · December 17, 2024 · 1 citation

TL;DR

SOLO introduces a unified single transformer architecture for scalable vision-language modeling, addressing key limitations of heterogeneous models, and provides an open-source training recipe for developing large-scale LVLMs with competitive performance.

Contribution

This paper presents the first open-source training recipe for SOLO, a single-transformer LVLM, enabling scalable and stable training of billion-scale models with moderate resources.

Findings

01. SOLO achieves performance comparable to LLaVA-v1.5-7B.

02. SOLO excels at visual mathematical reasoning.

03. The training recipe facilitates stable training of large models.

Abstract

We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous architectures that connect pre-trained visual encoders with large language models (LLMs) to facilitate visual recognition and complex reasoning. Although these models achieve remarkable performance with relatively lightweight training, we identify four primary scalability limitations: (1) The visual capacity is constrained by pre-trained visual encoders, which are typically an order of magnitude smaller than LLMs. (2) The heterogeneous architecture complicates the use of established hardware and software infrastructure. (3) The study of scaling laws on such architectures must consider three separate components (visual encoder, connector, and LLM), which complicates the analysis. (4) The use of existing visual encoders typically requires…

Code & Models

Repositories

yangyi-chen/solo (PyTorch, official)

Models

YangyiYY/SOLO-7B (Hugging Face model) · 37 downloads · 5 likes

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Multi-Head Attention · Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam