SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Ziyi Lin; Chris Liu; Renrui Zhang; Peng Gao; Longtian Qiu; Han Xiao; Han Qiu; Chen Lin; Wenqi Shao; Keqin Chen; Jiaming Han; Siyuan Huang; Yichi Zhang; Xuming He; Hongsheng Li; Yu Qiao

arXiv:2311.07575·cs.CV·November 14, 2023·24 cites

TL;DR

SPHINX is a multi-modal large language model that jointly mixes model weights, tuning tasks, and visual embeddings to improve multi-modal understanding and reasoning.

Contribution

The paper proposes a joint mixing strategy for model weights, tuning tasks, and visual embeddings, aimed at improving the robustness and multi-modal capabilities of large language models.

Findings

01 Superior multi-modal understanding across various tasks

02 Enhanced visual parsing and reasoning on benchmarks

03 Effective integration of diverse visual and task-specific information

Abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over…
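The weight mix step described in the abstract can be illustrated with a minimal sketch. The abstract only says that the weights from the two domains are integrated directly, so the linear interpolation below, the coefficient name `beta`, the function name `mix_weights`, and the use of plain Python lists in place of real tensors are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
# In a real framework these values would be tensors, not Python lists.
StateDict = Dict[str, List[float]]


def mix_weights(real: StateDict, synthetic: StateDict, beta: float = 0.5) -> StateDict:
    """Linearly interpolate two checkpoints that share one architecture.

    Assumed rule (not confirmed by the abstract):
        mixed = beta * real + (1 - beta) * synthetic
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must have identical parameter names")

    mixed: StateDict = {}
    for name, w_real in real.items():
        w_syn = synthetic[name]
        if len(w_real) != len(w_syn):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        mixed[name] = [beta * a + (1.0 - beta) * b for a, b in zip(w_real, w_syn)]
    return mixed


if __name__ == "__main__":
    llm_real = {"layer.weight": [1.0, 3.0]}       # tuned on real-world data
    llm_synthetic = {"layer.weight": [3.0, 1.0]}  # tuned on synthetic data
    print(mix_weights(llm_real, llm_synthetic, beta=0.5))
```

Under this assumption, `beta = 1.0` recovers the real-world checkpoint and `beta = 0.0` the synthetic one; intermediate values trade off the two domains without any additional training.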

Code & Models

Repositories

alpha-vllm/llama2-accessory (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Visual Parsing