SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao

TL;DR
SPHINX is a multi-modal large language model that jointly mixes model weights, tuning tasks, and visual embeddings to improve multi-modal understanding and reasoning across diverse applications.
Contribution
The paper proposes a joint mixing strategy for model weights, tuning tasks, and visual embeddings, aimed at improving the robustness and multi-modal capabilities of large language models.
Findings
Superior multi-modal understanding across various tasks
Enhanced visual parsing and reasoning on benchmarks
Effective integration of diverse visual and task-specific information
Abstract
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over…
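The weight mix step described in the abstract can be illustrated with a minimal sketch. It assumes the mix is a per-parameter linear interpolation between two checkpoints that share the same architecture; the abstract only says the weights are "directly" integrated, so the interpolation form, the function name `mix_weights`, and the coefficient `beta` are illustrative assumptions rather than the authors' implementation. Plain dictionaries of float lists stand in for real checkpoints.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def mix_weights(real: Weights, synthetic: Weights, beta: float = 0.5) -> Weights:
    """Return beta * real + (1 - beta) * synthetic for every parameter.

    Assumption: both checkpoints come from the same architecture, so they
    share parameter names and sizes. `beta` is an illustrative mixing
    coefficient, not a value taken from the paper.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("both checkpoints must share the same parameter names")

    mixed: Weights = {}
    for name, real_values in real.items():
        syn_values = synthetic[name]
        if len(real_values) != len(syn_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two domains.
        mixed[name] = [
            beta * r + (1.0 - beta) * s for r, s in zip(real_values, syn_values)
        ]
    return mixed


if __name__ == "__main__":
    llm_real = {"layer.weight": [1.0, 3.0]}       # tuned on real-world data
    llm_synthetic = {"layer.weight": [3.0, 1.0]}  # tuned on synthetic data
    print(mix_weights(llm_real, llm_synthetic, beta=0.5))
```

With tensor-based checkpoints the same loop would run over the entries of each model's state dictionary. Because the mix happens once, offline, it adds no cost at inference time.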
Code & Models
- 🤗 HuggingFaceM4/idefics2-8b-base (model) · 1.6k dl · ♡ 28
- 🤗 HuggingFaceM4/idefics2-8b (model) · 157k dl · ♡ 620
- 🤗 HuggingFaceM4/idefics2-8b-chatty (model) · 70 dl · ♡ 95
- 🤗 Trelis/idefics2-8b-chatty-bf16 (model) · 8 dl · ♡ 1
- 🤗 huz-relay/idefics2-8b-ocr (model) · 8 dl · ♡ 1
- 🤗 peterpeter8585/ai2 (model) · 1 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Visual Parsing
