Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks
Xinsong Zhang, Yan Zeng, Jipeng Zhang, Hang Li

TL;DR
This paper introduces X-FM, a single foundation model built from a language encoder, a vision encoder, and a fusion encoder, which aims to perform best on language, vision, and vision-language understanding tasks at the same time.
Contribution
The paper proposes a general foundation model together with a training method for learning it from text, image, and image-text pair data. One of the method's two new techniques stops gradients from vision-language training when learning the language encoder.
Findings
X-FM outperforms existing general foundation models.
X-FM matches or exceeds specialized models in various tasks.
The model demonstrates strong cross-modal understanding capabilities.
Abstract
Foundation models or pre-trained models have substantially improved the performance of various language, vision, and vision-language understanding tasks. However, existing foundation models perform best on only one type of task, namely language, vision, or vision-language. It is still an open question whether it is possible to construct a foundation model that performs best on all of these understanding tasks, which we call a general foundation model. In this paper, we propose a new general foundation model, X-FM (the X-Foundation Model). X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method. The training method includes two new techniques for learning X-FM from text, image, and image-text pair data. One is to stop gradients from the vision-language training when learning the language encoder. The other is to leverage the…
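The first training technique named in the abstract, stopping gradients from vision-language training when learning the language encoder, can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes each encoder is a single scalar weight, uses squared-error losses, and writes the gradients out by hand so that it needs no deep learning library. The names `ToyXFM` and `gradients` are hypothetical. The second technique is cut off in the abstract above, so it is not sketched.

```python
from dataclasses import dataclass


@dataclass
class ToyXFM:
    """Scalar stand-in for the three encoders (an assumption, not the paper's model)."""
    w_lang: float    # language encoder weight
    w_vis: float     # vision encoder weight
    w_fusion: float  # fusion encoder weight


def gradients(model, text, image, y_lang, y_vl, stop_gradient=True):
    """Return hand-derived gradients of (language loss + vision-language loss)."""
    # Forward pass through the two unimodal encoders.
    h_lang = model.w_lang * text
    h_vis = model.w_vis * image

    # Language objective: squared error on the language encoder output.
    err_lang = h_lang - y_lang
    g_lang_from_lang = 2 * err_lang * text

    # Vision-language objective: the fusion encoder consumes both outputs.
    fused = model.w_fusion * (h_lang + h_vis)
    err_vl = fused - y_vl
    g_fusion = 2 * err_vl * (h_lang + h_vis)
    g_vis = 2 * err_vl * model.w_fusion * image

    # Chain-rule path from the vision-language loss back into the language
    # encoder. With stop_gradient=True this path is cut, so the language
    # encoder is updated by the language objective only.
    g_lang_from_vl = 0.0 if stop_gradient else 2 * err_vl * model.w_fusion * text

    return {
        "w_lang": g_lang_from_lang + g_lang_from_vl,
        "w_vis": g_vis,
        "w_fusion": g_fusion,
    }


model = ToyXFM(w_lang=2.0, w_vis=1.0, w_fusion=0.5)
stopped = gradients(model, text=1.0, image=2.0, y_lang=1.0, y_vl=0.0)
flowing = gradients(model, text=1.0, image=2.0, y_lang=1.0, y_vl=0.0,
                    stop_gradient=False)
print(stopped["w_lang"], flowing["w_lang"])
```

In this toy setting, only the language encoder's gradient changes between the two calls; the vision and fusion gradients are identical. In an autodiff framework the same effect is usually obtained by detaching the language encoder's output before it enters the fusion encoder.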
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
