Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

Xinsong Zhang; Yan Zeng; Jipeng Zhang; Hang Li

arXiv:2301.05065 · cs.CV · October 18, 2023

TL;DR

This paper introduces X-FM, a general foundation model with one language encoder, one vision encoder, and one fusion encoder, trained with two new techniques so that a single model can perform well on language, vision, and vision-language understanding tasks.

Contribution

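The paper proposes X-FM, a general foundation model that pairs a three-encoder architecture (language, vision, fusion) with a training method for learning from text, image, and image-text pair data. One of its two new techniques stops gradients from the vision-language training when learning the language encoder.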

Findings

01 X-FM outperforms existing general foundation models.

02 X-FM matches or exceeds specialized models in various tasks.

03 The model demonstrates strong cross-modal understanding capabilities.

Abstract

Foundation models, or pre-trained models, have substantially improved performance on various language, vision, and vision-language understanding tasks. However, each existing foundation model performs best on only one type of task: language, vision, or vision-language. It remains an open question whether one can construct a foundation model that performs best on all of these understanding tasks, which we call a general foundation model. In this paper, we propose a new general foundation model, X-FM (the X-Foundation Model). X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method. The training method includes two new techniques for learning X-FM from text, image, and image-text pair data. One is to stop gradients from the vision-language training when learning the language encoder. The other is to leverage the…
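The first technique named in the abstract, stopping gradients from the vision-language training when learning the language encoder, can be illustrated with a minimal PyTorch sketch. Everything here is an assumption made for illustration: the single-layer toy encoders, the tensor sizes, the placeholder losses, and the exact place where `detach()` is applied are hypothetical and are not taken from the X-FM code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy stand-ins for the three encoders (not X-FM's real modules).
language_encoder = nn.Linear(8, 8)
vision_encoder = nn.Linear(8, 8)
fusion_encoder = nn.Linear(16, 1)

text = torch.randn(4, 8)   # fake text features
image = torch.randn(4, 8)  # fake image features

text_feat = language_encoder(text)
image_feat = vision_encoder(image)

# Stop gradient (assumed placement): the fusion encoder still reads the text
# features, but the vision-language loss cannot reach the language encoder.
fused = fusion_encoder(torch.cat([text_feat.detach(), image_feat], dim=-1))
vl_loss = fused.pow(2).mean()  # placeholder vision-language objective
vl_loss.backward()

# Snapshot of the language encoder's gradient after the vision-language step.
lang_grad_after_vl = language_encoder.weight.grad
vision_grad_after_vl = vision_encoder.weight.grad

# The language encoder is updated only by its own (placeholder) language loss.
lang_loss = text_feat.pow(2).mean()
lang_loss.backward()
lang_grad_after_lang = language_encoder.weight.grad
```

In this sketch the vision-language loss produces gradients for the vision and fusion encoders only, while the language encoder receives gradients solely from the language objective.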

Code & Models

Repositories

zhangxinsong-nlp/XFM
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques