Meta-Transformer: A Unified Framework for Multimodal Learning
Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue

TL;DR
Meta-Transformer introduces a unified multimodal learning framework that uses a frozen encoder to process 12 different modalities without paired training data, enabling versatile perception and data analysis tasks.
Contribution
It is the first framework to perform unified multimodal learning across 12 modalities using a frozen encoder and unpaired data, bridging gaps among diverse data types.
Findings
Handles 12 modalities including text, images, point clouds, and more.
Achieves competitive performance on various perception and data mining benchmarks.
Demonstrates the potential for unified multimodal intelligence with transformers.
Abstract
Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities (e.g. natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for…
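The three components named in the abstract can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: the class names, the tiny shared dimension `D`, the chunk-based tokenizers, and the single linear layer with ReLU and mean pooling that stands in for the transformer encoder. It is not the authors' implementation; it only shows how per-modality tokenizers can feed one encoder whose parameters are never updated, with a separate head per task.

```python
import random

D = 4  # shared token dimension (hypothetical; tiny for illustration)


def make_matrix(rows, cols, rng):
    """Random rows x cols weight matrix as nested lists."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(vec, weights):
    """Linear map: weights has len(vec) rows; returns one value per column."""
    cols = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(cols)]


class Tokenizer:
    """Modality-specific: cuts raw input into chunks, projects each to D dims."""

    def __init__(self, chunk_size, rng):
        self.chunk_size = chunk_size
        self.weights = make_matrix(chunk_size, D, rng)  # trainable in practice

    def __call__(self, raw):
        step = self.chunk_size
        chunks = [raw[i:i + step] for i in range(0, len(raw) - step + 1, step)]
        return [project(chunk, self.weights) for chunk in chunks]


class FrozenEncoder:
    """Shared by all modalities. Weights are stored as tuples and never updated.
    A linear layer + ReLU + mean pooling stands in for the transformer."""

    def __init__(self, rng):
        self._weights = tuple(tuple(row) for row in make_matrix(D, D, rng))

    def __call__(self, tokens):
        encoded = [[max(0.0, x) for x in project(t, self._weights)] for t in tokens]
        return [sum(tok[j] for tok in encoded) / len(encoded) for j in range(D)]


class Head:
    """Task-specific: maps the pooled D-dim feature to task outputs."""

    def __init__(self, num_outputs, rng):
        self.weights = make_matrix(D, num_outputs, rng)  # trainable in practice

    def __call__(self, feature):
        return project(feature, self.weights)


rng = random.Random(0)
encoder = FrozenEncoder(rng)                 # one encoder for every modality
image_tok = Tokenizer(chunk_size=4, rng=rng)  # e.g. flattened image patches
audio_tok = Tokenizer(chunk_size=2, rng=rng)  # e.g. short waveform frames
image_head = Head(num_outputs=3, rng=rng)
audio_head = Head(num_outputs=2, rng=rng)

image = [0.1 * i for i in range(16)]          # toy "image" of 16 values
audio = [0.5, -0.2, 0.3, 0.9, -0.7, 0.0]      # toy "audio" of 6 samples

image_logits = image_head(encoder(image_tok(image)))
audio_logits = audio_head(encoder(audio_tok(audio)))
```

In this sketch only the tokenizers and heads would receive gradient updates during training; the encoder object is shared unchanged between the image and audio paths, which mirrors the frozen, modality-shared design described above.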
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1 To address the challenge of learning from multiple modalities, the authors propose a unified pipeline that includes a modality-specialist for data-to-sequence tokenization, a modality-shared encoder for extracting representations across modalities, and task-specific heads for downstream tasks. This provides a comprehensive solution for multimodal learning. 2 To showcase the capabilities of the Transformer in multimodal learning, a wide range of modalities and tasks were utilized for training,
1 There appears to be a discrepancy in the description of the tokenization process. In Figure 3, it shows the use of a 1x1 convolution for feature dimension mapping during tokenization. However, on page 6, in the first line, it mentions the use of CLIP for learning word embeddings. This seems to be conflicting information. It's important to clarify and ensure consistency in the tokenization process described in the paper. 2 I think using the Visual Transformer (ViT) for encoder in pretraining a
1. It is interesting to see a model handling 12 modalities. 2. The proposed idea is straightforward. 3. The paper is presented overall clearly.
1. Despite the success of one model handling multiple modalities, the insight provided is rather limited. There are many important questions that are not really answered. a. Why use the Meta-Transformer in this pretrained manner? How about other pretraining manners on images? How about a pretrained transformer in other modalities like text? b. The conclusion also touches on a claim that the transformer is the future universal architecture. However, other architectures are not really validated. On the
1. The paper is well-written and presents its ideas clearly. 2. The proposed Meta-Transformer framework demonstrates significant innovation and practicality in handling multimodal learning, especially with unpaired data. 3. The results provided across multiple benchmarks validate the effectiveness of your approach and cover a wide range of applications, providing evidence of the method's broad applicability and robustness. 4. The performance of Meta-Transformer on cross-modal retrieval, referrin
1. The model has achieved commendable results, but I believe that further scaling up the model could potentially yield even more intriguing outcomes. 2. It is noted that the base model parameters are frozen during the training of different tasks. Therefore, most of the model's capabilities actually stem from contrastive learning between images and text. I think this approach to model training is still quite distant from achieving a truly universal model, as contrastive learning largely focuses
1. The work addresses an important and interesting topic in multimodal learning and tries to cover up to 12 modalities. Additionally, I think it is well structured and easy to follow. 2. The authors conducted very extensive experiments and comparisons.
1. It is a bit difficult for me to draw the conclusion that the proposed method performs better than other baselines. For example, in Table 3, Table 5, Table 8, we can see a clear performance gap as compared to other baselines. 2. Another concern is about novelty. I feel the technical novelty is limited; a similar concept has been explored widely since [1]. The difference is mainly about the shared component. [1] Ngiam, Jiquan, et al. "Multimodal deep learning." Proceedings of the 28th interna
+ For multimodal research, the paper proposes a novel framework, Meta-Transformer, which utilizes a unified encoder to simultaneously extract representations from multiple modalities with the same set of parameters. + For multimodal network design, the paper comprehensively examines the functions of transformer components (e.g. embeddings, tokenization) and encoders in processing various modalities. Meta-Transformer provides valuable insights and sparks a promising new direction in developing a
- The paper has very beautiful figures and represents very hard work across 12 modalities, datasets, tasks, loss functions, and heads. One of the major weaknesses of the paper is that the novelty might not be enough for a top conference. There is not much innovation in the data-to-sequence tokenization. All the tokenization strategies, including patch embedding, word embedding, etc., are existing ones. The framework of the method is a widely used ViT. I acknowledge the hard work of datasets and experiments aut
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition
