MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning
Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, Anette Frank

TL;DR
MAGMA is a simple adapter-based finetuning method that adds multimodal capabilities to generative language models. It achieves state-of-the-art results on OKVQA with significantly less pretraining data while preserving the knowledge acquired during language pretraining.
Contribution
The paper presents a novel end-to-end multimodal finetuning approach that leaves the language model weights unchanged, enabling efficient training and the transfer of knowledge from language pretraining.
Findings
Outperforms Frozen on open-ended generative tasks
Achieves state-of-the-art on OKVQA benchmark
Requires significantly less pretraining data
Abstract
Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving…
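The abstract does not spell out the adapter architecture, so the following is only a minimal sketch of the general idea it describes: a pretrained sublayer whose weights stay fixed, followed by a small trainable adapter. It assumes a residual bottleneck adapter with a zero-initialised up-projection, which is a common adapter design and not necessarily the one used in MAGMA. All names (`BottleneckAdapter`, `frozen_layer`, `W_frozen`) and sizes are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


class BottleneckAdapter:
    """Assumed adapter form: h + W_up @ relu(W_down @ h).

    Only W_down and W_up would be trained; the language model is untouched.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck, small random initialisation.
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection: bottleneck -> dim, zero-initialised so the adapter
        # starts out as the identity function.
        self.W_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        delta = matvec(self.W_up, relu(matvec(self.W_down, h)))
        return [a + d for a, d in zip(h, delta)]

    def num_params(self):
        return (len(self.W_down) * len(self.W_down[0])
                + len(self.W_up) * len(self.W_up[0]))


def frozen_layer(h, W_frozen):
    """Stand-in for a pretrained language-model sublayer (weights never updated)."""
    return matvec(W_frozen, h)


dim, bottleneck = 8, 2
rng = random.Random(1)
W_frozen = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
adapter = BottleneckAdapter(dim, bottleneck)

h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
base_out = frozen_layer(h, W_frozen)   # output of the frozen sublayer
adapted_out = adapter(base_out)        # same output passed through the adapter

print("frozen parameters:   ", dim * dim)
print("trainable parameters:", adapter.num_params())
print("identity at init:    ", adapted_out == base_out)
```

Under these assumptions the adapter leaves the frozen layer's output unchanged at initialisation, so training starts from the behaviour of the original language model, and only the much smaller set of adapter weights would receive gradient updates.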
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Simple Visual Language Model
