Wings: Learning Multimodal LLMs without Text-only Forgetting
Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

TL;DR
Wings is a multimodal large language model (MLLM) that keeps strong performance on both text-only and multimodal tasks by compensating for attention shifts with dedicated visual and textual learners.
Contribution
The paper proposes Wings, an MLLM architecture whose parallel visual and textual learners mitigate text-only forgetting while improving performance on both multimodal and text-only tasks.
Findings
Wings outperforms comparable MLLMs in text-only and visual question-answering tasks.
The model effectively balances attention between visual and textual inputs.
Wings achieves superior results on the new Interleaved Image-Text benchmark.
Abstract
Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct extra modules that act as the boosted learner to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside…
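The parallel placement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cross_attend`, `layer_with_learners`), the fixed `route` weights, and the projection-free attention are all assumptions made for illustration. It only shows the general idea of two side branches, one attending to image features and one to text features, whose outputs are added to the main attention output of a layer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, memory):
    """Single-head scaled dot-product attention without learned projections.

    Hypothetical stand-in: a real learner would have trainable weights.
    """
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in memory])
        out.append(
            [sum(w * v[i] for w, v in zip(weights, memory)) for i in range(len(q))]
        )
    return out


def layer_with_learners(hidden, image_feats, text_feats, route=(0.5, 0.5)):
    """Main attention plus two parallel side branches (illustrative only).

    `route` holds assumed fixed mixing weights for the visual and textual
    branches; how the paper actually weights them is not stated in the
    abstract above.
    """
    main = cross_attend(hidden, hidden)      # stand-in for the LLM's self-attention
    vis = cross_attend(hidden, image_feats)  # visual learner branch
    txt = cross_attend(hidden, text_feats)   # textual learner branch
    w_v, w_t = route
    return [
        [m + w_v * v + w_t * t for m, v, t in zip(m_row, v_row, t_row)]
        for m_row, v_row, t_row in zip(main, vis, txt)
    ]


hidden = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[0.5, 0.5]]
text_feats = [[1.0, 1.0], [0.0, 0.0]]
mixed = layer_with_learners(hidden, image_feats, text_feats)
```

Setting `route=(0.0, 0.0)` reduces the layer to the main attention path alone, which is one way to see that the side branches are additive and leave the original path untouched.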
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
Methods: Focus · ALIGN
