InfMLLM: A Unified Framework for Visual-Language Tasks
Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi

TL;DR
InfMLLM is a unified multimodal language model that handles vision-language tasks such as captioning, VQA, and grounding through a three-stage training process and a novel visual adapter, achieving state-of-the-art or comparable results on benchmark datasets.
Contribution
This work introduces InfMLLM, a new multimodal language model with a simple visual adapter and a three-stage training scheme for improved vision-language task performance.
Findings
Achieves state-of-the-art or comparable results on benchmark datasets.
The pool-adapter effectively preserves positional information of visual embeddings.
Three-stage training enhances instruction-following capabilities.
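To make the second finding concrete, here is a minimal sketch of how pooling over a 2D patch grid can reduce the number of visual embeddings while keeping their spatial order. It is an illustration under stated assumptions, not the paper's implementation: the use of non-overlapping average pooling, the row-major flattening, and the names `average_pool_grid` and `flatten_row_major` are all hypothetical, and plain Python lists stand in for tensors so the example runs without dependencies.

```python
from typing import List

# A grid of visual embeddings indexed as [row][col][channel].
Grid = List[List[List[float]]]


def average_pool_grid(grid: Grid, out_rows: int, out_cols: int) -> Grid:
    """Average-pool a patch grid down to out_rows x out_cols cells.

    Assumption: the input size is an exact multiple of the output size,
    so every output cell averages one non-overlapping window.
    """
    rows, cols, dim = len(grid), len(grid[0]), len(grid[0][0])
    if rows % out_rows or cols % out_cols:
        raise ValueError("grid size must be divisible by the output size")
    win_h, win_w = rows // out_rows, cols // out_cols

    pooled: Grid = []
    for r in range(out_rows):
        pooled_row = []
        for c in range(out_cols):
            acc = [0.0] * dim
            # Sum every embedding inside the window that maps to cell (r, c).
            for i in range(r * win_h, (r + 1) * win_h):
                for j in range(c * win_w, (c + 1) * win_w):
                    for d in range(dim):
                        acc[d] += grid[i][j][d]
            pooled_row.append([v / (win_h * win_w) for v in acc])
        pooled.append(pooled_row)
    return pooled


def flatten_row_major(grid: Grid) -> List[List[float]]:
    """Turn the pooled grid into a token sequence, row by row."""
    return [cell for row in grid for cell in row]


# Toy input: each 2-d embedding stores its own (row, col) coordinates,
# which makes it easy to see that the spatial layout survives pooling.
patches = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
tokens = flatten_row_major(average_pool_grid(patches, 2, 2))
```

In this toy case 16 embeddings become 4, and each pooled token still sits at the grid position of the window it summarizes, so the sequence order reflects image layout. A learned query-based resampler, by contrast, offers no such guarantee by construction.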
Abstract
Large language models (LLMs) have proven their remarkable versatility in handling a comprehensive range of language-centric applications. To expand LLMs' capabilities to a broader spectrum of modal inputs, multimodal large language models (MLLMs) have attracted growing interest. This work delves into enabling LLMs to tackle more vision-language-related tasks, particularly image captioning, visual question answering (VQA), and visual grounding. To this end, we implemented a three-stage training scheme: starting with lightweight alignment pretraining, then moderate-weight multitask hybrid training, and finally, LLM fine-tuning to improve instruction-following capability. Throughout the training process, the requirements on GPU memory gradually increase. To effectively manage the number of visual embeddings passed to the LLM while preserving their positional information, we introduce a…
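The three-stage scheme named in the abstract can be pictured as a schedule that unfreezes progressively heavier parts of the model. The sketch below is illustrative only: the abstract names the stages and says GPU memory requirements grow, but it does not say which modules are trained in each stage, so the `trainable` sets, the module names, and the parameter counts in `PARAM_COUNTS_M` are all assumptions chosen to show the pattern.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: FrozenSet[str]  # modules that receive gradients in this stage


# Hypothetical module sizes in millions of parameters (not from the paper).
PARAM_COUNTS_M: Dict[str, int] = {
    "vision_encoder": 300,
    "visual_adapter": 10,
    "llm": 7000,
}

# Assumed assignment of trainable modules to the three named stages.
STAGES: Tuple[Stage, ...] = (
    Stage("alignment_pretraining", frozenset({"visual_adapter"})),
    Stage("multitask_hybrid_training",
          frozenset({"visual_adapter", "vision_encoder"})),
    Stage("llm_finetuning", frozenset({"visual_adapter", "llm"})),
)


def requires_grad_flags(stage: Stage) -> Dict[str, bool]:
    """Map every module to whether it is trained (True) or frozen (False)."""
    return {module: module in stage.trainable for module in PARAM_COUNTS_M}


def trainable_params_m(stage: Stage) -> int:
    """Total trainable parameters (millions) for the given stage."""
    return sum(PARAM_COUNTS_M[module] for module in stage.trainable)


for stage in STAGES:
    print(stage.name, requires_grad_flags(stage), trainable_params_m(stage))
```

Under these assumed numbers the trainable parameter count rises from stage to stage, which is one plausible reason memory requirements would increase over the course of training: gradients and optimizer state are kept only for unfrozen modules.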
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Adapter
