InfMLLM: A Unified Framework for Visual-Language Tasks

Qiang Zhou; Zhibin Wang; Wei Chu; Yinghui Xu; Hao Li; Yuan Qi

arXiv:2311.06791 · cs.CV · December 7, 2023 · 1 citation

TL;DR

InfMLLM is a unified multimodal large language model that handles vision-language tasks such as image captioning, VQA, and visual grounding. It combines a three-stage training scheme with a simple visual adapter and achieves state-of-the-art or comparable results on benchmark datasets.

Contribution

This work introduces InfMLLM, a new multimodal language model with a simple visual adapter and a three-stage training scheme for improved vision-language task performance.

Findings

01

Achieves state-of-the-art or comparable results on benchmark datasets.

02

The pool-adapter effectively preserves positional information of visual embeddings.

03

Three-stage training enhances instruction-following capabilities.
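
The second finding can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes the adapter reduces the number of visual embeddings by average-pooling a 2D grid of patch embeddings, and it omits any projection into the LLM embedding space. The names `pool_grid` and `flatten_row_major` are hypothetical. The point is only that pooling over spatial neighbourhoods keeps the row and column ordering of the grid, so the shorter token sequence still reflects where each embedding came from.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as [row][col][dim]


def pool_grid(grid: Grid, out_size: int) -> Grid:
    """Average-pool an H x W grid of embeddings down to out_size x out_size.

    Assumption: H and W are divisible by out_size, so every output cell
    averages one non-overlapping block of neighbouring input cells.
    """
    h, w, dim = len(grid), len(grid[0]), len(grid[0][0])
    if h % out_size or w % out_size:
        raise ValueError("grid height and width must be divisible by out_size")
    kh, kw = h // out_size, w // out_size
    pooled: Grid = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = [0.0] * dim
            for di in range(kh):
                for dj in range(kw):
                    cell = grid[i * kh + di][j * kw + dj]
                    for d in range(dim):
                        acc[d] += cell[d]
            row.append([a / (kh * kw) for a in acc])
        pooled.append(row)
    return pooled


def flatten_row_major(grid: Grid) -> List[List[float]]:
    """Flatten the pooled grid into a token sequence, row by row."""
    return [cell for row in grid for cell in row]


# Toy input: a 4 x 4 grid where each "embedding" is simply its (row, col).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
pooled = pool_grid(grid, 2)          # 16 embeddings reduced to 4
tokens = flatten_row_major(pooled)   # sequence that would be passed onward
```

In this toy case each pooled embedding is the mean coordinate of its 2 x 2 block, so the four output tokens remain ordered top-left, top-right, bottom-left, bottom-right.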

Abstract

Large language models (LLMs) have proven their remarkable versatility in handling a comprehensive range of language-centric applications. To expand LLMs' capabilities to a broader spectrum of modal inputs, multimodal large language models (MLLMs) have attracted growing interest. This work delves into enabling LLMs to tackle more vision-language-related tasks, particularly image captioning, visual question answering (VQA), and visual grounding. To this end, we implemented a three-stage training scheme: starting with lightweight alignment pretraining, then moderate-weight multitask hybrid training, and finally, LLM fine-tuning to improve instruction-following capability. Throughout the training process, the requirements on GPU memory gradually increase. To effectively manage the number of visual embeddings passed to the LLM while preserving their positional information, we introduce a…
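
The three-stage scheme described in the abstract can be written down as a small schedule. This is a sketch under stated assumptions: the abstract names the stages but does not say which parameters are unfrozen in each, so the parameter groups, their sizes, and the per-stage trainable sets below are hypothetical placeholders chosen only to illustrate why memory requirements would grow from stage to stage.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: Tuple[str, ...]  # parameter groups unfrozen in this stage


# Hypothetical parameter groups with sizes in millions of parameters.
PARAM_GROUPS: Dict[str, int] = {
    "visual_adapter": 10,
    "vision_encoder": 300,
    "llm": 7000,
}

# Stage names follow the abstract; the trainable sets are assumptions.
STAGES: List[Stage] = [
    Stage("alignment_pretraining", ("visual_adapter",)),
    Stage("multitask_hybrid_training", ("visual_adapter", "vision_encoder")),
    Stage("llm_finetuning", ("visual_adapter", "llm")),
]


def trainable_millions(stage: Stage) -> int:
    """Sum the sizes of the parameter groups that receive gradients."""
    return sum(PARAM_GROUPS[group] for group in stage.trainable)


# Trainable parameter count per stage, a rough proxy for optimizer and
# gradient memory under the assumptions above.
budget = [(stage.name, trainable_millions(stage)) for stage in STAGES]
for name, millions in budget:
    print(f"{name}: {millions}M trainable parameters")
```

Because gradients and optimizer state are kept only for trainable parameters, a schedule that unfreezes progressively larger groups is consistent with the abstract's remark that GPU memory requirements rise across the stages.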

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Adapter