MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu; Jun Chen; Xiaoqian Shen; Xiang Li; Mohamed Elhoseiny

arXiv:2304.10592 · cs.CV · October 3, 2023 · 475 cites

Open Access · 5 Repos · 2 Models · 1 Dataset

TL;DR

MiniGPT-4 demonstrates that aligning a frozen visual encoder with an advanced large language model enables multi-modal abilities similar to GPT-4, including detailed image descriptions, creative writing, and website generation from sketches.

Contribution

This work is the first to show that proper alignment of visual features with a sophisticated LLM can unlock advanced multi-modal capabilities, bridging vision and language understanding.

Findings

01. MiniGPT-4 can generate detailed image descriptions and creative content.

02. Fine-tuning with a curated dataset improves generation quality and reliability.

03. The approach reveals that large language models can be effectively adapted for vision-language tasks.

Abstract

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the use of sophisticated large language models (LLMs). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can yield numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from…
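The alignment idea in the abstract (a frozen visual encoder, a frozen LLM, and one projection layer between them) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: `frozen_visual_encoder`, `frozen_text_embedding`, `build_llm_inputs`, and all dimensions are hypothetical stand-ins chosen for illustration. The only point is that the projection is the sole component with trainable parameters, and that its output lives in the same embedding space as the text tokens.

```python
import random


class LinearProjection:
    """The only trainable component in this sketch: y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        # Map one visual feature vector into the LLM embedding space.
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


def frozen_visual_encoder(image, num_tokens=4, feat_dim=6):
    """Hypothetical stand-in for a frozen visual encoder (no trainable state)."""
    rng = random.Random(sum(image))
    return [[rng.uniform(-1, 1) for _ in range(feat_dim)]
            for _ in range(num_tokens)]


def frozen_text_embedding(token_ids, embed_dim=8):
    """Hypothetical stand-in for the frozen LLM's token embedding lookup."""
    return [[((tid * (j + 1)) % 7) / 7.0 for j in range(embed_dim)]
            for tid in token_ids]


def build_llm_inputs(image, prompt_ids, projection):
    """Project visual features, then prepend them to the text embeddings."""
    visual_tokens = frozen_visual_encoder(image)
    projected = [projection(v) for v in visual_tokens]
    return projected + frozen_text_embedding(prompt_ids)


projection = LinearProjection(in_dim=6, out_dim=8)
inputs = build_llm_inputs(image=[1, 2, 3], prompt_ids=[5, 9, 2],
                          projection=projection)
print(len(inputs), len(inputs[0]))   # 7 8
print(projection.num_parameters())   # 56
```

In this toy setup the sequence handed to the language model has four projected visual tokens followed by three text tokens, all of width 8, and training would update only the 56 projection parameters while both stand-in models stay fixed.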

Code & Models

Datasets

zai-org/CogVLM-SFT-311K
dataset · 63 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection