MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny

TL;DR
MiniGPT-4 demonstrates that aligning a frozen visual encoder with an advanced large language model enables multi-modal abilities similar to GPT-4, including detailed image descriptions, creative writing, and website generation from sketches.
Contribution
This work is the first to show that proper alignment of visual features with a sophisticated LLM can unlock advanced multi-modal capabilities, bridging vision and language understanding.
Findings
MiniGPT-4 can generate detailed image descriptions and creative content.
Fine-tuning with a curated dataset improves generation quality and reliability.
The approach reveals that large language models can be effectively adapted for vision-language tasks.
Abstract
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLMs). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can yield numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from…
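The alignment idea in the abstract (a frozen visual encoder, a frozen LLM, and one projection layer between them) can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the function names (`make_projection`, `project`, `build_llm_input`), the random initialization, and the choice to place projected visual tokens ahead of the text embeddings are all assumptions made for illustration. The only point it shows is that a single linear map is enough to bring visual features to the embedding size the language model expects.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create weights and bias for one linear layer (hypothetical init scheme)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(visual_tokens, weight, bias):
    """Map each visual token (length in_dim) to the LLM embedding size (length out_dim)."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]

def build_llm_input(visual_tokens, text_embeddings, weight, bias):
    """Assumption: projected visual tokens are placed before the text embeddings."""
    return project(visual_tokens, weight, bias) + text_embeddings

# Toy sizes chosen only for readability; real models use far larger dimensions.
VISION_DIM, LLM_DIM = 4, 6
weight, bias = make_projection(VISION_DIM, LLM_DIM)

# Stand-ins for the outputs of the frozen visual encoder and the LLM's embedding table.
visual_tokens = [[0.1 * (i + j) for j in range(VISION_DIM)] for i in range(3)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

sequence = build_llm_input(visual_tokens, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of size 6
```

In this sketch only `weight` and `bias` would be trained; the encoder and the language model stay fixed, which is what keeps the alignment step cheap.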
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection
