ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter

Zhengqing Yuan; Yunhong He; Kun Wang; Yanfang Ye; Lichao Sun

arXiv:2305.07490 · cs.CL · April 8, 2024 · 2 cites

TL;DR

ArtGPT-4 introduces a specialized vision-language model with adapter layers for improved artistic image understanding, achieving state-of-the-art results efficiently on artistic datasets with minimal fine-tuning.

Contribution

The paper presents ArtGPT-4, a novel large vision-language model that enhances artistic comprehension using adapter layers, enabling efficient training and superior performance.

Findings

01. Efficient training within 2 hours on a Tesla A100.
02. State-of-the-art performance on ArtEmis datasets.
03. Negligible gap to professional artists' descriptions.

Abstract

The success of large language models (LLMs) has inspired an emerging research field of multimodal learning. However, a grand challenge in exploiting LLMs for multimodal learning is the size of pre-trained LLMs, which typically have billions of parameters. To tackle this challenge, models such as MiniGPT-4 and LLaVA have been developed to fine-tune the pre-trained models using fewer parameters. Despite their promising performance, these models remain limited in their understanding of artistic imagery. To facilitate better artistic understanding, in this paper we propose ArtGPT-4, a pioneering large vision-language model tailored to address the limitations of existing models in artistic comprehension. The key innovation of ArtGPT-4 lies in its design for the sophisticated challenge of artistic image comprehension, setting it apart from other models that overlook fine details for broader…

Code & Models

Repositories

dlyuangod/artgpt-4
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Aesthetic Perception and Analysis · Visual Attention and Saliency Detection

Methods: Adapter · Attention Is All You Need · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Absolute Position Encodings · Softmax · Layer Normalization
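
Several of the methods listed above (Adapter, Linear Layer, Residual Connection, Layer Normalization) are commonly combined in a generic bottleneck adapter. The sketch below illustrates that general pattern in plain Python. It is not the ArtGPT-4 architecture, which this page does not describe. The class name `BottleneckAdapter`, the dimensions, the ReLU activation, and the zero initialization of the up-projection are all assumptions chosen for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Compute y = W x + b, where `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class BottleneckAdapter:
    """Hypothetical generic adapter: x + up(relu(down(layer_norm(x))))."""

    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        # Down-projection: hidden_dim -> bottleneck_dim, small random weights.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
                       for _ in range(bottleneck_dim)]
        self.b_down = [0.0] * bottleneck_dim
        # Up-projection: bottleneck_dim -> hidden_dim, zero-initialized
        # (an assumption) so the adapter starts out as an identity map.
        self.w_up = [[0.0] * bottleneck_dim for _ in range(hidden_dim)]
        self.b_up = [0.0] * hidden_dim

    def num_parameters(self):
        """Count the adapter's own weights and biases."""
        down = len(self.w_down) * len(self.w_down[0]) + len(self.b_down)
        up = len(self.w_up) * len(self.w_up[0]) + len(self.b_up)
        return down + up

    def __call__(self, x):
        h = linear(layer_norm(x), self.w_down, self.b_down)
        h = [max(0.0, v) for v in h]  # ReLU nonlinearity
        delta = linear(h, self.w_up, self.b_up)
        # Residual connection: the adapter only adds a correction to x.
        return [xi + di for xi, di in zip(x, delta)]


adapter = BottleneckAdapter(hidden_dim=8, bottleneck_dim=2)
x = [float(i) for i in range(8)]
y = adapter(x)
```

In adapter-style fine-tuning generally, only small modules like this are updated while the pre-trained weights stay frozen, which is what keeps the number of trained parameters low. Because of the residual connection and the zero-initialized up-projection, the sketch initially returns its input unchanged.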