GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task
Ning Ding, Yehui Tang, Zhongqian Fu, Chao Xu, Kai Han, Yunhe Wang

TL;DR
This paper introduces GPT4Image, a framework that leverages large pre-trained multimodal models to enhance the representation learning of traditional vision models like CNNs and ViTs on perception tasks, improving their performance.
Contribution
The paper proposes a novel method to incorporate rich semantic knowledge from pre-trained large models into vision models via text embeddings, boosting their perception capabilities.
Findings
Enhanced image classification accuracy across multiple architectures.
Effective use of text embeddings as additional supervision signals.
Improved generalization on various visual perception benchmarks.
Abstract
The upsurge in pre-trained large models started by ChatGPT has swept across the entire deep learning community. Such powerful models demonstrate advanced generative ability and multimodal understanding capability, which quickly set new state of the arts on a variety of benchmarks. The pre-trained LLM usually plays the role as a universal AI model that can conduct various tasks like article analysis and image comprehension. However, due to the prohibitively high memory and computational cost of implementing such a large model, the conventional models (such as CNN and ViT) are still essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models on perception tasks (e.g. image classification) by taking advantage of the off-the-shelf large pre-trained models. We present a new learning framework, dubbed GPT4Image, where…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
