GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task

Ning Ding; Yehui Tang; Zhongqian Fu; Chao Xu; Kai Han; Yunhe Wang

arXiv:2306.00693 · cs.CV · February 28, 2025 · 2 cites

TL;DR

This paper introduces GPT4Image, a framework that leverages large pre-trained multimodal models to enhance the representation learning of traditional vision models like CNNs and ViTs on perception tasks, improving their performance.

Contribution

The paper proposes a novel method to incorporate rich semantic knowledge from pre-trained large models into vision models via text embeddings, boosting their perception capabilities.
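The idea of using text embeddings as an extra training signal can be illustrated with a minimal sketch. This page does not specify the paper's actual loss, so everything below is an assumption made for illustration: the text embedding is taken to be precomputed for each image, the alignment term is a cosine distance, and the names `combined_loss` and `weight` are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def combined_loss(logits, label, image_feature, text_embedding, weight=0.5):
    """Classification loss plus an auxiliary image-text alignment term.

    Assumption: `text_embedding` was produced offline from a description of
    the image, and `image_feature` has already been projected to the same
    dimension. `weight` is a hypothetical trade-off hyperparameter.
    """
    classification = cross_entropy(logits, label)
    alignment = cosine_distance(image_feature, text_embedding)
    return classification + weight * alignment
```

Under these assumptions, a vision model trained with `combined_loss` is penalized both for misclassifying an image and for producing a feature that points away from the image's text embedding; the text side is only needed during training, so inference cost is unchanged.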

Findings

01. Enhanced image classification accuracy across multiple architectures.

02. Effective use of text embeddings as additional supervision signals.

03. Improved generalization on various visual perception benchmarks.

Abstract

The upsurge in pre-trained large models started by ChatGPT has swept across the entire deep learning community. Such powerful models demonstrate advanced generative ability and multimodal understanding capability, and have quickly set new state-of-the-art results on a variety of benchmarks. A pre-trained LLM usually plays the role of a universal AI model that can conduct various tasks such as article analysis and image comprehension. However, due to the prohibitively high memory and computational cost of deploying such a large model, conventional models (such as CNNs and ViTs) are still essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models on perception tasks (e.g. image classification) by taking advantage of off-the-shelf large pre-trained models. We present a new learning framework, dubbed GPT4Image, where…

Code & Models

Repositories

huawei-noah/Efficient-Computing
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques