Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning

Luis Lucas; David Tomas; Jose Garcia-Rodriguez

arXiv:2107.03751·cs.CV·July 9, 2021

TL;DR

This paper explores how combining visual and textual features from social media can enhance zero-shot image classification using CLIP, demonstrating improved accuracy with minimal fine-tuning.

Contribution

It introduces a multimodal ensemble classifier leveraging CLIP's transfer learning for social media images and texts, showing improved classification performance.

Findings

01. Adding the associated texts improves classification accuracy.

02. CLIP can be applied effectively with minimal fine-tuning.

03. A multimodal approach is promising for social media image classification.

Abstract

One of the main issues related to unsupervised machine learning is the cost of processing and extracting useful information from large datasets. In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture in multimodal environments (image and text) from social media. For this purpose, we used the InstaNY100K dataset and proposed a validation approach based on sampling techniques. Our experiments, based on image classification tasks according to the labels of the Places dataset, are performed by first considering only the visual part, and then adding the associated texts as support. The results obtained demonstrate that trained neural networks such as CLIP can be successfully applied to image classification with little fine-tuning, and that considering the texts associated with the images can help to improve the…
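The idea of classifying first from the image alone and then with the associated text as support can be illustrated with a minimal sketch. It assumes the image, the post text, and the candidate labels have already been embedded into a shared space (as CLIP's encoders do), and it combines the two zero-shot predictions by a weighted average. The function names, the `alpha` weight, the `temperature` value, and the toy vectors are all hypothetical choices for illustration; the abstract does not specify how the paper's ensemble combines the two modalities.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def zero_shot_scores(query_emb, label_embs, temperature=100.0):
    """Label probabilities from cosine similarity between one query
    embedding and each label embedding.
    temperature is an assumed scaling factor, not a value from the paper."""
    sims = l2_normalize(label_embs) @ l2_normalize(query_emb)
    return softmax(temperature * sims)

def ensemble_predict(image_emb, text_emb, label_embs, alpha=0.5):
    """Weighted average of image-based and text-based label probabilities.
    alpha=1.0 uses the image only; alpha=0.0 uses the text only.
    This averaging rule is an assumption, not the paper's stated method."""
    p_img = zero_shot_scores(image_emb, label_embs)
    p_txt = zero_shot_scores(text_emb, label_embs)
    p = alpha * p_img + (1.0 - alpha) * p_txt
    return int(np.argmax(p)), p

# Toy 3-D embeddings standing in for real encoder outputs (hypothetical values).
label_embs = np.eye(3)                   # three candidate scene labels
image_emb = np.array([0.60, 0.59, 0.10]) # image is ambiguous between labels 0 and 1
text_emb = np.array([0.10, 0.90, 0.10])  # caption points clearly at label 1

image_only_label, _ = ensemble_predict(image_emb, text_emb, label_embs, alpha=1.0)
combined_label, combined_probs = ensemble_predict(image_emb, text_emb, label_embs, alpha=0.5)
```

In this toy case the image alone narrowly favors the first label, while the caption shifts the combined prediction to the second. With a real model, the embeddings would come from the image and text encoders, and the label embeddings from prompts built on the Places category names.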

Taxonomy

Methods: Contrastive Language-Image Pre-training