Multimodal Multilabel Classification by CLIP

Yanming Guo

arXiv:2406.16141 · cs.CV · June 25, 2024 · 1 citation

TL;DR

This paper uses CLIP as a feature extractor for multimodal multilabel classification over image and text inputs, fine-tunes it with different classification heads, fusion methods, and loss functions, and reports a best F1 score above 90% on a public Kaggle competition leaderboard.

Contribution

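The paper reviews state-of-the-art approaches to multimodal multilabel classification, proposes fine-tuning CLIP as the feature extractor with different classification heads, fusion methods, and loss functions, and supports the approach with a quantitative analysis of the experimental results.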

Findings

01

Achieved an F1 score above 90% on the public Kaggle competition leaderboard

02

Demonstrated effectiveness of CLIP-based fine-tuning methods

03

Provided a detailed analysis of fusion methods and loss functions

Abstract

Multimodal multilabel classification (MMC) is a challenging task that aims to design a learning algorithm that handles two data sources, image and text, and learns a comprehensive semantic feature representation across the modalities. In this work, we review a large number of state-of-the-art approaches to MMC and propose a novel technique that uses Contrastive Language-Image Pre-training (CLIP) as the feature extractor and fine-tunes the model by exploring different classification heads, fusion methods, and loss functions. Our best result achieved an F1 score of more than 90% on the public Kaggle competition leaderboard. This paper provides detailed descriptions of the novel training methods and a quantitative analysis of the experimental results.
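To make the pipeline named in the abstract concrete, here is a minimal sketch of one possible configuration: concatenation as the fusion method, a single linear layer as the classification head, binary cross-entropy as the loss, and a micro-averaged F1 as the metric. These are illustrative assumptions, not the paper's confirmed choices. The abstract does not say which head, fusion method, or loss performed best, nor which F1 averaging the competition used. The toy vectors stand in for CLIP image and text embeddings, and all function names and parameter values are hypothetical.

```python
import math


def concat_fusion(image_feat, text_feat):
    """Fuse two embeddings by concatenation (one simple fusion choice)."""
    return list(image_feat) + list(text_feat)


def linear_head(features, weights, biases):
    """Linear classification head: one logit per label."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(logits, targets):
    """Mean binary cross-entropy, treating each label independently."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1.0 - p + eps)
    return total / len(targets)


def predict(logits, threshold=0.5):
    """Turn logits into a 0/1 decision per label."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def micro_f1(pred_rows, true_rows):
    """Micro-averaged F1 over all (sample, label) pairs."""
    tp = fp = fn = 0
    for pred, true in zip(pred_rows, true_rows):
        for p, t in zip(pred, true):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy inputs standing in for CLIP image and text embeddings.
image_feat = [1.0, 0.0]
text_feat = [0.0, 1.0]
fused = concat_fusion(image_feat, text_feat)

# Hypothetical head parameters for three labels (normally learned in training).
weights = [[2.0, 0.0, 0.0, 2.0],
           [-2.0, 0.0, 0.0, -2.0],
           [1.0, 0.0, 0.0, -1.0]]
biases = [0.0, 0.0, 0.0]

logits = linear_head(fused, weights, biases)
targets = [1, 0, 0]
loss = bce_loss(logits, targets)
preds = predict(logits)
score = micro_f1([preds], [targets])
```

The sigmoid-per-label design is what distinguishes multilabel from multiclass classification: each label gets an independent yes/no decision, so a sample can carry any number of labels, whereas a softmax would force the labels to compete for a single assignment.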

Taxonomy

Topics: Lexicography and Language Studies · Natural Language Processing Techniques