Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling

Cristian Rodriguez-Opazo; Ehsan Abbasnejad; Damien Teney; Hamed Damirchi; Edison Marrese-Taylor; Anton van den Hengel

arXiv:2405.17139·cs.CV·February 18, 2025

Open Access

TL;DR

This paper investigates the differences among CLIP-trained backbones, revealing their unique strengths and proposing an adaptive ensemble method that significantly boosts image classification accuracy across diverse datasets.

Contribution

It introduces an adaptive backbone ensembling approach that leverages backbone diversity to improve CLIP-based image classification performance.

Findings

01

Backbones have distinct representations and robustness properties.

02

Adaptive ensembling improves accuracy by up to 39.1% over the best single backbone.

03

Performance gains surpass traditional ensemble methods.

Abstract

Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various architectures, from vision transformers (ViTs) to convolutional networks (ResNets), have been trained with CLIP to serve as general solutions to diverse vision tasks. This paper explores the differences across various CLIP-trained vision backbones. Despite using the same data and training objective, we find that these architectures have notably different representations, different classification performance across datasets, and different robustness properties to certain types of image perturbations. Our findings indicate a remarkable possible synergy across backbones by leveraging their respective strengths. In principle, classification accuracy could be improved by over 40 percentage points with an informed selection of the optimal backbone per test example. Using this…
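The "informed selection of the optimal backbone per test example" can be read as an oracle upper bound: an example counts as correct if at least one backbone classifies it correctly. The following is a minimal sketch of that calculation, not the authors' code. The backbone names, labels, and predictions are hypothetical toy values chosen only to show how the oracle can exceed every single backbone when their errors do not overlap.

```python
from typing import Dict, List


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of examples where the predicted class equals the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_accuracy(preds_by_backbone: Dict[str, List[int]], labels: List[int]) -> float:
    """Upper bound from picking, per example, any backbone that is correct.

    An example is a hit if at least one backbone predicts its label.
    """
    hits = 0
    for i, y in enumerate(labels):
        if any(preds[i] == y for preds in preds_by_backbone.values()):
            hits += 1
    return hits / len(labels)


# Hypothetical toy data (assumption, not from the paper): six test examples,
# three classes, two backbones that make different mistakes.
labels = [0, 1, 2, 1, 0, 2]
preds_by_backbone = {
    "vit_like": [0, 1, 0, 0, 0, 1],     # correct on examples 0, 1, 4
    "resnet_like": [1, 1, 2, 1, 2, 1],  # correct on examples 1, 2, 3
}

single = {name: accuracy(p, labels) for name, p in preds_by_backbone.items()}
best_single = max(single.values())
oracle = oracle_accuracy(preds_by_backbone, labels)

print(f"best single backbone: {best_single:.3f}")
print(f"oracle selection:     {oracle:.3f}")
print(f"gap (points):         {100 * (oracle - best_single):.1f}")
```

In this toy setup each backbone is right on half of the examples, yet the oracle is right on five of six because the backbones fail on different inputs. The gap between the two numbers is the kind of headroom the abstract refers to; a practical method has to approximate the oracle without access to the test labels.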

Taxonomy

Topics: ICT Impact and Policies · Advanced Optical Network Technologies

Methods: Contrastive Language-Image Pre-training