Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations

Mohammed Baharoon; Jonathan Klein; Dominik L. Michels

arXiv:2405.14239 · cs.LG · June 24, 2025

TL;DR

Harmony combines vision-language contrastive learning with discriminative and generative self-supervised methods to improve general-purpose visual representations, especially for dense prediction tasks such as segmentation and detection, and is designed to train on web-scraped data without relying on negative samples.

Contribution

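The paper introduces a unified training framework that integrates vision-language contrastive, discriminative self-supervised, and generative self-supervised objectives, and addresses key challenges in weakly-supervised visual learning from web data, notably the lack of localized features.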

Findings

01 Outperforms baseline CLIP on various tasks

02 Surpasses previous joint self- and weakly-supervised methods

03 Effective in data-constrained scenarios

Abstract

Vision-language contrastive learning frameworks such as CLIP enable learning representations from natural language supervision and provide strong zero-shot classification capabilities. However, due to the nature of the supervisory signal in these paradigms, they lack the ability to learn localized features, leading to degraded performance on dense prediction tasks such as segmentation and detection. On the other hand, self-supervised learning methods have shown the ability to learn granular representations, complementing the high-level features in vision-language training. In this work, we present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features that can be generalized across different downstream vision tasks. Our framework is specifically designed to work on web-scraped data by not relying on…
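The vision-language contrastive objective that the abstract attributes to CLIP-style training can be illustrated with a small, dependency-free sketch. This is not Harmony's implementation: the function names, the toy embeddings, and the fixed temperature of 0.07 are assumptions made for illustration, and the self-supervised parts of the framework are not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities (CLIP-style).

    Pair k is (image_embs[k], text_embs[k]); every other item in the batch
    acts as a negative. The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should peak on its own caption.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: each column should peak on its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned:", clip_contrastive_loss(images, aligned_texts))
    print("shuffled:", clip_contrastive_loss(images, shuffled_texts))
```

The loss depends only on a global similarity between a whole image and a whole caption, which is one way to see why this signal alone gives little pressure to learn localized, per-region features.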

Code & Models

Repositories

mohammedsb/harmony (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Image Retrieval and Classification Techniques

Methods: Contrastive Learning · Masked Autoencoder · Contrastive Language-Image Pre-training