Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations
Mohammed Baharoon, Jonathan Klein, Dominik L. Michels

TL;DR
Harmony is a framework that combines vision-language contrastive learning with discriminative and generative self-supervision to learn general-purpose visual representations, with a focus on dense prediction tasks such as segmentation and detection. It is designed to train on web-scraped data without negative samples.
Contribution
Harmony introduces a unified training framework that integrates vision-language contrastive, discriminative self-supervised, and generative self-supervised objectives, and addresses key challenges in weakly-supervised visual learning from web data.
Findings
Outperforms baseline CLIP on various tasks
Surpasses previous joint self- and weakly-supervised methods
Effective in data-constrained scenarios
Abstract
Vision-language contrastive learning frameworks such as CLIP enable learning representations from natural language supervision and provide strong zero-shot classification capabilities. However, due to the nature of the supervisory signal in these paradigms, they lack the ability to learn localized features, leading to degraded performance on dense prediction tasks such as segmentation and detection. On the other hand, self-supervised learning methods have shown the ability to learn granular representations, complementing the high-level features in vision-language training. In this work, we present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features that can be generalized across different downstream vision tasks. Our framework is specifically designed to work on web-scraped data by not relying on…
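For context, the CLIP-style contrastive objective that the abstract builds on can be sketched as below. This is a generic symmetric image-text contrastive loss written with the Python standard library, not Harmony's implementation. The function names, the toy embeddings, and the temperature value of 0.07 are illustrative assumptions. The sketch shows why such an objective treats every non-matching pair in a batch as a negative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    Pair k (image k, text k) is the positive; every other pairing in the
    batch acts as a negative. The temperature is an assumed, illustrative value.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should pick out its own caption.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should pick out its own image.
    loss_t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two orthogonal image embeddings with matching or swapped captions.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, correctly paired embeddings give a much lower loss than swapped ones. Because the signal is one caption-level label per image, it offers no direct supervision for localized features, which is the limitation the abstract describes.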
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Image Retrieval and Classification Techniques
Methods: Contrastive Learning · Masked Autoencoder · Contrastive Language-Image Pre-training
