How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning?

Thomas M. Hehn; Julian F.P. Kooij; Dariu M. Gavrila

arXiv:2211.13309 · cs.CV · November 28, 2022

Open Access

TL;DR

This paper investigates how cross-view and cross-modal alignment in contrastive learning shape visual representations learned from images and point clouds. It finds that cross-modal alignment emphasizes depth cues at the expense of color and texture, which improves downstream depth prediction.

Contribution

The paper analyzes the effects of cross-view and cross-modal alignment on learned representations using 108 models, trained with four pretraining variations and evaluated on five real-world datasets and five tasks.

Findings

01. Cross-modal alignment discards color and texture information.

02. Depth cues from pretraining improve depth prediction.

03. Cross-modal alignment yields more robust encoders for certain tasks.

Abstract

Various state-of-the-art self-supervised visual representation learning approaches take advantage of data from multiple sensors by aligning the feature representations across views and/or modalities. In this work, we investigate how aligning representations affects the visual features obtained from cross-view and cross-modal contrastive learning on images and point clouds. On five real-world datasets and on five tasks, we train and evaluate 108 models based on four pretraining variations. We find that cross-modal representation alignment discards complementary visual information, such as color and texture, and instead emphasizes redundant depth cues. The depth cues obtained from pretraining improve downstream depth prediction performance. Also overall, cross-modal alignment leads to more robust encoders than pre-training by cross-view alignment, especially on depth prediction, instance…
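As a rough illustration of what "aligning representations" means in a contrastive setting, the sketch below computes an InfoNCE-style loss between paired embeddings. The abstract does not state which loss the authors use, so InfoNCE is an assumption here, and the names `info_nce`, `image_emb`, and `point_emb_*` as well as the toy vectors are hypothetical. In a cross-modal setup the two inputs would be image and point-cloud embeddings of the same scene; in a cross-view setup they would be embeddings of two views from the same modality.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss, assuming anchors[i] and positives[i] form the matching pair.

    All other positives[j] (j != i) in the batch act as negatives for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Temperature-scaled cosine similarity of anchor i to every candidate.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true partner.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy 2-D embeddings (hypothetical): two scenes, one embedding per modality.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
point_emb_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each scene close to its image embedding
point_emb_swapped = [[0.1, 0.9], [0.9, 0.1]]   # partners deliberately mismatched

loss_aligned = info_nce(image_emb, point_emb_aligned)
loss_swapped = info_nce(image_emb, point_emb_swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each embedding is closest to its own partner and large when partners are mismatched. Minimizing it therefore pulls paired embeddings together, which is the alignment pressure the abstract refers to.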

Taxonomy

Topics: Advanced Vision and Imaging · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning