Design of the topology for contrastive visual-textual alignment
Zhun Sun

TL;DR
This paper investigates the role of the softmax temperature in contrastive visual-textual alignment and proposes a new topology using an oblique manifold to improve zero-shot classification performance.
Contribution
It introduces a novel topology for embedding alignment using an oblique manifold and demonstrates its effectiveness in enhancing zero-shot classification accuracy.
Findings
Improves zero-shot classification performance by an average of 6.1%.
Highlights the softmax temperature as a key factor in contrastive learning on noisy data.
Proposes a topology that better captures the embedding space structure for contrastive tasks.
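To make the role of the temperature concrete, the following is a minimal NumPy sketch of a temperature-scaled, symmetric contrastive loss over cosine similarities. It is an illustration under stated assumptions, not the paper's implementation: the function names (`l2_normalize`, `log_softmax`, `contrastive_loss`) and the toy identity embeddings are hypothetical, and the temperature is a fixed argument here rather than a learned parameter.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)

def log_softmax(z, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_loss(img, txt, temperature):
    """Symmetric cross-entropy over temperature-scaled cosine similarities.

    img, txt: arrays of shape (batch, dim); row i of img is paired with row i of txt.
    temperature: positive scalar (hypothetical fixed value; learnable in practice).
    """
    cos = l2_normalize(img) @ l2_normalize(txt).T  # entries lie in [-1, 1]
    logits = cos / temperature                     # range stretched to [-1/t, 1/t]
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy case: four perfectly aligned, mutually orthogonal pairs.
img = np.eye(4)
txt = np.eye(4)
loss_warm = contrastive_loss(img, txt, temperature=1.0)
loss_cold = contrastive_loss(img, txt, temperature=0.01)
print(loss_warm, loss_cold)
```

Even for perfectly aligned pairs, the loss at temperature 1 cannot approach zero, because the logits are confined to [-1, 1]. Dividing by a small temperature widens that range, which is the scaling effect the findings refer to.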
Abstract
Cosine similarity is the common choice for measuring the distance between the feature representations in contrastive visual-textual alignment learning. However, empirically, a learnable softmax temperature parameter is required when learning on large-scale noisy training data. In this work, we first discuss the role of the softmax temperature from the perspective of the embedding space's topological properties. We argue that the softmax temperature is the key mechanism for contrastive learning on noisy training data. It acts as a scaling factor of the distance range (e.g., [-1, 1] for the cosine similarity), and its learned value indicates the level of noise in the training data. Then, we propose an alternative design of the topology for the embedding alignment. We make use of multiple class tokens in the transformer architecture, then map the feature representations onto an oblique manifold endowed with the…
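The mapping onto an oblique manifold can be sketched as follows. An oblique manifold is a product of unit spheres, so a representation made of m class tokens lands on it once each token is normalized to unit length. This is a rough illustration only: the abstract is truncated before it names the metric, so the `trace_similarity` function below (a sum of per-token inner products) is an assumption made for demonstration, and the names `to_oblique`, `img_tokens`, and `txt_tokens`, along with the token count and dimension, are hypothetical.

```python
import numpy as np

def to_oblique(tokens, eps=1e-12):
    """Normalize every class token to unit length.

    tokens: array of shape (batch, m, n) holding m class tokens of dimension n.
    The result is a point on a product of m unit spheres (an oblique manifold).
    """
    norms = np.linalg.norm(tokens, axis=-1, keepdims=True)
    return tokens / np.maximum(norms, eps)

def trace_similarity(x, y):
    """ASSUMED similarity: sum over tokens of per-token inner products.

    x: (B1, m, n), y: (B2, m, n) -> (B1, B2). For unit-norm tokens the values
    lie in [-m, m], a wider range than the [-1, 1] of cosine similarity.
    """
    return np.einsum("imn,jmn->ij", x, y)

rng = np.random.default_rng(0)
img_tokens = rng.normal(size=(3, 4, 16))  # 3 images, 4 tokens, 16 dims each
txt_tokens = rng.normal(size=(5, 4, 16))  # 5 texts, same token layout

img_ob = to_oblique(img_tokens)
txt_ob = to_oblique(txt_tokens)
sim = trace_similarity(img_ob, txt_ob)
print(sim.shape)
```

Under this assumed similarity, the attainable range grows with the number of tokens, which gives one intuition for how a change of topology could take over part of the range-scaling job that the temperature performs for cosine similarity.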
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Softmax
