GeRA: Label-Efficient Geometrically Regularized Alignment
Dustin Klebe, Tal Shnitzer, Mikhail Yurochkin, Leonid Karlinsky, Justin Solomon

TL;DR
GeRA is a semi-supervised method that improves the alignment of pretrained unimodal encoders using geometric regularization, reducing the need for large paired datasets and preserving local manifold structures.
Contribution
We introduce a modality-agnostic geometric regularization technique that enhances alignment of pretrained encoders with limited paired data by leveraging manifold geometry.
Findings
Significant improvement in speech-text and image-text alignment quality.
Effective with small amounts of paired data due to geometric regularization.
Outperforms leading baselines in alignment tasks.
Abstract
Pretrained unimodal encoders incorporate rich semantic information into embedding space structures. To be similarly informative, multi-modal encoders typically require massive amounts of paired data for alignment and training. We introduce a semi-supervised Geometrically Regularized Alignment (GeRA) method to align the embedding spaces of pretrained unimodal encoders in a label-efficient way. Our method leverages the manifold geometry of unpaired (unlabeled) data to improve alignment performance. To prevent distortions to local geometry during the alignment process, potentially disrupting semantic neighborhood structures and causing misalignment of unobserved pairs, we introduce a geometric loss term. This term is built upon a diffusion operator that captures the local manifold geometry of the unimodal pretrained encoders. GeRA is modality-agnostic and thus can be used to align…
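To make the idea of a diffusion-operator-based geometric loss concrete, here is a minimal NumPy sketch. It is not the authors' exact formulation, which the truncated abstract does not give. The Gaussian kernel, the bandwidth `sigma`, the row normalization, the mean-squared-error comparison, and the names `diffusion_operator` and `geometric_loss` are all assumptions chosen for illustration.

```python
import numpy as np


def diffusion_operator(X, sigma=1.0):
    """Row-stochastic diffusion operator built from a Gaussian kernel.

    X has shape (n_points, dim). The kernel and bandwidth are illustrative
    assumptions, not the paper's stated choices.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped to avoid tiny negatives.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Normalize each row so it sums to 1 (a Markov transition matrix).
    return K / K.sum(axis=1, keepdims=True)


def geometric_loss(X_pretrained, X_aligned, sigma=1.0):
    """Hypothetical regularizer: penalize changes in local geometry.

    Compares the diffusion operator of the pretrained embeddings with the
    operator of the same points after alignment.
    """
    P_ref = diffusion_operator(X_pretrained, sigma)
    P_new = diffusion_operator(X_aligned, sigma)
    return float(np.mean((P_ref - P_new) ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))  # stand-in for unpaired pretrained embeddings

# A rotation preserves all pairwise distances, so the loss is near zero.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
loss_rotation = geometric_loss(X, X @ Q)

# Stretching one coordinate distorts neighborhoods, so the loss is positive.
X_distorted = X * np.array([10.0, 1.0, 1.0, 1.0])
loss_distorted = geometric_loss(X, X_distorted)
```

In a training setup along these lines, such a term would be added to the alignment objective on paired data and evaluated on unpaired points, so that the learned mapping is discouraged from rearranging semantic neighborhoods it never sees pairs for.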
Peer Reviews
Decision · Submitted to ICLR 2024
- The whole paper is well written and easy to follow.
- Efficiently aligning the embedding spaces of unimodal encoders by preserving the locality of unpaired points is convincing.
- The figures and charts are well presented. Both Fig. 1 and Fig. 2 illustrate GeRA clearly.
- This paper has limited novelty. Adding a geometric regularization term is too conventional in manifold learning.
- The motivation of this paper needs further discussion. There are millions or even billions of paired speech-text and image-text data points, so why do we need a label-efficient semi-supervised method?
The proposed method stands out by focusing on preserving local geometric structures, which are critical for retaining the rich semantic information within the manifold structure. The method can capture additional information from pretrained unimodal encoders, making it highly valuable in scenarios where paired data is limited. It does not rely on domain-specific knowledge or augmentation and can be applied across various encoders and data modalities, as long as pretrained models are available.
The authors propose kernel-based encoding methods for capturing the local geometric information of each sample. Several existing works already capture local geometry in an RKHS in semi-supervised settings, either by constructing neighbor-based data-dependent norms or by leveraging graph Laplacians in manifold regularization. To list a few: V. Sindhwani, et al. Beyond the point cloud: from transductive to semi-supervised learning; X. Zhu, et al. Semi-su
- Results in Figure 3 and Figure 6 show that the proposed method is better than training with contrastive loss and two other baselines when training on fewer than 10^5 paired data points on both image-text and speech-text alignment tasks. - GeRA has been shown to be effective for two alignment learning tasks, image-text and speech-text alignment, in the low-data regime.
- One of the major motivations in the introduction for the method is to use unpaired data for training. However, I cannot find any experiment in section 5 that trains on a mixture of unpaired and paired data where paired data is small. If so, please name the dataset used in section 5. Is unpaired data referring to the data used for pretraining the models? If no experiments are done that use unpaired data during the alignment, at least the following sentence in the abstract should be corrected: “
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Human Pose and Action Recognition
Methods: ALIGN · Diffusion
