Group Contrastive Learning for Weakly Paired Multimodal Data
Aditya Gorla, Hugues Van Assel, Jan-Christian Huetter, Heming Yao, Kyunghyun Cho, Aviv Regev, Russell Littman

TL;DR
GROOVE introduces a semi-supervised multi-modal learning method with a novel group-level contrastive loss, effectively handling weakly paired data and improving cross-modal representation tasks in high-content perturbation datasets.
Contribution
The paper proposes GroupCLIP, a new contrastive loss for weakly paired multimodal data, integrated with an autoencoder framework, and introduces a comprehensive evaluation framework for such methods.
Findings
GROOVE performs on par with or better than existing methods on real datasets.
GroupCLIP is identified as the key component for performance improvements.
The evaluation framework reveals no single aligner dominates across all settings.
Abstract
We present GROOVE, a semi-supervised multi-modal representation learning approach for high-content perturbation data where samples across modalities are weakly paired through shared perturbation labels but lack direct correspondence. Our primary contribution is GroupCLIP, a novel group-level contrastive loss that bridges the gap between CLIP for paired cross-modal data and SupCon for uni-modal supervised contrastive learning, addressing a fundamental gap in contrastive learning for weakly-paired settings. We integrate GroupCLIP with an on-the-fly backtranslating autoencoder framework to encourage cross-modally entangled representations while maintaining group-level coherence within a shared latent space. Critically, we introduce a comprehensive combinatorial evaluation framework that systematically assesses representation learners across multiple optimal transport aligners, addressing…
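As a rough illustration of how a group-level contrastive loss can sit between CLIP and SupCon, the sketch below treats every sample in the other modality that shares the anchor's perturbation label as a positive. It is not the paper's GroupCLIP formulation, which is not reproduced here. The function name `group_contrastive_loss`, the temperature of 0.1, the symmetric averaging over both directions, and the toy `rna`/`img` embeddings are all assumptions made for illustration.

```python
import math


def _normalize(v):
    """Scale a (nonzero) vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(x - m) for x in values))


def _one_direction(anchors, anchor_labels, targets, target_labels, temperature):
    """SupCon-style loss: anchors from one modality, candidates from the other."""
    losses = []
    for a, la in zip(anchors, anchor_labels):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        positives = [k for k, lt in enumerate(target_labels) if lt == la]
        if not positives:
            continue  # this group has no sample in the other modality
        log_denominator = _logsumexp(logits)
        log_probs = [logits[k] - log_denominator for k in positives]
        losses.append(-sum(log_probs) / len(positives))
    if not losses:
        raise ValueError("no anchor shares a group label with the other modality")
    return sum(losses) / len(losses)


def group_contrastive_loss(z_a, labels_a, z_b, labels_b, temperature=0.1):
    """Hypothetical group-level cross-modal contrastive loss (illustrative only)."""
    z_a = [_normalize(v) for v in z_a]
    z_b = [_normalize(v) for v in z_b]
    a_to_b = _one_direction(z_a, labels_a, z_b, labels_b, temperature)
    b_to_a = _one_direction(z_b, labels_b, z_a, labels_a, temperature)
    return 0.5 * (a_to_b + b_to_a)


# Toy example: 3 cells in one modality, 4 in the other, linked only by labels.
z_rna = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
rna_labels = ["drugA", "drugA", "drugB"]
z_img = [[1.0, 0.1], [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
img_labels = ["drugA", "drugB", "drugB", "drugB"]

aligned = group_contrastive_loss(z_rna, rna_labels, z_img, img_labels)
# Swapping the group labels in one modality should make the loss larger.
swapped = ["drugB", "drugA", "drugA", "drugA"]
shuffled = group_contrastive_loss(z_rna, rna_labels, z_img, swapped)
print(aligned < shuffled)
```

In this sketch the two modalities may contain different numbers of samples per group, since no per-sample pairing is used. When every label appears exactly once in each modality, each anchor has a single cross-modal positive and the expression reduces to a CLIP-style symmetric InfoNCE loss.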
Peer Reviews
Decision: Submitted to ICLR 2026
- GroupCLIP is an extension that combines SupCon (using supervised class labels for contrastive learning) and CLIP (for cross-modal alignment). Technically, it is a straightforward extension, but I find this simplicity a strength rather than a weakness.
- The motivation of the work is clear and addresses an important gap; multimodal methods that can leverage weakly paired data are crucial for biological applications where true paired measurements are experimentally infeasible.
- A well-motivate

- W1: The contributions of the paper are minimal. Besides GroupCLIP, the second contribution is the backtranslating autoencoder. This, however, doesn’t seem to have any positive effect on GROOVE’s performance.
- W2: The experimental analysis is limited to only two baselines. The authors discuss many more methods in the related work, but it’s unclear how these methods differ from GROOVE and why they weren’t chosen for benchmarking (for instance, Samaran et al., 2024).
- W3: Inconclusive findings wr

1) The paper tackles the important and practical challenge of learning from weakly paired multimodal data (where only group labels connect modalities), a common scenario in biological perturbation screens.
2) The core contribution, the GroupCLIP loss, effectively bridges the gap between cross-modal contrastive learning and uni-modal supervised contrastive learning for this specific weakly paired setting.
3) The proposed method, GROOVE, outperforms the most comparable methods on real single-cell d

1) Many of the performance differences reported in the simulation results (Table 1 Bary. FOSCTTM, Table 2) appear small and likely not statistically significant given the overlapping standard errors. The authors should provide a statistical test to show significant improvement.
2) The paper doesn't provide any analysis on a) sensitivity to hyperparameters alpha and beta that balance the GroupCLIP and reconstruction/backtranslation losses. It's unclear if the chosen values generalize or require

- The paper addresses a realistic regime where modalities cannot be co-measured on the same cell, making group-level supervision both natural and necessary.
- The paper is well written and easy to follow.
- The evaluation design separates representation learning from alignment by sweeping multiple labeled OT variants, which reduces confounding and yields a more credible comparison across methods.
- The ablations are clear and show that removing the group-contrastive term degrades performance co

- CLIP’s web pairs are often weak and effectively many-to-one (e.g., many different dog images paired with near-identical captions), so large-scale CLIP training already approximates a group-level supervision regime rather than strict instance pairing. Thus it is important for the paper to show clear advantages in regimes where per-instance captions carry little unique information beyond a coarse label. The key claimed difference is that GROOVE does not need any per-instance pairing at all, wher
Taxonomy
Topics: Single-cell and spatial transcriptomics · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
