Selective Contrastive Learning for Gloss-Free Sign Language Translation
Changhao Lai, Rui Zhao, Xuewen Zhong, Jinsong Su, Yidong Chen

TL;DR
This paper introduces Selective Contrastive Learning for gloss-free sign language translation (SLT). It improves cross-modal alignment by selecting negatives according to how their video-text similarity evolves during training, which reduces the noisy supervision that random in-batch contrast introduces.
Contribution
It proposes a novel Pair Selection strategy that enhances contrastive learning by focusing on challenging negatives, leading to more effective SLT training.
Findings
Selective contrastive learning improves translation accuracy.
The Pair Selection strategy reduces noise from invalid negatives.
Curriculum-based mini-batch construction enhances training effectiveness.
Abstract
Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity…
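To make the false-negative problem concrete, the following is a minimal sketch of a CLIP-style symmetric in-batch contrastive loss with an optional mask over negatives. It is not the authors' Pair Selection strategy, which the truncated abstract does not specify. The function name masked_info_nce, the keep_negative argument, the temperature value, and the rule of dropping negatives whose text is identical to the positive are all assumptions chosen for illustration.

```python
import numpy as np


def masked_info_nce(video_emb, text_emb, keep_negative=None, temperature=0.07):
    """Symmetric in-batch contrastive loss with an optional negative mask.

    keep_negative[i, j] == True means (video i, text j) is used as a negative.
    Diagonal entries are the positives and are always kept.
    Hypothetical helper for illustration, not the paper's implementation.
    """
    # L2-normalise so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    n = logits.shape[0]

    keep = np.ones((n, n), dtype=bool) if keep_negative is None else keep_negative.copy()
    np.fill_diagonal(keep, True)
    # Dropped negatives get -inf so they contribute exp(-inf) = 0 to the softmax.
    masked = np.where(keep, logits, -np.inf)

    def directional_loss(m):
        # Numerically stable log-sum-exp over each row, minus the positive logit.
        m_max = m.max(axis=1, keepdims=True)
        lse = m_max[:, 0] + np.log(np.exp(m - m_max).sum(axis=1))
        return np.mean(lse - np.diag(m))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (directional_loss(masked) + directional_loss(masked.T))


# Toy batch: samples 0 and 2 share the same sentence, so under random
# in-batch contrast each would be treated as a negative of the other.
texts = ["hello", "good morning", "hello", "thank you"]
keep_negative = np.array([[a != b for b in texts] for a in texts])

rng = np.random.default_rng(0)
video_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(4, 8))

loss_random = masked_info_nce(video_emb, text_emb)
loss_masked = masked_info_nce(video_emb, text_emb, keep_negative)
```

The mask here only removes exact text duplicates. A selection rule driven by similarity trajectories, as the abstract describes, would instead need per-pair similarity histories collected over training steps.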
