Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation
Ozge Mercanoglu Sincan, Richard Bowden

TL;DR
This paper introduces a dual visual encoder framework for gloss-free sign language translation, leveraging contrastive pretraining to improve translation accuracy without relying on costly gloss annotations.
Contribution
The work presents a dual visual encoder architecture with contrastive pretraining for gloss-free sign language translation. On the Phoenix-2014T benchmark it outperforms its single-stream variants and reports the highest BLEU-4 score among gloss-free methods.
Findings
Outperforms single encoder models on Phoenix-2014T
Achieves highest BLEU-4 score among gloss-free approaches
Demonstrates effectiveness of contrastive pretraining in SLT
Abstract
Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as an intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT, leveraging contrastive visual-language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. During the downstream SLT task, we fuse the visual features and input them into an encoder-decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single stream variants and achieves the highest BLEU-4 score among existing…
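The pretraining objective described in the abstract, where two visual streams are aligned with each other and with sentence-level text embeddings, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The symmetric InfoNCE form, the temperature of 0.07, the equal weighting of the three pairwise terms, the concatenation-based fusion, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Row i of `a` and row i of `b` form the positive pair; all other rows
    in the batch act as negatives. The temperature is an assumed value.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)


def dual_encoder_contrastive_loss(vis_a, vis_b, text, temperature=0.07):
    """Hypothetical joint objective: each visual stream is aligned with the
    text, and the two streams with each other (equal weights assumed)."""
    return (
        symmetric_info_nce(vis_a, text, temperature)
        + symmetric_info_nce(vis_b, text, temperature)
        + symmetric_info_nce(vis_a, vis_b, temperature)
    )


def fuse_features(vis_a, vis_b):
    """Assumed fusion for the downstream stage: feature concatenation."""
    return np.concatenate([vis_a, vis_b], axis=-1)


rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))  # 8 sentence-level embeddings, 16 dims

# Perfectly aligned streams versus text paired with the wrong videos.
aligned_loss = dual_encoder_contrastive_loss(base, base, base)
mismatched_loss = dual_encoder_contrastive_loss(base, base, base[::-1])
fused = fuse_features(base, base)
```

In this toy setup the aligned batch yields a lower loss than the mismatched one, which is the behaviour a contrastive objective is meant to reward. In a real system the inputs would be pooled outputs of the two visual backbones and a text encoder, and the fused features would feed the encoder-decoder translation model.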
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Tactile and Sensory Interactions
