Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation

Ozge Mercanoglu Sincan; Richard Bowden

arXiv:2507.10306·cs.CV·July 15, 2025

Open Access

TL;DR

This paper introduces a dual visual encoder framework for gloss-free sign language translation, leveraging contrastive pretraining to improve translation accuracy without relying on costly gloss annotations.

Contribution

The work presents a dual visual encoder architecture with contrastive visual-language pretraining for gloss-free sign language translation. On Phoenix-2014T, it outperforms its single-stream variants and reports the highest BLEU-4 score among gloss-free methods.

Findings

01. Outperforms single encoder models on Phoenix-2014T

02. Achieves highest BLEU-4 score among gloss-free approaches

03. Demonstrates effectiveness of contrastive pretraining in SLT

Abstract

Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT, leveraging contrastive visual-language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. During the downstream SLT task, we fuse the visual features and input them into an encoder-decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single-stream variants and achieves the highest BLEU-4 score among existing…
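The pretraining objective described in the abstract, two visual streams aligned with each other and with sentence-level text embeddings, can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything here is an assumption for illustration: a CLIP-style symmetric InfoNCE term per pair, equal weighting of the three pairs, a temperature of 0.07, and pooled one-vector-per-sample embeddings. The function names (`symmetric_info_nce`, `dual_encoder_contrastive_loss`) are hypothetical and do not come from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    # a, b: (batch, dim) embeddings; row i of a is the positive for row i of b.
    # Assumption: CLIP-style symmetric InfoNCE, not necessarily the paper's loss.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def dual_encoder_contrastive_loss(vis_a, vis_b, text, temperature=0.07):
    # Align each visual stream with the text and the two streams with each other.
    # Assumption: the three pairwise terms are weighted equally.
    return (symmetric_info_nce(vis_a, text, temperature)
            + symmetric_info_nce(vis_b, text, temperature)
            + symmetric_info_nce(vis_a, vis_b, temperature)) / 3.0

# Toy usage with random stand-ins for encoder outputs (batch of 4, 16 dimensions).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
vis_a = text + 0.01 * rng.normal(size=(4, 16))  # stream A, nearly aligned with text
vis_b = text + 0.01 * rng.normal(size=(4, 16))  # stream B, nearly aligned with text
aligned = dual_encoder_contrastive_loss(vis_a, vis_b, text)
shuffled = dual_encoder_contrastive_loss(vis_a[::-1], vis_b[::-1], text)
print(aligned, shuffled)
```

In this toy setup the loss is lower when each video embedding is paired with its own sentence than when the visual batch is reversed against the text, which is the behaviour a contrastive alignment objective is meant to reward.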

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Tactile and Sensory Interactions