Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Sobhan Asasi; Mohamed Ilyas Lakhal; Ozge Mercanoglu Sincan; Richard Bowden

arXiv:2507.23575·cs.CV·September 3, 2025

TL;DR

BeyondGloss is a hand-centric, gloss-free sign language translation framework that uses VideoLLMs and contrastive learning to improve fine-grained hand motion understanding, achieving state-of-the-art results on Phoenix14T and CSL-Daily.

Contribution

It proposes a new gloss-free SLT framework that strengthens hand-specific modeling through VideoLLMs, contrastive alignment, and feature distillation from HaMeR, addressing the difficulty existing VideoLLMs have in modeling long videos in detail.

Findings

01. Achieves state-of-the-art results on Phoenix14T and CSL-Daily benchmarks.

02. Effectively models hand-centric temporal dynamics in sign language.

03. Reduces modality gap through contrastive pre-training.

Abstract

Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a…
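
The contrastive alignment between hand-motion descriptions and video features can be illustrated with a generic symmetric InfoNCE (CLIP-style) loss. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the abstract does not specify the exact loss, and the function names, the temperature of 0.07, the one-pooled-vector-per-clip features, and the random toy data are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired features.

    video_feats, text_feats: arrays of shape (batch, dim), where row i of each
    array is assumed to come from the same clip (hypothetical pooled features).
    """
    v = l2_normalize(video_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(v))                  # matching pairs sit on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


# Toy data: stand-ins for description embeddings and video features.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
aligned_video = text + 0.01 * rng.normal(size=(4, 16))  # nearly matching pairs
shuffled_video = aligned_video[[1, 2, 3, 0]]            # every pair mismatched

loss_aligned = symmetric_contrastive_loss(aligned_video, text)
loss_shuffled = symmetric_contrastive_loss(shuffled_video, text)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

In this sketch the loss is lower when each video feature sits close to its own description embedding than when the pairs are shuffled, which is the behavior a contrastive objective rewards when reducing the modality gap.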
