SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation
JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden

TL;DR
This paper introduces SAGE, a segment-aware encoding method for gloss-free sign language translation. It uses sign segmentation and contrastive alignment to shorten the input sequence and lower memory usage, and it outperforms the previous state of the art on PHOENIX14T.
Contribution
The paper proposes a segment-aware visual tokenization framework and a token-to-token contrastive alignment objective with dual-level supervision. Together they improve sign language translation without gloss annotations while reducing computational demands.
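To make the idea of segment-aware tokenization concrete, here is a minimal sketch that collapses per-frame features into one token per sign segment. It assumes the segment boundaries already come from some sign segmentation model and uses mean pooling as the aggregation step. Both the function name `pool_segments` and the choice of mean pooling are illustrative assumptions, not details confirmed by the paper.

```python
from typing import List, Tuple


def pool_segments(
    frames: List[List[float]], segments: List[Tuple[int, int]]
) -> List[List[float]]:
    """Mean-pool frame features inside each [start, end) segment.

    Hypothetical helper: returns one token per segment. The real method
    may aggregate frames differently; mean pooling is an assumption.
    """
    tokens = []
    for start, end in segments:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid segment ({start}, {end})")
        chunk = frames[start:end]
        dim = len(chunk[0])
        # Average each feature dimension over the frames in the segment.
        tokens.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return tokens


# Toy input: 6 frames with 2-dimensional features, grouped into 3 segments.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [5.0, 5.0]]
segments = [(0, 2), (2, 5), (5, 6)]
tokens = pool_segments(frames, segments)
print(len(frames), "frames ->", len(tokens), "tokens")
```

In this toy case six frames become three tokens, so the sequence passed to the translation model is half as long. The actual reduction depends on how many segments the segmenter produces for a given video.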
Findings
Reduces input sequence length by up to 50% compared to prior methods.
Achieves up to 2.67x lower memory usage.
Outperforms state-of-the-art on PHOENIX14T.
Abstract
Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performance without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67x lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with a dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves…
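The abstract does not give the form of the token-to-token contrastive objective. As a rough illustration of what such an objective can look like, the sketch below implements a generic symmetric InfoNCE loss over paired visual and text token embeddings. It assumes that visual token i is paired with text token i and uses a temperature of 0.07. The pairing scheme, the temperature, and the function name `info_nce` are assumptions made for illustration, not the authors' formulation.

```python
import math
from typing import List


def _normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(
    visual: List[List[float]], text: List[List[float]], temperature: float = 0.07
) -> float:
    """Symmetric InfoNCE loss; assumes visual[i] is the positive for text[i]."""
    n = len(visual)
    if n == 0 or n != len(text):
        raise ValueError("need the same non-zero number of visual and text tokens")
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in text]
    # Cosine similarity between every visual/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(row: List[float], target: int) -> float:
        # Numerically stable log-sum-exp minus the positive logit.
        m = max(row)
        return m + math.log(sum(math.exp(x - m) for x in row)) - row[target]

    v2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    t2v = sum(cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Matching pairs should give a lower loss than mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each visual token is most similar to its own text token and grows when the pairs are mismatched, which is the behaviour a contrastive alignment objective relies on. The dual-level supervision over language embeddings and intermediate hidden states mentioned in the abstract is not modelled here.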
