Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment
Youngjoon Jang, Liliane Momeni, Zifan Jiang, Joon Son Chung, Gül Varol, Andrew Zisserman

TL;DR
This paper introduces a unified, scalable model for sign language translation and subtitle alignment that leverages multilingual pretraining and a novel architecture to achieve state-of-the-art results and cross-linguistic generalization.
Contribution
The paper presents a new multi-task model combining visual feature extraction, a Sliding Perceiver network, and scalable training for sign language translation and alignment, with multilingual pretraining.
Findings
Achieves state-of-the-art results on a BSL dataset for both SLT and SSA.
Demonstrates robust zero-shot generalization to ASL.
Effective multilingual pretraining enhances cross-linguistic performance.
Abstract
Our aim is to develop a unified model for sign language understanding, that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and also the temporal alignment of signing with subtitles -- both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and…
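To make component (ii) more concrete, the following is a minimal sketch of how a sliding-window, Perceiver-style aggregator could turn a sequence of frame-level features into a shorter sequence of embeddings. It is not the authors' implementation: the function name `sliding_perceiver`, the `window` and `stride` values, the use of single-head attention without learned projections, and the random latent queries are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sliding_perceiver(features, latents, window=8, stride=4):
    """Hypothetical sliding-window cross-attention aggregator.

    features: (T, D) array of frame-level visual features.
    latents:  (K, D) array of latent queries (learned in a real model).
    Returns a (num_windows * K, D) array: K embeddings per temporal window.
    """
    T, D = features.shape
    outputs = []
    # Slide a fixed-size window over time; a short clip yields one window.
    for start in range(0, max(T - window, 0) + 1, stride):
        chunk = features[start:start + window]       # (W, D)
        scores = latents @ chunk.T / np.sqrt(D)      # (K, W)
        attn = softmax(scores, axis=-1)              # each row sums to 1
        outputs.append(attn @ chunk)                 # (K, D)
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 4))    # 16 frames, 4-dim features (toy sizes)
queries = rng.normal(size=(2, 4))    # 2 latent queries per window
embeddings = sliding_perceiver(frames, queries)
print(embeddings.shape)
```

With 16 frames, a window of 8 and a stride of 4, the windows start at frames 0, 4 and 8, so the sketch emits 3 × 2 = 6 embeddings. In a trained model these would then be passed to the text decoder; here they are only attention-weighted averages of the input frames.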
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Interactive and Immersive Displays
