Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment

Youngjoon Jang; Liliane Momeni; Zifan Jiang; Joon Son Chung; Gül Varol; Andrew Zisserman

arXiv:2512.08040·cs.CV·December 10, 2025

Open Access

TL;DR

This paper introduces a unified, scalable model for sign language translation (SLT) and sign-subtitle alignment (SSA). It combines multilingual pretraining with a Sliding Perceiver mapping network, and it reports state-of-the-art results on BSL and zero-shot generalization to ASL.

Contribution

The paper presents a multi-task model for sign language translation and sign-subtitle alignment that combines a lightweight visual backbone, a Sliding Perceiver mapping network, and a scalable joint training strategy with multilingual pretraining.

Findings

01. Achieves state-of-the-art results on a BSL dataset for both SLT and SSA.

02. Demonstrates robust zero-shot generalization to ASL.

03. Multilingual pretraining improves cross-linguistic performance.

Abstract

Our aim is to develop a unified model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and the temporal alignment of signing with subtitles, both of which are essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and…
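Component (ii) of the abstract, aggregating consecutive visual features into word-level embeddings, can be illustrated with a toy sketch. This is not the paper's Sliding Perceiver: the excerpt above does not specify its design, so the window size, stride, single latent query, and the absence of learned projections are all assumptions made for illustration. The function names `cross_attend` and `sliding_pool` are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, frames):
    """One latent query attends over the frames of a single window.

    Assumption: keys and values are the raw frame features (no learned
    projections), which keeps the sketch short.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, frame)) / scale for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]

def sliding_pool(features, query, window=8, stride=4):
    """Slide a window over per-frame features and emit one embedding per window.

    window and stride are illustrative values, not taken from the paper.
    Trailing frames that do not fill a whole window are ignored here.
    """
    if len(features) <= window:
        starts = [0]
    else:
        starts = range(0, len(features) - window + 1, stride)
    return [cross_attend(query, features[s:s + window]) for s in starts]

# Toy input: 32 frames of 16-dimensional features (random stand-ins for
# the output of a visual backbone).
random.seed(0)
T, D = 32, 16
features = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
query = [random.gauss(0, 1) for _ in range(D)]

embeddings = sliding_pool(features, query)
print(len(embeddings), len(embeddings[0]))
```

The point of the sketch is the shape change: a long sequence of frame-level features becomes a shorter sequence of pooled embeddings, each summarising a local temporal window, which is closer in length to a sequence of words.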

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Interactive and Immersive Displays