Scaling Sign Language Translation
Biao Zhang, Garrett Tanzer, Orhan Firat

TL;DR
This paper advances sign language translation (SLT) by scaling pretraining data, model size, and the number of translation directions, improving substantially over baselines and demonstrating zero-shot translation across multiple languages.
Contribution
The paper introduces a large-scale, unified pretraining approach for SLT that combines diverse data sources with model scaling, enabling open-domain translation and zero-shot transfer.
Findings
Significant performance improvements over baselines.
Successful zero-shot sign language translation.
Effective cross-lingual and cross-modal transfer.
Abstract
Sign language translation (SLT) addresses the problem of translating information from a sign language in video to a spoken language in text. Existing studies, while showing progress, are often limited to narrow domains and/or few sign languages and struggle with open-domain tasks. In this paper, we push forward the frontier of SLT by scaling pretraining data, model size, and number of translation directions. We perform large-scale SLT pretraining on different data including 1) noisy multilingual YouTube SLT data, 2) parallel text corpora, and 3) SLT data augmented by translating video captions to other languages with off-the-shelf machine translation models. We unify different pretraining tasks with task-specific prompts under the encoder-decoder architecture, and initialize the SLT model with pretrained (m/By)T5 models across model sizes. SLT pretraining results on How2Sign and…
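The idea of unifying several pretraining tasks with task-specific prompts can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `Example` fields, the task names `slt`, `aug_slt`, and `mt`, and the prompt wording are hypothetical and are not taken from the paper. The sketch only shows how one encoder-decoder model could be told which task it is performing through a text prefix, with video frames assumed to be supplied to the encoder separately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """One pretraining example (hypothetical schema, not from the paper)."""
    task: str                          # "slt", "aug_slt", or "mt"
    source_lang: str                   # sign language for SLT, spoken language for MT
    target_lang: str                   # spoken language of the target text
    target_text: str                   # reference translation
    source_text: Optional[str] = None  # set only for text-to-text MT
    video_id: Optional[str] = None     # set only for video-based tasks


def build_prompt(ex: Example) -> str:
    """Return the text prompt that tells the model which task to perform.

    The prompt wording is an illustrative assumption. For video tasks the
    prompt carries no source content; the frames are fed in separately.
    """
    if ex.task == "mt":
        if ex.source_text is None:
            raise ValueError("MT examples need source_text")
        return (f"translate text {ex.source_lang} to {ex.target_lang}: "
                f"{ex.source_text}")
    if ex.task in ("slt", "aug_slt"):
        if ex.video_id is None:
            raise ValueError("SLT examples need a video_id")
        return f"translate sign {ex.source_lang} to {ex.target_lang}:"
    raise ValueError(f"unknown task: {ex.task}")


# Usage: the three data sources named in the abstract map to three examples.
slt = Example("slt", "ase", "en", "hello", video_id="vid0")
aug = Example("aug_slt", "ase", "de", "hallo", video_id="vid0")
mt = Example("mt", "en", "de", "hallo", source_text="hello")
prompts = [build_prompt(e) for e in (slt, aug, mt)]
```

In this sketch the augmented example reuses the same video with a machine-translated target, so it differs from the original SLT example only in its target language and text.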
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Swearing, Euphemism, Multilingualism
