A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision

Charles Raude; K R Prajwal; Liliane Momeni; Hannah Bull; Samuel Albanie; Andrew Zisserman; Gül Varol

arXiv:2405.10266·cs.CV·May 17, 2024

TL;DR

This paper introduces CSLR2, a multi-task Transformer model that jointly learns sign language recognition and retrieval, utilizing new annotations and weak supervision to significantly outperform previous methods.

Contribution

The paper presents a novel multi-task Transformer model for large-vocabulary sign language recognition and retrieval, with new dataset annotations and a training strategy leveraging weak supervision.

Findings

01

Model outperforms previous state-of-the-art on both tasks.

02

Joint training improves performance for both recognition and retrieval.

03

Weak supervision from subtitles enhances model accuracy.

Abstract

In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR), and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new dataset annotations that have been manually collected. These provide continuous sign-level annotations for six hours of test videos, and will be made publicly available. We demonstrate that by a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance -- retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of…
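The abstract does not name the loss functions it combines, so the following is only an illustrative sketch of what a joint recognition-plus-retrieval objective can look like, not the CSLR2 training code. It assumes a symmetric contrastive (InfoNCE-style) loss for retrieval in the shared video-text embedding space and a plain cross-entropy loss over sign classes for recognition. The function names, the temperature of 0.07, and the `weight` parameter are all hypothetical.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def retrieval_loss(video_emb, text_emb, temperature=0.07):
    """Assumed retrieval objective: symmetric contrastive loss.

    video_emb, text_emb: arrays of shape (B, D); row i of each is a matched pair.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = v @ t.T / temperature  # (B, B) scaled cosine similarities
    idx = np.arange(sim.shape[0])
    video_to_text = -log_softmax(sim, axis=1)[idx, idx].mean()
    text_to_video = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (video_to_text + text_to_video)


def recognition_loss(sign_logits, sign_labels):
    """Assumed recognition objective: cross-entropy over sign classes.

    sign_logits: (N, V) scores over a vocabulary of V signs.
    sign_labels: (N,) integer class indices.
    """
    log_probs = log_softmax(sign_logits, axis=1)
    return -log_probs[np.arange(len(sign_labels)), sign_labels].mean()


def joint_loss(video_emb, text_emb, sign_logits, sign_labels, weight=1.0):
    """Multi-task objective: retrieval term plus a weighted recognition term."""
    return retrieval_loss(video_emb, text_emb) + weight * recognition_loss(
        sign_logits, sign_labels
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 16))  # stand-in sequence-level video embeddings
    text = rng.normal(size=(4, 16))   # stand-in subtitle text embeddings
    logits = rng.normal(size=(10, 50))
    labels = rng.integers(0, 50, size=10)
    print(joint_loss(video, text, logits, labels))
```

In a sketch like this, both terms backpropagate through the same video encoder, which is one way a retrieval signal and a sign-level signal could inform each other as the abstract describes.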

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Gait Recognition and Analysis

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout · Softmax