Read and Attend: Temporal Localisation in Sign Language Videos

Gül Varol; Liliane Momeni; Samuel Albanie; Triantafyllos Afouras; Andrew Zisserman

arXiv:2103.16481 · cs.CV · March 31, 2021 · 1 cites

TL;DR

This paper trains a Transformer on continuous sign language videos with weakly-aligned subtitles and uses its learned attention to localise and automatically annotate sign instances across a large vocabulary.

Contribution

The paper shows how weakly-aligned subtitles can be leveraged to localise signs, uses the learned attention to automatically generate annotations for a large sign vocabulary, and improves recognition performance on the BSL-1K benchmark.

Findings

01. Successful sign localisation in continuous videos

02. Automatic annotation of a large sign vocabulary

03. Outperforms the previous state of the art on BSL-1K

Abstract

The objective of this work is to annotate sign instances across a broad vocabulary in continuous sign language. We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens on a large-scale collection of signing footage with weakly-aligned subtitles. We show that through this training it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation. Our contributions are as follows: (1) we demonstrate the ability to leverage large quantities of continuous signing videos with weakly-aligned subtitles to localise signs in continuous sign language; (2) we employ the learned attention to automatically generate hundreds of thousands of annotations for a large sign vocabulary; (3) we collect a set of 37K manually verified sign instances across a vocabulary of 950 sign classes to support…
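
The abstract states that the trained model attends to sign instances in the input sequence, which enables their localisation. As a rough illustration only, the sketch below assumes an attention matrix with one row per predicted written token and one column per video time step, and takes the peak of each row as that token's location. The function name `localise_tokens`, the `min_peak` threshold, and the toy values are all hypothetical; the page does not describe the authors' actual localisation procedure.

```python
from typing import List, Sequence, Tuple


def localise_tokens(
    attention: Sequence[Sequence[float]],
    tokens: Sequence[str],
    min_peak: float = 0.5,
) -> List[Tuple[str, int, float]]:
    """Return (token, time_step, peak_weight) for confidently attended tokens.

    attention[i][t] is the (hypothetical) attention weight that output token i
    places on input time step t; each row is assumed to sum to 1.
    """
    if len(attention) != len(tokens):
        raise ValueError("need one attention row per output token")
    localised = []
    for token, row in zip(tokens, attention):
        # The time step with the largest weight is taken as the sign location.
        time_step = max(range(len(row)), key=lambda t: row[t])
        peak = row[time_step]
        # Diffuse attention (low peak) is treated as "no reliable location".
        if peak >= min_peak:
            localised.append((token, time_step, peak))
    return localised


# Toy example: 3 predicted tokens attending over 5 video time steps.
toy_attention = [
    [0.05, 0.80, 0.10, 0.03, 0.02],  # sharply peaked at step 1
    [0.20, 0.20, 0.20, 0.20, 0.20],  # diffuse, so discarded
    [0.02, 0.03, 0.05, 0.20, 0.70],  # peaked at step 4
]
toy_tokens = ["book", "the", "read"]
annotations = localise_tokens(toy_attention, toy_tokens)
```

In this toy setting, thresholding on the peak weight is one simple way to keep only tokens whose attention is concentrated on a short span of the video; a real pipeline would also need to map time steps back to frame ranges.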

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Hearing Impairment and Communication

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Attention Is All You Need · Dropout · Residual Connection · Byte Pair Encoding · Layer Normalization