Aligning Subtitles in Sign Language Videos

Hannah Bull; Triantafyllos Afouras; Gül Varol; Samuel Albanie; Liliane Momeni; Andrew Zisserman

arXiv:2105.02877·cs.CV·May 7, 2021

Open Access

TL;DR

This paper introduces a Transformer-based model that accurately aligns subtitles with sign language videos at the frame level, enabling improved synchronization for sign language translation.

Contribution

We develop a novel Transformer architecture that uses BERT subtitle embeddings and CNN video embeddings to localize entire subtitles in continuous sign language videos, going beyond previous work that only found keyword-sign correspondences.

Findings

01. Significant improvement over baseline alignment methods

02. Effective use of BERT subtitle embeddings and CNN video features to encode the text and video signals

03. Potential to enhance machine translation of sign languages

Abstract

The goal of this work is to temporally align asynchronous subtitles in sign language videos. In particular, we focus on sign-language interpreted TV broadcast data comprising (i) a video of continuous signing, and (ii) subtitles corresponding to the audio content. Previous work exploiting such weakly-aligned data only considered finding keyword-sign correspondences, whereas we aim to localise a complete subtitle text in continuous signing. We propose a Transformer architecture tailored for this task, which we train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video. We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals, which interact through a series of attention layers. Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle…
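To make the notion of frame-level predictions concrete, the following is a minimal sketch of how a per-frame score for one queried subtitle could be turned into a start and end time. It is not the paper's procedure: the threshold of 0.5, the frame rate of 25 fps, the choice of keeping the longest contiguous run, and the helper names `longest_active_run` and `frames_to_seconds` are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def longest_active_run(
    probs: Sequence[float], threshold: float = 0.5
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the longest run of frames whose
    score is >= threshold, with `end` exclusive. Returns None if no frame
    passes the threshold. On ties, the earliest run is kept.

    Hypothetical post-processing step, not taken from the paper.
    """
    best: Optional[Tuple[int, int]] = None
    run_start: Optional[int] = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, p in enumerate(list(probs) + [float("-inf")]):
        if p >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or (i - run_start) > (best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    return best


def frames_to_seconds(span: Tuple[int, int], fps: float = 25.0) -> Tuple[float, float]:
    """Convert a (start, end) frame span to seconds. The fps value is assumed."""
    start, end = span
    return (start / fps, end / fps)


# Made-up per-frame scores for a single queried subtitle.
frame_probs = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.6, 0.1]
span = longest_active_run(frame_probs)
if span is not None:
    print(span, frames_to_seconds(span))
```

Keeping a single contiguous run reflects the assumption that a subtitle corresponds to one uninterrupted stretch of signing; other choices, such as taking the first and last frames above the threshold, would be equally easy to substitute.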

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Subtitles and Audiovisual Media

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Dropout · Softmax · WordPiece