TL;DR
This paper presents a unified learning framework for sign spotting in videos, leveraging multiple supervision sources including subtitles and dictionaries, validated on low-shot benchmarks and supported by a new BSL dictionary dataset.
Contribution
It introduces an integrated approach to sign spotting that combines three types of supervision (sparsely labelled footage, subtitles, and visual sign language dictionaries), and it provides a new BSL dictionary dataset for research.
Findings
The approach is validated on low-shot sign spotting benchmarks.
The unified framework improves sign detection accuracy.
The BSLDict dataset facilitates future research in sign language recognition.
Abstract
The focus of this work is sign spotting: given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video. To achieve this sign spotting task, we train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles (readily available translations of the signed content) which provide additional weak supervision; (3) looking up words (for which no co-articulated labelled examples are available) in visual sign language dictionaries to enable novel sign spotting. These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning. We validate the effectiveness of our approach on low-shot sign spotting benchmarks. In addition, we contribute a…
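The abstract names Noise Contrastive Estimation and Multiple Instance Learning as the principles behind the unified framework. As a rough illustration only, and not the authors' implementation, the sketch below shows one common way to combine the two: a contrastive loss in which a bag of candidate video windows is assumed to contain at least one true match for a dictionary query. The function name `mil_nce_loss`, the embedding shapes, the use of cosine similarity, and the temperature value are all assumptions made for this example.

```python
import numpy as np


def _logsumexp(x):
    # Numerically stable log(sum(exp(x))).
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


def _normalize(v):
    # L2-normalise along the last axis so dot products are cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def mil_nce_loss(query, bag, negatives, temperature=0.07):
    """Hypothetical MIL-NCE-style loss for one query embedding.

    query:     (d,)   embedding of an isolated dictionary sign (assumed)
    bag:       (P, d) candidate windows, at least one assumed to match
    negatives: (N, d) windows assumed not to contain the sign
    """
    q = _normalize(query)
    pos = _normalize(bag) @ q / temperature        # scores of bag members
    neg = _normalize(negatives) @ q / temperature  # scores of negatives
    # -log( sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s)) )
    return _logsumexp(np.concatenate([pos, neg])) - _logsumexp(pos)


# Toy example: the first bag contains a window identical to the query.
q = np.array([1.0, 0.0, 0.0])
good_bag = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
bad_bag = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
negs = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]])

good_loss = mil_nce_loss(q, good_bag, negs)
bad_loss = mil_nce_loss(q, bad_bag, negs)
```

Summing over the bag inside the numerator means the loss is low as long as any one candidate window aligns with the query, which is how weak labels such as a subtitle time span can be used without knowing the exact sign location. In the toy example, `good_loss` is smaller than `bad_loss` because only the first bag contains a window that matches the query.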
