Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture
Samuel Ebimobowei Johnny, Blessed Guda, Emmanuel Enejo Aaron, Assane Gueye

TL;DR
This paper introduces an end-to-end pose-based model for sign language spotting: it detects whether a specific query sign occurs within a continuous signing video, without relying on intermediate gloss recognition or text-based matching, as a first step toward sign language retrieval.
Contribution
The paper presents the first end-to-end pose-based architecture for sign language spotting, bypassing traditional gloss recognition and reducing computational costs.
Findings
Achieved 61.88% accuracy on the Word Presence Prediction dataset.
Demonstrated the effectiveness of pose representations over raw RGB data.
Established a new baseline for sign language retrieval tasks.
Abstract
Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on…
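To make the data flow in the abstract concrete, here is a minimal sketch of the general idea: a query pose sequence and a target pose sequence are joined, passed through an encoder, pooled, and scored by a binary head. This is not the authors' implementation. The single self-attention layer, the zero-valued separator frame, mean pooling, the random untrained weights, and every name below (`spot`, `init_params`, `self_attention`) are assumptions made for illustration, written in plain Python so the example runs without a deep learning library.

```python
import math
import random


def matmul(rows, weight):
    """Multiply a (T x D_in) list of rows by a (D_in x D_out) weight matrix."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in rows]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq, wq, wk, wv):
    """Single-head scaled dot-product self-attention with a residual connection."""
    q, k, v = matmul(seq, wq), matmul(seq, wk), matmul(seq, wv)
    scale = math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        attended = [sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))]
        out.append([x + a for x, a in zip(seq[i], attended)])
    return out


def init_params(dim, seed=0):
    """Random, untrained weights; a real model would learn these from data."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    return {"wq": matrix(), "wk": matrix(), "wv": matrix(),
            "head_w": [rng.uniform(-0.1, 0.1) for _ in range(dim)],
            "head_b": 0.0}


def spot(query, target, params):
    """Return a probability that the query sign occurs in the target sequence."""
    dim = len(query[0])
    sep = [0.0] * dim  # hypothetical separator frame, an assumption of this sketch
    seq = query + [sep] + target
    encoded = self_attention(seq, params["wq"], params["wk"], params["wv"])
    pooled = [sum(col) / len(encoded) for col in zip(*encoded)]  # mean over frames
    logit = sum(p * w for p, w in zip(pooled, params["head_w"])) + params["head_b"]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid for present/absent


# Toy usage: each frame is a flattened keypoint vector, e.g. 3 keypoints x (x, y).
rng = random.Random(1)
dim = 6
query = [[rng.random() for _ in range(dim)] for _ in range(4)]
target = [[rng.random() for _ in range(dim)] for _ in range(10)]
probability = spot(query, target, init_params(dim))
print(f"presence probability (untrained): {probability:.3f}")
```

With untrained weights the printed probability carries no meaning; the sketch only shows the shape of the computation, in which pose keypoints go in and a single presence score comes out, with no gloss or text produced along the way.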
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition
