Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture

Samuel Ebimobowei Johnny; Blessed Guda; Emmanuel Enejo Aaron; Assane Gueye

arXiv:2512.08738 · cs.CV · December 10, 2025

TL;DR

This paper introduces an end-to-end pose-based model for sign language spotting: it detects whether a query sign appears within a continuous sign video without relying on intermediate gloss recognition or text-based matching, a first step toward sign language retrieval.

Contribution

The paper presents the first end-to-end pose-based architecture for sign language spotting, bypassing traditional gloss recognition and reducing computational costs.

Findings

01. Achieved 61.88% accuracy on the Word Presence Prediction dataset.

02. Demonstrated the effectiveness of pose representations over raw RGB data.

03. Established a new baseline for sign language retrieval tasks.

Abstract

Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on…
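The setup described in the abstract (pose keypoints in, an encoder-only backbone, a binary presence/absence head out) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `TinySpottingEncoder`, the single attention layer, the segment embeddings, the mean pooling, the concatenation of query and target frames into one sequence, and the choice of 21 two-dimensional keypoints per frame are all assumptions made for illustration. The weights are random and untrained, and positional encodings are omitted for brevity, so the output probability is meaningless beyond showing the data flow.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TinySpottingEncoder:
    """Hypothetical one-layer, single-head encoder with a binary head."""

    def __init__(self, num_keypoints, d_model=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = num_keypoints * 2  # (x, y) per keypoint, flattened per frame
        scale = 0.1
        self.w_in = rng.normal(0.0, scale, (in_dim, d_model))
        # Assumed segment embeddings: row 0 marks query frames, row 1 target frames.
        self.segment = rng.normal(0.0, scale, (2, d_model))
        self.w_q = rng.normal(0.0, scale, (d_model, d_model))
        self.w_k = rng.normal(0.0, scale, (d_model, d_model))
        self.w_v = rng.normal(0.0, scale, (d_model, d_model))
        self.w_out = rng.normal(0.0, scale, (d_model,))
        self.b_out = 0.0

    def forward(self, query, target):
        # query: (T_q, K, 2) pose frames; target: (T_t, K, 2) pose frames.
        frames = np.concatenate([query, target], axis=0)
        tokens = frames.reshape(frames.shape[0], -1) @ self.w_in
        seg_ids = np.array([0] * len(query) + [1] * len(target))
        tokens = tokens + self.segment[seg_ids]

        # Self-attention over the joint sequence lets query frames
        # attend to target frames and vice versa.
        q, k, v = tokens @ self.w_q, tokens @ self.w_k, tokens @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        hidden = tokens + attn @ v  # residual connection

        # Binary head: mean-pool the frames, then apply a sigmoid to one logit.
        logit = hidden.mean(axis=0) @ self.w_out + self.b_out
        return 1.0 / (1.0 + np.exp(-logit))


# Usage with random stand-in pose data (12 query frames, 60 target frames).
rng = np.random.default_rng(1)
query = rng.normal(size=(12, 21, 2))
target = rng.normal(size=(60, 21, 2))
model = TinySpottingEncoder(num_keypoints=21)
prob = model.forward(query, target)
print("query predicted present:", bool(prob > 0.5))
```

Feeding both sequences through one encoder is only one plausible reading of "end-to-end"; a two-tower design that encodes query and target separately would equally fit the text visible here.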

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition