HandReader: Advanced Techniques for Efficient Fingerspelling Recognition

Pavel Korotaev; Petr Surovtsev; Alexander Kapitanov; Karina Kvanchiani; Aleksandr Nagaev

arXiv:2505.10267·cs.CV·May 16, 2025

TL;DR

HandReader introduces three innovative architectures utilizing RGB and keypoint data with novel modules, achieving state-of-the-art accuracy in fingerspelling recognition across multiple datasets, including a new Russian dataset.

Contribution

The paper presents three new architectures for fingerspelling recognition, including novel modules TSAM and TPE, and demonstrates their effectiveness on multiple datasets, including a new Russian dataset.

Findings

01

Achieved state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets.

02

Demonstrated high performance on the new Znaki Russian fingerspelling dataset.

03

Proposed novel modules TSAM and TPE for improved temporal and spatial feature processing.

Abstract

Fingerspelling is a significant component of Sign Language (SL), allowing the interpretation of proper names, characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE) operated on keypoints as tensors. Such keypoints composition in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial…
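
The abstract names TSAM but this excerpt does not describe its internals. As background only, here is a minimal sketch of the plain temporal shift operation (TSM), which Reviewer 03 identifies as the basis that TSAM adapts to variable-length sequences. The function name `temporal_shift`, the `fold_div` parameter, and the zero padding at sequence boundaries are assumptions for illustration, not the authors' implementation, and the adaptive part of TSAM is not modeled.

```python
from typing import List

Frame = List[float]  # one feature vector (channels) per video frame


def temporal_shift(frames: List[Frame], fold_div: int = 8) -> List[Frame]:
    """Plain temporal shift over a sequence of any length (hypothetical helper).

    The first C // fold_div channels of frame t are taken from frame t + 1,
    the next C // fold_div channels are taken from frame t - 1, and the
    remaining channels stay in place. Missing neighbours are zero-padded.
    """
    if not frames:
        return []
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for ch in range(num_channels):
            if ch < fold:
                src = t + 1  # borrow from the following frame
            elif ch < 2 * fold:
                src = t - 1  # borrow from the preceding frame
            else:
                src = t      # channel is left unshifted
            if 0 <= src < num_frames:
                shifted[t][ch] = frames[src][ch]
    return shifted


# Three frames with four channels each; fold_div=4 shifts one channel each way.
clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
result = temporal_shift(clip, fold_div=4)
# result == [[5, 0.0, 3, 4], [9, 2, 7, 8], [0.0, 6, 11, 12]]
```

Because the loop depends only on `len(frames)`, the same function applies to clips of any length, which is the property the abstract emphasizes for videos of varying lengths.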

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The paper introduces TSAM, a novel module that effectively ha

Weaknesses

1. While the HandReader_KP model is very fast on a CPU (3.9ms, Table 5), the best-performing RGB (51.4

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The authors achieve state-of-the-art results on both ChicagoFSWild and ChicagoFSWild+ benchmarks, which demonstrates the effectiveness of their input processing and pipeline. 2. They also introduce a new dataset, containing a large number of high-resolution videos, which is helpful to the sign language community.

Weaknesses

1. Some sections of the paper are not fully clear or easy to follow. 2. The proposed method mainly leverages existing architectures to solve the task without significant modifications. For the RGB video processing, the extension of the Temporal Shift-Adaptive Module (TSAM) for varying video lengths is a useful addition but is not, in my opinion, a major modification of the existing (TSAM) architecture. Also, organizing key points as tensors and using them as input to a convolutional encoder, to t

Reviewer 03 · Rating 0 · Confidence 3

Strengths

The main strength of the paper seems to be the Znaki dataset for Russian fingerspelling. The collection of the dataset is well documented but it is not clear if the dataset will be made available to researchers. Also another small strength is the engineering behind making the TSM module adaptable to variable length sequences, although the actual gains are minor.

Weaknesses

Overall, the novelty of the paper is very weak. TSAM seems to be a small engineering trick for making TSM adapt to variable-length sequences, and the ablation result in Table 7 of the supplementary shows a very small increase in letter accuracy. Also, we only have letter accuracy as a metric: no CER/WER, word accuracy, S/D/I breakdown, or confusion matrices. Furthermore, another stated contribution is the combination of the keypoints as a three-dimensional tensor, however this has been done mu
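
To make the metrics in this review concrete, the sketch below computes letter accuracy together with the substitution/deletion/insertion (S/D/I) breakdown the reviewer asks for. It assumes the common definition letter accuracy = 1 - (S + D + I) / N, where N is the reference length, so that it equals 1 - CER; the excerpt does not state the paper's exact formula, and the function names `edit_ops` and `letter_accuracy` are hypothetical.

```python
from typing import Dict, Sequence


def edit_ops(ref: Sequence[str], hyp: Sequence[str]) -> Dict[str, int]:
    """Count substitutions, deletions and insertions in a minimal alignment."""
    n, m = len(ref), len(hyp)
    # table[i][j] = (total edits, S, D, I) for aligning ref[:i] with hyp[:j]
    table = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)  # all reference letters deleted
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)  # all hypothesis letters inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, s, d, ins)          # match, no edit
            else:
                diagonal = (cost + 1, s + 1, d, ins)  # substitution
            cost, s, d, ins = table[i - 1][j]
            deletion = (cost + 1, s, d + 1, ins)
            cost, s, d, ins = table[i][j - 1]
            insertion = (cost + 1, s, d, ins + 1)
            # Tuples compare by total edits first, so the minimum is optimal.
            table[i][j] = min(diagonal, deletion, insertion)
    _, s, d, ins = table[n][m]
    return {"S": s, "D": d, "I": ins}


def letter_accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: 1 - (S + D + I) / len(ref)."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    ops = edit_ops(ref, hyp)
    return 1.0 - (ops["S"] + ops["D"] + ops["I"]) / len(ref)


ops = edit_ops("hello", "helo")          # {"S": 0, "D": 1, "I": 0}
acc = letter_accuracy("hello", "helo")   # one edit over five letters
```

Under this definition the score can be negative when the hypothesis contains many insertions, which is one reason a per-operation breakdown is more informative than the single number.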

Code & Models

Repositories

ai-forever/handreader
PyTorch · Official

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Handwritten Text Recognition Techniques

Methods: 3D Convolution · Convolution