HandReader: Advanced Techniques for Efficient Fingerspelling Recognition
Pavel Korotaev, Petr Surovtsev, Alexander Kapitanov, Karina Kvanchiani, Aleksandr Nagaev

TL;DR
HandReader introduces three innovative architectures utilizing RGB and keypoint data with novel modules, achieving state-of-the-art accuracy in fingerspelling recognition across multiple datasets, including a new Russian dataset.
Contribution
The paper presents three new architectures for fingerspelling recognition, including novel modules TSAM and TPE, and demonstrates their effectiveness on multiple datasets, including a new Russian dataset.
Findings
Achieved state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets.
Demonstrated high performance on the new Znaki Russian fingerspelling dataset.
Proposed novel modules TSAM and TPE for improved temporal and spatial feature processing.
Abstract
Fingerspelling is a significant component of Sign Language (SL) that allows the interpretation of proper names and is characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader is built on the proposed Temporal Pose Encoder (TPE), which operates on keypoints arranged as tensors. Composing keypoints this way in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial…
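The abstract does not spell out how TSAM works, so the following is only a minimal sketch of the plain temporal shift operation (as in the earlier Temporal Shift Module, TSM) that one reviewer says TSAM adapts to variable-length sequences. It is not the authors' module. The function name `temporal_shift`, the `fold_div` parameter, and the `(T, C)` feature layout are assumptions made for illustration; the point is that the shift is defined per sequence and works for any number of frames `T`.

```python
import numpy as np


def temporal_shift(features: np.ndarray, fold_div: int = 8) -> np.ndarray:
    """Shift a fraction of channels along time for one sequence.

    features: array of shape (T, C), one feature vector per frame.
    fold_div: hypothetical parameter; C // fold_div channels move in
              each direction, the remaining channels stay in place.
    """
    num_frames, num_channels = features.shape
    fold = num_channels // fold_div
    shifted = np.zeros_like(features)

    # First block of channels: each frame receives values from the next frame.
    shifted[:-1, :fold] = features[1:, :fold]
    # Second block: each frame receives values from the previous frame.
    shifted[1:, fold:2 * fold] = features[:-1, fold:2 * fold]
    # Remaining channels are copied unchanged.
    shifted[:, 2 * fold:] = features[:, 2 * fold:]
    return shifted


# The same function handles clips of different lengths without padding.
short_clip = np.random.rand(3, 16)
long_clip = np.random.rand(40, 16)
print(temporal_shift(short_clip).shape, temporal_shift(long_clip).shape)
```

Positions with no neighbouring frame (the last frame for the first block, the first frame for the second) are left at zero here; how the paper treats those boundaries is not stated in the text above.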
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper introduces TSAM, a novel module that effectively ha
1. While the HandReader_KP model is very fast on a CPU (3.9 ms, Table 5), the best-performing RGB (51.4
1. The authors achieve state-of-the-art results on both ChicagoFSWild and ChicagoFSWild+ benchmarks, which demonstrates the effectiveness of their input processing and pipeline. 2. They also introduce a new dataset, containing a large number of high-resolution videos, which is helpful to the sign language community.
1. Some sections of the paper are not fully clear and easy to follow. 2. The proposed method mainly leverages existing architectures to solve the task without significant modifications. For the RGB video processing, the extension of the Temporal Shift-Adaptive Module (TSAM) to varying video lengths is a useful addition but is not, in my opinion, a major modification of the existing (TSAM) architecture. Also, organizing key points as tensors and using them as input to a convolutional encoder, to t
The main strength of the paper seems to be the Znaki dataset for Russian fingerspelling. The collection of the dataset is well documented but it is not clear if the dataset will be made available to researchers. Also another small strength is the engineering behind making the TSM module adaptable to variable length sequences, although the actual gains are minor.
Overall, the novelty of the paper is very weak. TSAM seems to be a small engineering trick for making TSM adapt to variable-length sequences, and the ablation result in Table 7 of the supplementary shows only a very small increase in letter accuracy. Also, letter accuracy is the only metric reported: there is no CER/WER, word accuracy, S/D/I breakdown, or confusion matrices. Furthermore, another stated contribution is the combination of the keypoints as a three-dimensional tensor; however, this has been done mu
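For readers unfamiliar with the metrics this review mentions, here is a minimal sketch of letter accuracy with a substitution/deletion/insertion (S/D/I) breakdown. It assumes the definition commonly used in fingerspelling work, 1 - (S + D + I) / N with N the reference length; the page above does not state the paper's exact formula, and the function names `edit_counts` and `letter_accuracy` are hypothetical.

```python
def edit_counts(reference: str, hypothesis: str) -> tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one minimum-cost alignment."""
    n, m = len(reference), len(hypothesis)
    # table[i][j] = (cost, S, D, I) for reference[:i] versus hypothesis[:j]
    table = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost, s, d, ins = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (cost, s, d, ins)          # match, no extra cost
            else:
                diagonal = (cost + 1, s + 1, d, ins)  # substitution
            cost, s, d, ins = table[i - 1][j]
            deletion = (cost + 1, s, d + 1, ins)
            cost, s, d, ins = table[i][j - 1]
            insertion = (cost + 1, s, d, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, s, d, ins = table[n][m]
    return s, d, ins


def letter_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - (S + D + I) / len(reference)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    s, d, ins = edit_counts(reference, hypothesis)
    return 1.0 - (s + d + ins) / len(reference)


print(edit_counts("cat", "cut"), letter_accuracy("cat", "cut"))
```

Under this definition the character error rate is simply 1 minus letter accuracy, and the score can drop below zero when a prediction contains many insertions. When several alignments share the minimum cost, the S/D/I split returned here is just one of them.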
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Handwritten Text Recognition Techniques
Methods: 3D Convolution · Convolution
