Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

Jubayer Ahmed Bhuiyan Shawon; Hasan Mahmud; Kamrul Hasan

arXiv:2506.04367·cs.CV·June 6, 2025

TL;DR

This paper fine-tunes and compares three video transformer models (VideoMAE, ViViT, and TimeSformer) for word-level Bangla Sign Language recognition on a small dataset (BdSLW60) and a large one (BdSLW401), and reports higher accuracy than traditional methods, with implications for accessibility.

Contribution

The paper introduces a benchmark of video transformer architectures for Bangla Sign Language recognition, covering dataset standardization, augmentation, and signer-independent evaluation.

Findings

01 Video transformers outperform traditional methods in accuracy.

02 VideoMAE achieved 95.5% accuracy on BdSLW60.

03 Models show robustness across datasets and signer variations.

Abstract

Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures -- VideoMAE, ViViT, and TimeSformer -- on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such…
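The abstract says the videos were standardized to 30 FPS but, in the portion shown here, does not say how. As a rough illustration only, one common approach is to resample frames by timestamp: for each output frame at the target rate, pick the source frame that is on screen at that moment. The function name `resample_indices` and its parameters are hypothetical, not taken from the paper or its code.

```python
def resample_indices(num_frames: int, src_fps: float, dst_fps: float = 30.0) -> list[int]:
    """Return source-frame indices that re-time a clip from src_fps to dst_fps.

    Hypothetical helper (an assumption, not the authors' pipeline): each output
    frame i has timestamp i / dst_fps, and we select the source frame being
    displayed at that timestamp. Frames are dropped when src_fps > dst_fps and
    repeated when src_fps < dst_fps.
    """
    if num_frames <= 0 or src_fps <= 0 or dst_fps <= 0:
        raise ValueError("num_frames, src_fps and dst_fps must all be positive")

    duration_s = num_frames / src_fps                 # clip length in seconds
    out_count = max(1, round(duration_s * dst_fps))   # frames needed at dst_fps

    # Floor of the scaled index; clamp so we never read past the last frame.
    return [min(num_frames - 1, int(i * src_fps / dst_fps)) for i in range(out_count)]


if __name__ == "__main__":
    # A 2-second clip recorded at 60 FPS keeps every second frame.
    idx = resample_indices(num_frames=120, src_fps=60.0)
    print(len(idx), idx[:5])
```

The returned indices can then be used to select frames from a decoded clip with whatever video reader is at hand; the clip duration is preserved, only the frame count changes.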

Code & Models

Models

🤗 Shawon16/VideoMAE_BdSLW401_20_epochs_p5_SR_10
model · 4 dl

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Interactive and Immersive Displays

Methods: TimeSformer