Exploring Attention Mechanisms in Integration of Multi-Modal Information for Sign Language Recognition and Translation
Zaber Ibn Abdul Hakim, Rasman Mubtasim Swargo, Muhammad Abdullah Adnan

TL;DR
This paper introduces a lightweight cross-attention-based plugin module for multi-modal sign language recognition and translation. It improves accuracy while keeping computation low, using a two-stage training process.
Contribution
The work proposes a novel, lightweight cross-attention plugin for multi-modal integration in sign language tasks, enabling end-to-end training and reducing computational complexity.
Findings
Reduced word error rate (WER) by 0.9 on the recognition task
Increased BLEU-4 score by 0.8 on the translation task
Efficient multi-modal feature merging with minimal overhead
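For readers unfamiliar with the recognition metric, WER is conventionally the word-level edit distance between the predicted and reference sequences, divided by the reference length. The sketch below is a generic illustration of that definition, not the paper's evaluation script; the function name `word_error_rate` and the whitespace tokenization are assumptions made for the example. This summary does not state the unit of the reported 0.9 reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference tokens.

    Illustrative only: tokens are split on whitespace (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-token reference.
example_wer = word_error_rate("a b c d", "a x c")
```

Lower is better for WER, whereas BLEU-4 is a higher-is-better n-gram overlap score, so the two findings above move in opposite numeric directions.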
Abstract
Understanding the intricate, fast-paced movements of body parts is essential for recognizing and translating sign language. Incorporating additional information that identifies and locates the moving body parts has recently been an active research topic. However, previous work on multi-modal information raises concerns such as sub-optimal multi-modal feature merging or models that are too computationally heavy. In our work, we address these issues with a plugin module based on cross-attention that lets each modality attend to the other. Moreover, we use 2-stage training to remove the dependency on separate feature extractors for the additional modalities in an end-to-end approach, which reduces the concern about computational complexity. In addition, our cross-attention plugin module is very lightweight, which doesn't add…
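To make the idea of one modality attending to another concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in NumPy. It is a generic textbook formulation under stated assumptions, not the authors' module: the two streams (called `rgb` and `pose` here), all dimensions, and the random projection matrices are hypothetical, and the paper's actual modalities and architecture are not given in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Queries come from one modality; keys and values from the other."""
    q = query_feats @ w_q      # (T_q, d)
    k = context_feats @ w_k    # (T_c, d)
    v = context_feats @ w_v    # (T_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T_q, T_c)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


# Hypothetical shapes: 8 time steps, two feature streams of different width.
rng = np.random.default_rng(0)
T, d_rgb, d_pose, d = 8, 16, 6, 4
rgb = rng.normal(size=(T, d_rgb))    # assumed primary stream
pose = rng.normal(size=(T, d_pose))  # assumed auxiliary stream
w_q = rng.normal(size=(d_rgb, d))
w_k = rng.normal(size=(d_pose, d))
w_v = rng.normal(size=(d_pose, d))

attended, weights = cross_attention(rgb, pose, w_q, w_k, w_v)
```

In this sketch the only added parameters are the three projection matrices, which gives a rough sense of why such a module can be small relative to a full feature extractor. Swapping the roles of the two streams gives attention in the other direction.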
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Tactile and Sensory Interactions
