Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Jungeun Kim, Hyeongwoo Jeon, Jongseong Bae, Ha Young Kim

TL;DR
This paper introduces MMSLT, a gloss-free sign language translation framework that uses off-the-shelf multimodal large language models to generate detailed textual descriptions of sign language components, then aligns those descriptions and the sign video features with the spoken sentence space, achieving state-of-the-art results.
Contribution
The paper presents a new gloss-free SLT approach using MLLMs for detailed description generation and multimodal pre-training, advancing sign language translation without relying on gloss annotations.
Findings
Achieves state-of-the-art performance on the PHOENIX14T and CSL-Daily datasets.
Bridges the modality gap between sign language video and spoken language through multimodal-language pre-training.
Demonstrates the potential of MLLMs for sign language translation tasks.
Abstract
Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we use MLLMs to generate detailed textual descriptions of sign language components. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets…
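The alignment step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the element-wise mean used for fusion, the symmetric contrastive (InfoNCE-style) objective, the temperature value, and every function name below are assumptions chosen only to show what "aligning fused features within the spoken sentence space" could look like in code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def fuse(video_feat, desc_feat):
    """Hypothetical fusion: element-wise mean of video and description features."""
    return [(v + d) / 2.0 for v, d in zip(video_feat, desc_feat)]


def _mean_cross_entropy(logit_rows):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logit_rows):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logit_rows)


def contrastive_alignment_loss(fused_batch, sentence_batch, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed objective, not the paper's).

    Pair i of fused_batch and sentence_batch is treated as the positive;
    all other pairings in the batch act as negatives.
    """
    fused = [l2_normalize(v) for v in fused_batch]
    sents = [l2_normalize(v) for v in sentence_batch]
    logits = [
        [sum(a * b for a, b in zip(f, s)) / temperature for s in sents]
        for f in fused
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))


# Toy batch: two clips whose fused features match their own sentence embedding.
video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
descriptions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sentences = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

fused = [fuse(v, d) for v, d in zip(video, descriptions)]
aligned_loss = contrastive_alignment_loss(fused, sentences)
shuffled_loss = contrastive_alignment_loss(fused, sentences[::-1])
```

In this toy batch the loss is lower when each fused feature is paired with its own sentence than when the sentences are swapped, which is the behavior an alignment objective of this kind is meant to reward.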
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Natural Language Processing Techniques
Methods: ALIGN
