Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

Jungeun Kim; Hyeongwoo Jeon; Jongseong Bae; Ha Young Kim

arXiv:2411.16789 · cs.CV · August 26, 2025 · 2 cites

Open Access

TL;DR

This paper introduces MMSLT, a novel gloss-free sign language translation framework that leverages multimodal large language models to generate detailed descriptions and align sign language videos with spoken language, achieving state-of-the-art results.

Contribution

The paper presents a new gloss-free SLT approach using MLLMs for detailed description generation and multimodal pre-training, advancing sign language translation without relying on gloss annotations.

Findings

01. Achieves state-of-the-art performance on the PHOENIX14T and CSL-Daily datasets.

02. Bridges the modality gap through multimodal-language pre-training.

03. Demonstrates the potential of MLLMs for sign language translation.

Abstract

Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we use MLLMs to generate detailed textual descriptions of sign language components. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets…
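The alignment idea in the abstract (combining description features with sign video features and pulling the result toward the spoken sentence representation) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `fuse` and `alignment_loss`, the mean-pooling fusion, the shared feature dimension, and the InfoNCE-style contrastive loss are stand-ins, and the paper's actual pre-training module may differ substantially.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def l2_normalize(v):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def fuse(video_feats, desc_feats):
    """Hypothetical fusion: pool each stream, then average the pooled vectors.

    Assumes video and description features share one dimension.
    """
    v = mean_pool(video_feats)
    d = mean_pool(desc_feats)
    return l2_normalize([(a + b) / 2 for a, b in zip(v, d)])

def alignment_loss(fused_batch, sentence_batch, temperature=0.07):
    """InfoNCE-style loss (an assumed objective, not the paper's):
    each fused vector should score highest against its own sentence."""
    sents = [l2_normalize(s) for s in sentence_batch]
    total = 0.0
    for i, f in enumerate(fused_batch):
        logits = [sum(a * b for a, b in zip(f, s)) / temperature for s in sents]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(fused_batch)

# Toy 2-D features: sample A points along the x axis, sample B along the y axis.
fused_a = fuse([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0]])
fused_b = fuse([[0.0, 1.0]], [[0.0, 1.0]])

matched = alignment_loss([fused_a, fused_b], [[1.0, 0.0], [0.0, 1.0]])
swapped = alignment_loss([fused_a, fused_b], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is lower when each fused vector is paired with its own sentence embedding (`matched`) than when the pairing is swapped (`swapped`), which is the behavior an alignment objective of this kind is meant to reward.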

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Natural Language Processing Techniques

Methods: ALIGN