RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation
Zhi Rao, Yucheng Zhou, Benjia Zhou, Yiqing Huang, Sergio Escalera, Jun Wan

TL;DR
RVLF introduces a novel vision-language framework for gloss-free sign language translation, combining semantic representation learning with reinforcement learning to improve translation accuracy and semantic consistency without external large-scale datasets.
Contribution
The paper proposes the three-stage RVLF framework, which fuses visual and skeletal cues and applies GRPO-based reinforcement learning, presented as its first use in sign language translation.
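To make the idea of fusing visual and skeletal cues concrete, here is a minimal sketch of one common fusion scheme: per-frame concatenation followed by a linear projection. The summary does not say how RVLF actually combines the two streams, so the concatenation step, the function names `fuse_frame` and `fuse_sequence`, and the toy dimensions are all assumptions made for illustration.

```python
# Hypothetical sketch: concatenation-based fusion of two per-frame feature streams.
# This is NOT confirmed to be RVLF's mechanism; it only illustrates the general idea.

def fuse_frame(visual, skeleton, weight, bias):
    """Concatenate one frame's visual and skeleton features, then project linearly."""
    joint = list(visual) + list(skeleton)
    if len(weight) != len(bias):
        raise ValueError("weight and bias must have the same number of output rows")
    if any(len(row) != len(joint) for row in weight):
        raise ValueError("each weight row must match the concatenated feature size")
    return [sum(w * x for w, x in zip(row, joint)) + b
            for row, b in zip(weight, bias)]


def fuse_sequence(visual_seq, skeleton_seq, weight, bias):
    """Apply the same fusion to every frame of a temporally aligned clip."""
    if len(visual_seq) != len(skeleton_seq):
        raise ValueError("both streams must have the same number of frames")
    return [fuse_frame(v, s, weight, bias)
            for v, s in zip(visual_seq, skeleton_seq)]


# Toy example: 2-d visual features, 1-d skeleton features, 2-d fused output.
toy_weight = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]]
toy_bias = [0.0, 0.5]
fused = fuse_sequence([[1.0, 2.0]], [[3.0]], toy_weight, toy_bias)
```

In a real system the visual vectors would come from an image encoder such as DINOv2 and the projection would be learned; the hand-written weights here only keep the example self-contained.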
Findings
Significant BLEU-4 score improvements across multiple datasets.
Effective semantic representation learning combining skeleton cues and visual features.
GRPO-based optimization enhances translation quality and semantic alignment.
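For readers unfamiliar with GRPO (Group Relative Policy Optimization), its central step is to score a group of sampled outputs for the same input and normalize each reward against the group's mean and standard deviation, which removes the need for a separate value network. The sketch below shows only that normalization. The reward values, the `eps` constant, and the choice of population standard deviation are assumptions, and the reward RVLF actually uses is not described in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidates sampled for the same input.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption; implementations may use sample std instead.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate translations of the same sign video.
toy_rewards = [0.2, 0.4, 0.4, 0.6]
advantages = group_relative_advantages(toy_rewards)
```

Candidates scoring above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. When every candidate gets the same reward, all advantages are zero and the group contributes no learning signal.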
Abstract
Gloss-free sign language translation (SLT) is hindered by two key challenges: **inadequate sign representation** that fails to capture nuanced visual cues, and **sentence-level semantic misalignment** in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage **r**einforcing **v**ision-**l**anguage **f**ramework (**RVLF**). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, for a sufficient representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to improve sentence-level…
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Hearing Impairment and Communication
