RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation

Zhi Rao; Yucheng Zhou; Benjia Zhou; Yiqing Huang; Sergio Escalera; Jun Wan

arXiv:2512.07273·cs.CV·December 9, 2025

Open Access

TL;DR

RVLF introduces a novel vision-language framework for gloss-free sign language translation, combining semantic representation learning with reinforcement learning to improve translation accuracy and semantic consistency without external large-scale datasets.

Contribution

The paper proposes RVLF, a three-stage framework that fuses visual and skeletal cues and applies GRPO-based reinforcement learning to sign language translation for the first time.

Findings

01. Significant BLEU-4 score improvements across multiple datasets.

02. Effective semantic representation learning combining skeleton cues and visual features.

03. GRPO-based optimization enhances translation quality and semantic alignment.
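
For readers unfamiliar with GRPO (Group Relative Policy Optimization), the core idea is that several candidate outputs are sampled for the same input and each one is scored relative to the others in its group, so no separate value network is needed. The sketch below shows only that group-relative normalization step. It is a minimal illustration and not the authors' implementation: the function name, the example reward values, and the choice of population standard deviation are all assumptions, and this page does not state which reward function RVLF uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar score per translation sampled for the same
    sign video. The scoring function itself is outside this sketch.
    """
    if not rewards:
        raise ValueError("need at least one sampled translation")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four translations sampled for one sign video
rewards = [0.10, 0.30, 0.30, 0.50]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Translations that score above the group mean receive a positive advantage and are reinforced, while those below the mean receive a negative one. When all samples score identically, every advantage is zero and the group contributes no learning signal.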

Abstract

Gloss-free sign language translation (SLT) is hindered by two key challenges: **inadequate sign representation** that fails to capture nuanced visual cues, and **sentence-level semantic misalignment** in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage **r**einforcing **v**ision-**l**anguage **f**ramework (**RVLF**). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, for a sufficient representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to improve sentence-level…

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Hearing Impairment and Communication