Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment
Rui Zhao, Liang Zhang, Biao Fu, Cong Hu, Jinsong Su, Yidong Chen

TL;DR
This paper introduces a novel Conditional Variational Autoencoder framework for sign language translation that improves cross-modal alignment between sign videos and text, achieving state-of-the-art results without relying on intermediate gloss representations.
Contribution
The proposed CV-SLT model aligns sign language videos directly with spoken language text, using two KL divergences and a shared attention residual Gaussian distribution rather than an intermediate gloss representation.
Findings
Achieves new state-of-the-art results on the PHOENIX14T and CSL-Daily datasets.
Effectively alleviates cross-modal representation discrepancy.
Demonstrates the benefit of direct alignment over intermediate gloss reliance.
Abstract
Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multi-modal task, SLT faces an inherent modality gap between sign language videos and spoken language text, which makes cross-modal alignment between the visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to alleviate the cross-modal problem, thereby neglecting alignment across modalities, which may lead to compromised results. To address this issue, we propose a novel framework based on a Conditional Variational Autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, our CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences to regularize the outputs of the encoder and…
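The abstract refers to KL divergences used as regularizers. As a rough illustration only, the sketch below computes the closed-form KL divergence between two diagonal Gaussians, the kind of term a conditional variational autoencoder commonly places between a posterior and a prior distribution. It is not the authors' implementation: the function name kl_diag_gaussians and the mean/log-variance parameterization are assumptions made for this example, and it does not show how CV-SLT combines its two KL terms.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Return KL(q || p) for two diagonal Gaussians.

    Hypothetical helper for illustration; each argument is a list of
    per-dimension means or log-variances of equal length.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Per-dimension closed form:
        # 0.5 * (log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1)
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


# Identical distributions have zero divergence.
same = kl_diag_gaussians([0.0, 1.0], [0.0, 0.5], [0.0, 1.0], [0.0, 0.5])

# Shifting the mean of a unit-variance Gaussian by 1 gives 0.5.
shifted = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

In a training loop such a term would usually be added to the reconstruction loss with a weighting coefficient; the weighting used by CV-SLT is not stated in the text above.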
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition
