Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment

Rui Zhao; Liang Zhang; Biao Fu; Cong Hu; Jinsong Su; Yidong Chen

arXiv:2312.15645 · cs.CL · December 27, 2023 · 1 citation

TL;DR

This paper introduces a novel Conditional Variational Autoencoder framework for sign language translation that improves cross-modal alignment between sign videos and text, achieving state-of-the-art results without relying on intermediate gloss representations.

Contribution

The proposed CV-SLT model aligns sign language videos and spoken language text directly, using two KL divergences and a shared attention residual Gaussian distribution.

Findings

01

Achieves new state-of-the-art results on the PHOENIX14T and CSL-Daily datasets.

02

Effectively alleviates cross-modal representation discrepancy.

03

Demonstrates the benefit of direct alignment over intermediate gloss reliance.

Abstract

Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multi-modal task, there exists an inherent modality gap between sign language videos and spoken language text, which makes the cross-modal alignment between visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to alleviate the cross-modal problem, thereby neglecting alignment across modalities, which may lead to compromised results. To address this issue, we propose a novel framework based on Conditional Variational Autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, our CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences to regularize the outputs of the encoder and…
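The abstract is cut off before it says which distributions the two KL divergences compare, so the following is only a generic sketch of the building block such a regularizer typically relies on: the closed-form KL divergence between two diagonal Gaussians. The function name `kl_diag_gaussians` and the log-variance parameterization are assumptions made for illustration; they are not taken from the paper or its repository.

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians q = N(mu_q, exp(logvar_q)) and
    p = N(mu_p, exp(logvar_p)), summed over dimensions.

    Hypothetical helper: inputs are plain lists of floats of equal length.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Per-dimension closed form:
        # 0.5 * (log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1)
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

# Identical distributions have zero divergence.
same = kl_diag_gaussians([0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0])

# Unit-variance Gaussians whose means differ by 1 in a single dimension.
shifted = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

In a conditional VAE this quantity is usually added to the training loss so that one distribution is pulled towards another; how CV-SLT weights and combines its two KL terms is not recoverable from the excerpt above.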

Code & Models

Repositories

rzhao-zhsq/cv-slt (PyTorch, official)

Videos

Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment · Underline

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition