Unsupervised Sign Language Translation and Generation
Zhengsheng Guo, Zhiwei He, Wenxiang Jiao, Xing Wang, Rui Wang, Kehai Chen, Zhaopeng Tu, Yong Xu, Min Zhang

TL;DR
This paper introduces USLNet, an unsupervised neural network that translates sign language video into text and generates sign language video from text without parallel data, using cross-modality back-translation and a sliding window for alignment.
Contribution
USLNet is the first model to perform unsupervised sign language translation and generation for both text and sign language video in a unified framework.
Findings
Achieves competitive results on BOBSL and OpenASL datasets.
Effectively handles cross-modality feature discrepancies.
Demonstrates potential for unsupervised sign language applications.
Abstract
Motivated by the success of unsupervised neural machine translation (UNMT), we introduce an unsupervised sign language translation and generation network (USLNet), which learns from abundant single-modality (text and video) data without parallel sign language data. USLNet comprises two main components: single-modality reconstruction modules (text and video) that rebuild the input from its noisy version in the same modality, and cross-modality back-translation modules (text-video-text and video-text-video) that reconstruct the input from its noisy version in the other modality using a back-translation procedure. Unlike the single-modality back-translation procedure in text-based UNMT, USLNet faces a cross-modality discrepancy in feature representation: the length and the feature dimension of text and video sequences do not match. We propose a sliding window method to address…
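To make the length and feature-dimension mismatch concrete, here is a minimal Python sketch of one way a sliding window could map a short text sequence onto a longer, video-length sequence. It is an illustration under assumptions, not USLNet's actual aligner: the function names (`window_weights`, `align_length`, `project`), the uniform weighting inside each window, and the fixed projection matrix are all hypothetical and do not come from the paper.

```python
from typing import List

Vector = List[float]


def window_weights(src_len: int, tgt_len: int, width: int = 1) -> List[List[float]]:
    """Return one weight row per target position (hypothetical helper).

    Each row spreads uniform weight over a window of source positions
    centred on the proportional location of the target position.
    """
    if src_len <= 0 or tgt_len <= 0 or width < 0:
        raise ValueError("lengths must be positive and width non-negative")
    rows = []
    for j in range(tgt_len):
        center = (j * src_len) // tgt_len      # proportional source index
        lo = max(0, center - width)            # clip window at the edges
        hi = min(src_len - 1, center + width)
        row = [0.0] * src_len
        for i in range(lo, hi + 1):
            row[i] = 1.0 / (hi - lo + 1)       # uniform weights (assumption)
        rows.append(row)
    return rows


def align_length(src: List[Vector], tgt_len: int, width: int = 1) -> List[Vector]:
    """Resample `src` to `tgt_len` steps via window-weighted averages."""
    weights = window_weights(len(src), tgt_len, width)
    dim = len(src[0])
    return [
        [sum(w[i] * src[i][d] for i in range(len(src))) for d in range(dim)]
        for w in weights
    ]


def project(seq: List[Vector], matrix: List[Vector]) -> List[Vector]:
    """Change the feature dimension with an (out_dim x in_dim) matrix."""
    return [[sum(m * x for m, x in zip(row, vec)) for row in matrix] for vec in seq]


# Usage: 3 "text" vectors of dimension 2 become 6 "video" steps of dimension 3.
text_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
stretched = align_length(text_features, tgt_len=6, width=0)
video_like = project(stretched, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In a trained model the projection would be learned and the window weights might be soft rather than uniform; this sketch only shows how length and feature dimension can be aligned in two separate steps.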
Peer Reviews
Decision: Submitted to ICLR 2024
The overall writing quality is good, although there are some issues. The method is unsupervised, which is important in this area because annotation requires experts. Also, taking inspiration from unsupervised machine translation and applying the idea to another domain is the originality of the method. The proposed methods are supported by detailed formulation and figures.
Discussion of existing text-to-video aligner algorithms is not sufficient. For example, although text2video[1] is a text-based talking face generation model, it uses an aligner for phoneme-to-pose. Back-translation seems highly similar to the reconstruction loss used in image generation, especially in unpaired I2I tasks for cycle consistency, so you might consider elaborating on this in the manuscript. There are no visual results in the manuscript and limited visual results on the
* Developing unsupervised approaches for SL generation/translation is important, especially given the many different representations used for signing. One could imagine fine-tuning this approach for any given representation (e.g., Glosses, HamNoSys).
* There are reasonable comparisons to supervised approaches.
* The ablations/sensitivity analysis comparing this approach with different aspects turned off is interesting.
* Given the lack of work in this area, it was valuable to see comparison
Overall the results (e.g., Table 1 & 2) are seemingly very poor. This is by no means a reason to reject a paper, but it does in my opinion require the authors to dig deep into 'why' the results are poor and to work towards building an understanding for how they can be improved significantly. It is nice to see that some results are better than the supervised baseline from Albanie et al., but in an absolute sense they are still low. Are there oracle experiments that could be run? How can the probl
To the best of my knowledge this is the first bi-directional (translation/generation) SL approach that is trained in an unsupervised manner. Although the results are not promising, the proposed method is sound, and further studying the unsupervised training approach might yield promising results.
Although I like the idea of using pretrained large-scale models and unsupervised learning, I'd expect quantitative results to back up the benefits of employing these ideas. Sadly, the presented results do not suggest that the presented approach is "working" (~0.2 BLEU-4 score on BOBSL, while the state of the art is above 2: https://openaccess.thecvf.com/content/ICCV2023W/ACVR/papers/Sincan_Is_Context_all_you_Need_Scaling_Neural_Sign_Language_Translation_ICCVW_2023_paper.pdf). That being said, th
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition
