Boosting Continuous Sign Language Recognition via Cross Modality Augmentation
Junfu Pu, Wengang Zhou, Hezhen Hu, Houqiang Li

TL;DR
This paper introduces a cross modality augmentation framework for continuous sign language recognition that improves alignment between video and text, leading to better performance on benchmark datasets.
Contribution
It proposes a novel augmentation method that simulates the WER edit operations (substitution, deletion, and insertion) on video-text pairs, together with multiple loss terms, to enhance cross-modal learning in CTC-based SLR models.
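To make the augmentation idea concrete, the following is a minimal sketch of applying one edit operation to a gloss sequence and its video at the same time. It is an illustration under stated assumptions, not the authors' implementation: the function name `apply_edit`, the representation of clips as lists of frame indices, and the premise that the video is already split into one segment per gloss are all hypothetical. This page does not describe how the paper obtains such segments or chooses replacement clips.

```python
from typing import List, Optional, Tuple

Clip = List[int]  # assumption: a clip is a list of frame indices, not pixel data


def apply_edit(
    glosses: List[str],
    segments: List[Clip],
    op: str,
    pos: int,
    new_gloss: Optional[str] = None,
    new_clip: Optional[Clip] = None,
) -> Tuple[List[str], List[Clip]]:
    """Return a pseudo video-text pair made by one edit on both modalities.

    Assumes segments[i] is the clip aligned with glosses[i] (hypothetical).
    """
    if len(glosses) != len(segments):
        raise ValueError("need exactly one video segment per gloss")

    # Work on copies so the real pair is left untouched.
    out_glosses = list(glosses)
    out_segments = [list(clip) for clip in segments]

    if op == "deletion":
        # Drop the gloss and its clip together.
        del out_glosses[pos]
        del out_segments[pos]
    elif op in ("substitution", "insertion"):
        if new_gloss is None or new_clip is None:
            raise ValueError(f"{op} needs new_gloss and new_clip")
        if op == "substitution":
            # Replace the gloss and its clip in place.
            out_glosses[pos] = new_gloss
            out_segments[pos] = list(new_clip)
        else:
            # Insert a new gloss and clip before position pos.
            out_glosses.insert(pos, new_gloss)
            out_segments.insert(pos, list(new_clip))
    else:
        raise ValueError(f"unknown edit operation: {op}")

    return out_glosses, out_segments


# Toy data: gloss names and frame indices are made up for illustration.
real_glosses = ["A", "B", "C"]
real_segments = [[0, 1], [2, 3, 4], [5]]
pseudo_glosses, pseudo_segments = apply_edit(real_glosses, real_segments, "deletion", 1)
```

Because the same operation and position are applied to both modalities, the generated pair stays consistent by construction, and its edit distance to the real label is known in advance (one edit here).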
Findings
Significant performance improvements on RWTH-PHOENIX-Weather and CSL datasets.
Effective reduction in word error rate through cross modality augmentation.
Framework is adaptable to existing CTC-based SLR architectures.
Abstract
Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as the main evaluation metric. Since it is not differentiable, we usually instead optimize the learning model with the connectionist temporal classification (CTC) objective loss, which maximizes the posterior probability over the sequential alignment. Due to the optimization gap, the predicted sentence with the highest decoding probability may not be the best choice under the WER metric. To tackle this issue, we propose a novel architecture with cross modality augmentation. Specifically, we first augment cross-modal data by simulating the calculation procedure of WER, i.e., substitution, deletion and insertion on both the text label and its corresponding video. With these real and generated pseudo video-text pairs, we propose multiple loss terms to…
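As a reference for the metric the abstract mentions, here is a minimal sketch of WER computed as the word-level edit distance divided by the reference length. This is the standard Levenshtein dynamic program rather than code from the paper; the function names and the gloss sequences in the usage lines are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    # prev[j] is the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of ref_word
                curr[j - 1] + 1,     # insertion of hyp_word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Word error rate: edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical gloss sequences: one substitution plus one insertion.
reference = ["MORGEN", "REGEN", "NORD"]
hypothesis = ["MORGEN", "SONNE", "NORD", "WIND"]
errors = edit_distance(reference, hypothesis)
rate = wer(reference, hypothesis)
```

The `min` over discrete edit choices is what makes this quantity non-differentiable with respect to model outputs, which is why training relies on a surrogate such as CTC.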
Taxonomy
Methods
Circular Smooth Label
