Linguistics-Vision Monotonic Consistent Network for Sign Language Production

Xu Wang; Shengeng Tang; Peipei Song; Shuo Wang; Dan Guo; Richang Hong

arXiv:2412.16944·cs.CV·December 24, 2024

Open Access

TL;DR

This paper introduces a Transformer-based network that enhances sign language production by ensuring better cross-modal alignment and semantic consistency between sign glosses and videos, addressing key challenges in linguistics-vision integration.

Contribution

The proposed LVMCN model incorporates novel cross-modal semantic alignment and multimodal semantic comparison mechanisms to improve sign language video generation accuracy.

Findings

01 LVMCN outperforms previous methods on the PHOENIX14T benchmark.

02 Improved cross-modal alignment accuracy demonstrated.

03 Enhanced semantic consistency between glosses and videos.

Abstract

Sign Language Production (SLP) aims to generate sign videos corresponding to spoken language sentences, where the conversion of sign Glosses to Poses (G2P) is the key step. Because of the cross-modal semantic gap and the lack of word-action correspondence labels for strongly supervised alignment, SLP faces major challenges in linguistics-vision consistency. In this work, we propose a Transformer-based Linguistics-Vision Monotonic Consistent Network (LVMCN) for SLP, which constrains fine-grained cross-modal monotonic alignment and coarse-grained multimodal semantic consistency in language-visual cues through a Cross-modal Semantic Aligner (CSA) and a Multimodal Semantic Comparator (MSC). In the CSA, we constrain the implicit alignment between corresponding gloss and pose sequences by computing the cosine similarity association matrix between cross-modal feature sequences (i.e., the order…
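
The cosine similarity association matrix mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cosine_similarity_matrix`, the toy feature values, and the `eps` guard are all assumptions made for illustration, and the sketch covers only the similarity matrix itself, not the monotonic constraint that the paper applies to it.

```python
import math


def cosine_similarity_matrix(gloss_feats, pose_feats, eps=1e-8):
    """Return an N x T matrix of cosine similarities.

    Entry [i][j] compares gloss feature i with pose feature j.
    Hypothetical helper for illustration; not taken from the paper.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    matrix = []
    for g in gloss_feats:
        g_norm = norm(g)
        row = []
        for p in pose_feats:
            dot = sum(a * b for a, b in zip(g, p))
            # eps avoids division by zero for all-zero feature vectors
            row.append(dot / max(g_norm * norm(p), eps))
        matrix.append(row)
    return matrix


# Toy data (assumed): 2 gloss features and 3 pose-frame features, each 2-D
gloss_feats = [[1.0, 0.0], [0.0, 1.0]]
pose_feats = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
sim = cosine_similarity_matrix(gloss_feats, pose_feats)
```

In this toy case the first gloss is most similar to the first pose frame and the second gloss to the last one, so the highest-similarity entries follow a left-to-right, top-to-bottom order, which is the kind of pattern a monotonic alignment constraint would favor.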

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Robotics and Automated Systems