Human-Centered Editable Speech-to-Sign-Language Generation via Streaming Conformer-Transformer and Resampling Hook
Yingchao Li

TL;DR
This paper introduces a real-time, human-centered speech-to-sign-language system that supports user editing and continuous model refinement, improving the naturalness, expressivity, and user trust of sign-language animation.
Contribution
It presents a novel streaming Conformer-Transformer architecture with an editable JSON interface and a feedback loop for personalized, low-latency sign-language generation.
Findings
Achieved a 13 ms average frame-inference time and a 103 ms end-to-end latency on an RTX 4070.
Raised user satisfaction by 13 SUS (System Usability Scale) points and reduced cognitive load.
Significant enhancements in naturalness and trust over baseline systems.
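As a rough sanity check on the timing figures above, the following sketch converts the 13 ms per-frame inference time into a maximum sustainable frame rate and expresses the 103 ms end-to-end latency in units of display frames. The 30 fps playback target is an assumption chosen for illustration; it is not a figure reported in the paper.

```python
# Timing figures reported in the summary above.
FRAME_INFERENCE_MS = 13.0
END_TO_END_LATENCY_MS = 103.0

# Assumption for illustration only: the avatar is rendered at 30 fps.
ASSUMED_TARGET_FPS = 30.0


def max_fps(frame_ms: float) -> float:
    """Highest frame rate sustainable if each frame costs frame_ms of inference."""
    return 1000.0 / frame_ms


def frame_budget_ms(fps: float) -> float:
    """Time available per frame at the given playback rate."""
    return 1000.0 / fps


max_rate = max_fps(FRAME_INFERENCE_MS)
budget = frame_budget_ms(ASSUMED_TARGET_FPS)
headroom = budget - FRAME_INFERENCE_MS
latency_in_frames = END_TO_END_LATENCY_MS / budget

print(f"max sustainable rate: {max_rate:.1f} fps")
print(f"per-frame budget at {ASSUMED_TARGET_FPS:.0f} fps: {budget:.1f} ms")
print(f"headroom per frame: {headroom:.1f} ms")
print(f"end-to-end latency: {latency_in_frames:.2f} frames")
```

Under that assumed playback rate, inference fits inside the per-frame budget with room to spare, and the end-to-end delay amounts to roughly three displayed frames.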
Abstract
Existing end-to-end sign-language animation systems suffer from low naturalness, limited facial/body expressivity, and no user control. We propose a human-centered, real-time speech-to-sign animation framework that integrates (1) a streaming Conformer encoder with an autoregressive Transformer-MDN decoder for synchronized upper-body and facial motion generation, (2) a transparent, editable JSON intermediate representation empowering deaf users and experts to inspect and modify each sign segment, and (3) a human-in-the-loop optimization loop that refines the model based on user edits and ratings. Deployed on Unity3D, our system achieves a 13 ms average frame-inference time and a 103 ms end-to-end latency on an RTX 4070. Our key contributions include the design of a JSON-centric editing mechanism for fine-grained sign-level personalization and the first application of an MDN-based…
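The abstract does not show the JSON intermediate representation itself, so the following is only a minimal sketch of how an editable per-segment representation and an edit log for the human-in-the-loop step might look. Every field name (`gloss`, `start_ms`, `end_ms`, `facial`, `intensity`) and the `apply_edit` helper are hypothetical and are not taken from the paper.

```python
import copy
import json

# Hypothetical schema: these field names are illustrative assumptions,
# not the representation used by the authors.
segments_json = """
[
  {"id": 0, "gloss": "HELLO", "start_ms": 0, "end_ms": 480,
   "facial": "neutral", "intensity": 0.5},
  {"id": 1, "gloss": "THANK-YOU", "start_ms": 480, "end_ms": 1100,
   "facial": "smile", "intensity": 0.7}
]
"""


def apply_edit(segments, segment_id, changes, rating=None):
    """Apply a user's edit to one sign segment.

    Returns (edited_segments, feedback_record). The input list is not
    mutated, so the original model output stays available for comparison.
    """
    edited = copy.deepcopy(segments)
    for seg in edited:
        if seg["id"] == segment_id:
            unknown = set(changes) - set(seg)
            if unknown:
                raise KeyError(f"unknown fields: {sorted(unknown)}")
            before = dict(seg)
            seg.update(changes)
            # A before/after pair plus an optional rating is the kind of
            # record a feedback loop could later use for refinement.
            record = {
                "segment_id": segment_id,
                "before": before,
                "after": dict(seg),
                "rating": rating,
            }
            return edited, record
    raise ValueError(f"no segment with id {segment_id}")


segments = json.loads(segments_json)
edited, feedback = apply_edit(
    segments, 1, {"facial": "raised_brows", "intensity": 0.9}, rating=4
)
print(json.dumps(edited[1], indent=2))
```

Keeping the edit as a separate before/after record, rather than overwriting the model output, is one way to make user corrections usable as training signal while leaving the original prediction inspectable.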
Taxonomy
Topics: Hand Gesture Recognition Systems · Speech and dialogue systems · Social Robot Interaction and HRI
