Human-Centered Editable Speech-to-Sign-Language Generation via Streaming Conformer-Transformer and Resampling Hook

Yingchao Li

arXiv:2506.14677 · cs.HC · June 25, 2025

TL;DR

This paper introduces a real-time, human-centered speech-to-sign-language system that lets users edit the generated signing and feeds those edits back into continuous model refinement, improving the naturalness, expressivity, and user trust of sign-language animation.

Contribution

It presents a novel streaming Conformer-Transformer architecture with an editable JSON interface and a feedback loop for personalized, low-latency sign-language generation.

Findings

01 Achieved a 13 ms average frame-inference time and a 103 ms end-to-end latency on an RTX 4070.

02 Raised System Usability Scale (SUS) scores by 13 points and reduced cognitive load.

03 Significantly improved naturalness and trust relative to baseline systems.
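To put the reported timings in perspective, the short sketch below converts them into a frame rate and a frame count. Only the 13 ms and 103 ms figures come from the paper; the 60 fps target and the helper names are assumptions chosen for illustration, and the calculation ignores any rendering cost outside model inference.

```python
# Reported figures from the findings above.
FRAME_INFERENCE_MS = 13.0   # average frame-inference time
END_TO_END_MS = 103.0       # end-to-end latency

# Assumed display target, not stated in the source.
TARGET_FPS = 60


def max_frame_rate(frame_ms: float) -> float:
    """Upper bound on frames per second if inference were the only cost."""
    return 1000.0 / frame_ms


def fits_budget(frame_ms: float, target_fps: float) -> bool:
    """True if one inference step fits inside one frame at target_fps."""
    return frame_ms <= 1000.0 / target_fps


fps_bound = max_frame_rate(FRAME_INFERENCE_MS)
latency_in_frames = END_TO_END_MS / FRAME_INFERENCE_MS

print(f"inference-bound frame rate: {fps_bound:.1f} fps")
print(f"fits a {TARGET_FPS} fps budget: {fits_budget(FRAME_INFERENCE_MS, TARGET_FPS)}")
print(f"end-to-end latency in inference steps: {latency_in_frames:.1f}")
```

Under these assumptions, a 13 ms step leaves headroom within a 60 fps frame budget (about 16.7 ms), and the 103 ms end-to-end latency corresponds to roughly eight inference steps.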

Abstract

Existing end-to-end sign-language animation systems suffer from low naturalness, limited facial/body expressivity, and no user control. We propose a human-centered, real-time speech-to-sign animation framework that integrates (1) a streaming Conformer encoder with an autoregressive Transformer-MDN decoder for synchronized upper-body and facial motion generation, (2) a transparent, editable JSON intermediate representation empowering deaf users and experts to inspect and modify each sign segment, and (3) a human-in-the-loop optimization loop that refines the model based on user edits and ratings. Deployed on Unity3D, our system achieves a 13 ms average frame-inference time and a 103 ms end-to-end latency on an RTX 4070. Our key contributions include the design of a JSON-centric editing mechanism for fine-grained sign-level personalization and the first application of an MDN-based…
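The editable JSON intermediate representation in item (2) can be illustrated with a minimal sketch. The abstract does not give the schema, so every field name here (`gloss`, `start_ms`, `end_ms`, `intensity`) and the `apply_edit` helper are hypothetical. The sketch only shows the general idea of a user changing one sign segment while the system keeps a before/after record that a human-in-the-loop refinement step could consume.

```python
import copy
import json

# Hypothetical segment layout; the paper's actual schema is not given here.
SEGMENTS_JSON = """
[
  {"gloss": "HELLO", "start_ms": 0, "end_ms": 400, "intensity": 0.5},
  {"gloss": "THANK-YOU", "start_ms": 400, "end_ms": 900, "intensity": 0.5}
]
"""


def apply_edit(segments, index, **changes):
    """Return an edited copy of `segments` plus a record of the change.

    The input list is left untouched so the original model output can
    still be compared against the user's correction.
    """
    edited = copy.deepcopy(segments)
    before = dict(edited[index])
    unknown = set(changes) - set(before)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    edited[index].update(changes)
    record = {"index": index, "before": before, "after": dict(edited[index])}
    return edited, record


segments = json.loads(SEGMENTS_JSON)
edited, record = apply_edit(segments, 1, intensity=0.8)
print(json.dumps(record, indent=2))
```

Keeping the original segments immutable and logging each edit as a before/after pair is one simple way to make user corrections usable as training signal; the paper may structure its feedback loop differently.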

Taxonomy

Topics: Hand Gesture Recognition Systems · Speech and dialogue systems · Social Robot Interaction and HRI