TL;DR
This paper introduces FlashSign, a real-time sign language video generation framework that is pose-free, diffusion-based, and incorporates a novel attention mechanism to improve efficiency and quality.
Contribution
The work presents a pose-free, diffusion-based sign language video generator with a trainable attention mechanism, achieving 3.07x faster inference without quality loss.
Findings
Increases video generation speed by 3.07x
Eliminates reliance on pose estimation for sign language synthesis
Maintains high quality in real-time sign language video generation
Abstract
Sign language plays a crucial role in bridging communication gaps for the deaf and hard-of-hearing communities. However, existing sign language video generation models often rely on complex intermediate representations, which limits their flexibility and efficiency. In this work, we propose a novel pose-free framework for real-time sign language video generation. Our method eliminates the need for intermediate pose representations by directly mapping natural language text to sign language videos using a diffusion-based approach. We introduce two key innovations: (1) a pose-free generative model based on a state-of-the-art diffusion backbone, which learns implicit text-to-gesture alignments without pose estimation, and (2) a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns. Unlike previous training-free…
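The abstract excerpt does not spell out how T-STA is built, so the following is only a rough illustration of the general idea of tile-based local attention over a spatio-temporal token grid, not the authors' implementation. It assumes video tokens laid out on a (T, H, W) grid, grouped into fixed-size tiles, with each query tile attending only to key tiles within a small window around it. The function names, tile sizes, and window sizes are hypothetical, and the trainable aspect of T-STA is not modeled here.

```python
import numpy as np

def sliding_tile_mask(grid, tile, window):
    """Boolean (N, N) mask for N = T*H*W tokens (illustrative assumption).

    grid:   (T, H, W) size of the token grid
    tile:   (tt, th, tw) tile size in tokens
    window: (wt, wh, ww) window size in tiles, odd numbers
    A query may attend to a key when the key's tile lies within the
    window centred on the query's tile. Windows are simply truncated
    at the grid border in this sketch.
    """
    T, H, W = grid
    axes = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack(axes, axis=-1).reshape(-1, 3)   # (N, 3) token coordinates
    tile_idx = coords // np.array(tile)               # (N, 3) tile coordinates
    diff = np.abs(tile_idx[:, None, :] - tile_idx[None, :, :])
    radius = np.array(window) // 2
    return np.all(diff <= radius, axis=-1)

def masked_attention(q, k, v, mask):
    """Plain softmax attention restricted to the positions allowed by mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: a 4x4x4 token grid with 2x2x2 tiles and a one-tile window,
# so each token attends only to the 8 tokens in its own tile.
rng = np.random.default_rng(0)
mask = sliding_tile_mask((4, 4, 4), (2, 2, 2), (1, 1, 1))
q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
out = masked_attention(q, k, v, mask)
```

A dense mask like this one saves no computation by itself; it only shows which query-key pairs a tile-local pattern would keep. Any real speedup depends on a kernel that skips the masked tiles entirely.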
