ACT2G: Attention-based Contrastive Learning for Text-to-Gesture Generation
Hitoshi Teshima, Naoki Wake, Diego Thomas, Yuta Nakashima, Hiroshi Kawasaki, Katsushi Ikeuchi

TL;DR
This paper introduces ACT2G, a novel attention-based contrastive learning method for generating content-representative gestures from text, improving realism and diversity in avatar communication.
Contribution
It proposes a new contrastive learning approach that aligns text and gesture features in a shared latent space, enabling content-aware gesture generation.
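To make the idea of aligning text and gesture features in a shared latent space concrete, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings. This is an assumption for illustration, not the loss defined in the paper: the function names, the temperature value, and the toy 2-D embeddings are all hypothetical, and the summary above does not state the exact objective ACT2G uses.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (a zero vector is left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_loss(text_feats, gesture_feats, temperature=0.1):
    """Symmetric InfoNCE-style loss (illustrative, not the paper's exact loss).

    text_feats[i] and gesture_feats[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    t = [l2_normalize(v) for v in text_feats]
    g = [l2_normalize(v) for v in gesture_feats]
    n = len(t)

    # sim[i][j]: temperature-scaled cosine similarity of text i and gesture j.
    sim = [
        [sum(a * b for a, b in zip(t[i], g[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy(rows):
        # Cross-entropy where the correct "class" for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the text-to-gesture and gesture-to-text directions.
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(sim_transposed))


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the pairing swapped.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(text, [[0.9, 0.1], [0.1, 0.9]])
swapped = contrastive_loss(text, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy batch the loss is small when each text embedding sits closest to its own gesture embedding and large when the pairs are swapped, which is the behavior a shared latent space relies on.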
Findings
User study shows ACT2G outperforms existing methods
Generated gestures better reflect text content
Wide variation in gestures from same text demonstrated
Abstract
The recent increase in remote work, online meetings, and tele-operation tasks has made people realize that gestures for avatars and communication robots are more important than previously thought. Gesture is one of the key factors in achieving smooth and natural communication between humans and AI systems, and it has been intensively researched. Current gesture generation methods are mostly based on deep neural networks that take text, audio, and other information as input; however, they generate gestures mainly from audio, and such gestures are called beat gestures. Although beat gestures account for more than 70% of actual human gestures, content-based gestures sometimes play an important role in making avatars more realistic and human-like. In this paper, we propose attention-based contrastive learning for text-to-gesture (ACT2G), where generated gestures represent content of the text by estimating attention weight…
Taxonomy
Topics: Hand Gesture Recognition Systems · Speech and dialogue systems · Hearing Impairment and Communication
