# Learning Individual Styles of Conversational Gesture

**Authors:** Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik

arXiv: 1906.04160 · 2019-06-11

## TL;DR

This paper presents a model that generates plausible hand and arm gestures from a speaker's speech audio. It is trained on unlabeled videos, using the output of an automatic pose detector as noisy pseudo ground truth.

## Contribution

The paper introduces a cross-modal translation model, trained on unlabeled videos, that generates person-specific hand and arm gestures from speech audio.

## Key findings

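- The model significantly outperforms baseline methods in a quantitative comparison
- The authors release a large video dataset of person-specific gestures to support research on the relationship between gesture and speech
- The model translates "in-the-wild" monologue speech of a single speaker into that speaker's hand and arm motion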

## Abstract

Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures. The project website with video, code and data can be found at http://people.eecs.berkeley.edu/~shiry/speech2gesture.
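
The abstract describes comparing generated gestures against pseudo ground truth from a pose detector. As a minimal sketch of what such a comparison could look like, the function below computes a mean absolute (L1) coordinate error between two pose sequences. The abstract does not state which metric or training objective the paper uses, so the choice of L1 here is an assumption, and the names `l1_pose_error`, `Keypoint`, `Pose`, and `PoseSequence` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical types for illustration only; not taken from the paper's code.
Keypoint = Tuple[float, float]   # (x, y) image coordinates of one joint
Pose = List[Keypoint]            # all tracked joints in one video frame
PoseSequence = List[Pose]        # one pose per frame


def l1_pose_error(predicted: PoseSequence, pseudo_gt: PoseSequence) -> float:
    """Mean absolute coordinate error between two pose sequences.

    Assumption: L1 distance is used purely as an example metric.
    """
    if len(predicted) != len(pseudo_gt):
        raise ValueError("sequences must have the same number of frames")
    total = 0.0
    count = 0
    for pred_pose, gt_pose in zip(predicted, pseudo_gt):
        if len(pred_pose) != len(gt_pose):
            raise ValueError("poses must have the same number of keypoints")
        for (px, py), (gx, gy) in zip(pred_pose, gt_pose):
            # Accumulate the absolute difference of each coordinate.
            total += abs(px - gx) + abs(py - gy)
            count += 2
    if count == 0:
        raise ValueError("sequences must not be empty")
    return total / count


# Toy example: one frame with two keypoints.
predicted = [[(0.0, 0.0), (1.0, 1.0)]]
pseudo_gt = [[(1.0, 0.0), (1.0, 3.0)]]
error = l1_pose_error(predicted, pseudo_gt)
```

Lower values mean closer agreement with the detector output. Because the reference poses are themselves detector estimates rather than manual annotations, any such error is measured against a noisy target.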

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.04160/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/1906.04160/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/1906.04160/full.md

---
Source: https://tomesphere.com/paper/1906.04160