TL;DR
This paper presents a deep learning model that enables social robots to estimate the intended addressee of human utterances by analyzing non-verbal cues, facilitating more natural human-robot interactions.
Contribution
It introduces a hybrid deep learning approach combining CNNs and LSTMs for addressee estimation using visual and bodily cues, optimized for deployment on social robots.
Findings
Model accurately localizes addressees in space from the robot's perspective
Effective use of face images and body posture vectors for addressee detection
Potential for improved social robot interaction capabilities
Abstract
Communicating shapes our social world. For a robot to be considered social and consequently be integrated in our social environment, it is fundamental to understand some of the dynamics that rule human-human communication. In this work, we tackle the problem of Addressee Estimation, the ability to understand an utterance's addressee, by interpreting and exploiting non-verbal bodily cues from the speaker. We do so by implementing a hybrid deep learning model composed of convolutional layers and LSTM cells, taking as input images portraying the face of the speaker and 2D vectors of the speaker's body posture. Our implementation choices were guided by the aim to develop a model that could be deployed on social robots and be efficient in ecological scenarios. We demonstrate that our model is able to solve the Addressee Estimation problem in terms of addressee localisation in space, from a…
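To make the data flow described in the abstract concrete, the following is a minimal forward-pass sketch of a hybrid convolutional + LSTM model that consumes a sequence of face images and 2D pose vectors. It is not the authors' architecture: the layer sizes, the 16x16 grayscale face crops, the 18-dimensional pose vector, fusion by concatenation, the three output classes, and every function and parameter name are assumptions chosen for illustration, and the weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not taken from the paper).
K, KH, POSE_DIM, ENC, HID, N_CLASSES = 4, 3, 18, 8, 16, 3

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_valid(img, kernels):
    """Naive 'valid' 2D cross-correlation: (H, W) image, (K, kh, kw) kernels."""
    k, kh, kw = kernels.shape
    h, w = img.shape
    out = np.empty((k, h - kh + 1, w - kw + 1))
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = img[i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return out

def lstm_step(x, h, c, W, b):
    """One LSTM cell update with sigmoid gates and tanh candidate/output."""
    z = W @ np.concatenate([x, h]) + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

params = {
    "kernels": rng.normal(0, 0.1, (K, KH, KH)),
    "W_pose": rng.normal(0, 0.1, (ENC, POSE_DIM)),
    "W_lstm": rng.normal(0, 0.1, (4 * HID, K + ENC + HID)),
    "b_lstm": np.zeros(4 * HID),
    "W_out": rng.normal(0, 0.1, (N_CLASSES, HID)),
}

def estimate_addressee(faces, poses, p):
    """faces: (T, H, W), poses: (T, POSE_DIM) -> probabilities over N_CLASSES."""
    h, c = np.zeros(HID), np.zeros(HID)
    for face, pose in zip(faces, poses):
        # Face branch: convolution, ReLU, global average pooling per kernel.
        face_feat = np.maximum(conv2d_valid(face, p["kernels"]), 0).mean(axis=(1, 2))
        # Pose branch: a single linear layer with ReLU.
        pose_feat = np.maximum(p["W_pose"] @ pose, 0)
        # Assumed fusion: concatenate both feature vectors for each frame.
        x = np.concatenate([face_feat, pose_feat])
        h, c = lstm_step(x, h, c, p["W_lstm"], p["b_lstm"])
    logits = p["W_out"] @ h
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

# Ten random frames stand in for one utterance.
faces = rng.random((10, 16, 16))
poses = rng.random((10, POSE_DIM))
probs = estimate_addressee(faces, poses, params)
print(probs.shape, float(probs.sum()))
```

The output is a probability distribution over the assumed addressee locations. A trained model would learn the weights from labelled interaction data, which this sketch does not attempt.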
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
