Beyond the Self: Using Grounded Affordances to Interpret and Describe Others' Actions

Giovanni Saponaro; Lorenzo Jamone; Alexandre Bernardino; Giampiero Salvi

arXiv:1902.09705·cs.RO·June 12, 2020

TL;DR

This paper presents a developmental approach enabling robots to interpret and describe human actions by leveraging learned object affordances and their own experiences, facilitating social collaboration.

Contribution

It introduces a method for robots to reuse experience to interpret human actions and generate scene descriptions, integrating affordance learning with action recognition.

Findings

1. The model can predict the effects of actions based on object properties.
2. It can revise beliefs about actions from observed effects.
3. It generates relevant scene descriptions using probabilistic inference.

Tables

Table I: The symbolic variables of the Bayesian Network (from [16]), with the corresponding discrete values obtained from clustering during robot exploration of the environment. We call word variables the booleans of the last row, whereas we call affordance variables all the other symbols.

symbol	name: description	values
$a$	Action: motor action	grasp, tap, touch
$f_{1}$	Color: object color	blue, yellow, green1, green2
$f_{2}$	Size: object size	small, medium, big
$f_{3}$	Shape: object shape	sphere, box
$e_{1}$	ObjVel: object velocity	slow, medium, fast
$e_{2}$	HandVel: robot hand velocity	slow, fast
$e_{3}$	ObjHandVel: relative object–hand velocity	slow, medium, fast
$e_{4}$	Contact: object hand contact	short, long
$w_{1}$ – $w_{49}$	presence of each word in the verbal description	true, false

Table II: 10-best list of sentences generated from the evidence $X_{\text{obs}}=\{\text{Color=yellow, Size=big, Shape=sphere, ObjVel=fast}\}$.

sentence	score
“the robot pushed the ball and the ball moves”	$- 0.54322$
“the robot tapped the sphere and the sphere moves”	$- 0.5605$
“he is pushing the sphere and the sphere moves”	$- 0.57731$
“the robot is tapping the yellow ball and the big yellow sphere is moving”	$- 0.57932$
“he pushed the yellow ball and the sphere is rolling”	$- 0.58853$
“the robot is poking the ball and the sphere is rolling”	$- 0.58998$
“he is pushing the ball and the yellow ball moves”	$- 0.59728$
“he pushes the sphere and the ball is moving”	$- 0.60528$
“he is tapping the yellow ball and the ball is moving”	$- 0.60675$
“the robot pokes the sphere and the ball is rolling”	$- 0.60694$

Equations

$$P_{\text{HMM}}(A=a_{k}\mid G_{1}^{T})=\frac{\mathcal{L}_{\text{HMM}}(G_{1}^{T}\mid A=a_{k})}{\sum_{h}\mathcal{L}_{\text{HMM}}(G_{1}^{T}\mid A=a_{h})}.$$

$$P_{\text{BN}}(X_{\text{inf}}\mid X_{\text{obs}})=\sum_{X_{\text{lat}}}P_{\text{BN}}(X_{\text{inf}},X_{\text{lat}}\mid X_{\text{obs}}).$$

$$\begin{aligned}
P_{\text{comb}}(X_{\text{inf}}\mid X_{\text{obs}},G_{1}^{T}) &= P_{\text{comb}}(A,X_{\text{inf}}^{\prime}\mid X_{\text{obs}},G_{1}^{T}) \\
&= \sum_{X_{\text{lat}}}P_{\text{comb}}(A,X_{\text{inf}}^{\prime},X_{\text{lat}}\mid X_{\text{obs}},G_{1}^{T}) \\
&= \sum_{X_{\text{lat}}}\left[P_{\text{BN}}(A,X_{\text{inf}}^{\prime},X_{\text{lat}}\mid X_{\text{obs}},G_{1}^{T})\,P_{\text{HMM}}(A,X_{\text{inf}}^{\prime},X_{\text{lat}}\mid X_{\text{obs}},G_{1}^{T})\right] \\
&= \left[\sum_{X_{\text{lat}}}P_{\text{BN}}(A,X_{\text{inf}}^{\prime},X_{\text{lat}}\mid X_{\text{obs}})\right]P_{\text{HMM}}(A\mid G_{1}^{T}) \\
&= P_{\text{BN}}(X_{\text{inf}}\mid X_{\text{obs}})\,P_{\text{HMM}}(A\mid G_{1}^{T}).
\end{aligned}$$

$$\begin{aligned}
P_{\text{comb}}(X_{\text{inf}}\mid X_{\text{obs}},G_{1}^{T}) &= \sum_{A,X_{\text{lat}}^{\prime}}P_{\text{comb}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}},G_{1}^{T}) \\
&= \sum_{A,X_{\text{lat}}^{\prime}}\left[P_{\text{BN}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}},G_{1}^{T})\,P_{\text{HMM}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}},G_{1}^{T})\right] \\
&= \sum_{A,X_{\text{lat}}^{\prime}}\left[P_{\text{BN}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}})\,P_{\text{HMM}}(A\mid G_{1}^{T})\right] \\
&= \sum_{A}P_{\text{HMM}}(A\mid G_{1}^{T})\sum_{X_{\text{lat}}^{\prime}}P_{\text{BN}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}}) \\
&= \sum_{A}\left[P_{\text{HMM}}(A\mid G_{1}^{T})\,P_{\text{BN}}(X_{\text{inf}},A\mid X_{\text{obs}})\right].
\end{aligned}$$

$$\text{score}(s_{j}\mid X_{\text{obs}},G_{1}^{t})=\frac{1}{L_{j}}\sum_{k=1}^{L_{j}}\log P(w_{jk}\mid X_{\text{obs}},G_{1}^{t}),$$

Code & Models

Repositories

gsaponaro/tcds-gestures (official)

Full text

Acronyms: AI (Artificial Intelligence), BN (Bayesian Network), CFG (context-free grammar), HMM (Hidden Markov Model), OAC (Object–Action Complex), PDF (Probability Density Function).

Beyond the Self: Using Grounded Affordances to Interpret and Describe Others’ Actions

Giovanni Saponaro, Lorenzo Jamone, Alexandre Bernardino, Giampiero Salvi

Manuscript received November 15, 2017; revised September 6, 2018; accepted November 14, 2018. This research was supported by the FCT projects UID/EEA/50009/2013, AHA CMUP-ERI/HCI/0046/2013 and by the CHIST-ERA project IGLU. G. Saponaro and A. Bernardino are with the Institute for Systems and Robotics, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal, e-mail: {gsaponaro,alex}@isr.tecnico.ulisboa.pt. L. Jamone is with ARQ (Advanced Robotics at Queen Mary), School of Electronic Engineering and Computer Science, Queen Mary University of London, United Kingdom and with the Institute for Systems and Robotics, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal, e-mail: [email protected]. G. Salvi is with KTH Royal Institute of Technology, Stockholm, Sweden, e-mail: [email protected].

Abstract

We propose a developmental approach that allows a robot to interpret and describe the actions of human agents by reusing previous experience. The robot first learns the association between words and object affordances by manipulating the objects in its environment. It then uses this information to learn a mapping between its own actions and those performed by a human in a shared environment. It finally fuses the information from these two models to interpret and describe human actions in light of its own experience. In our experiments, we show that the model can be used flexibly to do inference on different aspects of the scene. We can predict the effects of an action on the basis of object properties. We can revise the belief that a certain action occurred, given the observed effects of the human action. In an early action recognition fashion, we can anticipate the effects when the action has only been partially observed. By estimating the probability of words given the evidence and feeding them into a pre-defined grammar, we can generate relevant descriptions of the scene. We believe that this is a step towards providing robots with the fundamental skills to engage in social collaboration with humans.

Index Terms:

affordances, embodied cognition, gestures, humanoid robots, language acquisition through development.

I Introduction

Cooperation, or the ability to work successfully in groups, is a tenet of human society [1]. This skill is acquired by human children incrementally, around the second year of life, as they develop the ability to coordinate themselves with peers or adult caregivers in shared problem-solving activities and social games [2]. This is achieved not only by mere behavioral coordination, but also by employing communicative strategies [3] and by continuously observing partners’ actions [4]. Loosely inspired by these observations, this article presents and evaluates a cognitive system for robots that reasons in successive phases: first about self-learned knowledge (about affordances and language-based descriptions of objects), and then about others’ actions.

Even though social robots (a social robot is “[a robot that is] able to communicate and interact with us, understand and even relate to us, in a personal way. [It] should be able to understand us and itself in social terms” [5]) are becoming common in domestic and public environments, human–robot teams still lag behind human–human teams in terms of effectiveness. For robots, interpreting the actions of others and learning to describe them verbally (for effective cooperation) is challenging. The reason is that we cannot possibly model all the imaginable physical, verbal and non-verbal (e.g., gestures) cues that can take place during human–robot interaction, due to the richness of language and the high variability of the real world outside of structured research laboratories and factories. Hence, it is necessary to have robots that learn world elements and properties of language [6], and that acquire the ability to link these verbal elements with other skills, such as other perceptual modalities (e.g., vision of objects and other agents) and manipulation abilities (e.g., grasping objects and placing them in order to achieve a goal) [7].

Our work builds upon the intuition that a robot can generalize its previously acquired knowledge of the world (e.g., motor actions, object properties, physical effects, verbal descriptions) to those situations where it observes a human agent performing familiar actions in a shared human–robot scenario. We follow the developmental robotics perspective [8, 9], which takes inspiration from the progressive learning phenomena observed in children’s mental development (e.g., the understanding of language, the acquisition of manipulation skills, the comprehension of others’ actions), and investigates how to model the evolution and acquisition of these increasingly complex cognitive processes in artificial autonomous systems.

In particular, we are inspired by the possible existence of a shared representation for self-related and others-related knowledge in the human brain [10, 11, 12], and we look at the developmental stages in which human children have consolidated an idea of self–other distinction [13] and start to reason about the external world also in allocentric terms [14], in addition to the ego-centric ones, and could therefore possibly begin to use knowledge about the self to infer about others.

Extending our recent work [15], in this article we combine robot ego-centric learning about language and object affordances [16] with the observation of external agents through gesture recognition [17]. Our novel contributions are: (i) a probabilistic method to fuse self-learned knowledge of language and object affordances with socially aware information about others’ physical actions (in the form of uncertain soft evidence); (ii) experimental findings showing the reasoning power of our combined system, which is able to make inferences and predictions over affordances and words; and (iii) the possibility of generating verbal descriptions from the estimated word probabilities and a pre-defined grammar, with emergence of non-trivial language properties such as congruent/incongruent conjunctions and synonyms between two consecutive sentences speaking about the same concepts. Furthermore, in the interest of reproducibility, we make our human action data and probabilistic reasoning code publicly available (https://github.com/giampierosalvi/AffordancesAndSpeech: the code from [16], extended to support the experiments in this study; https://github.com/gsaponaro/tcds-gestures: code from this paper).

This article is structured as follows. In Sec. II we briefly overview the literature on the interpretation and verbal description of others in different disciplines, in Sec. III we present our proposed method and its components, in Sec. IV we provide details and assumptions of the approach, Sec. V illustrates our results, and in Sec. VI we draw our concluding remarks.

II Related Work

Human cooperation is a phenomenon that we often take for granted (at least in adults), possibly because it is widespread and intimately embedded into human societies. However, this non-trivial skill is greatly facilitated, and influenced, by human language [18]. For instance, educational research has shown that, when language is used as a cultural tool for intellectual tasks in preteen students, discursive interaction enables collective thinking to become more effective, also fostering individual reasoning and faster learning [19].

The ability to understand and interpret our peers has also been studied in neuroscience and psychology, focusing on internal simulations and re-enactments of previous experiences [20, 21], or on visuomotor neurons [11], i.e., neurons that are activated by visual stimuli. Mirror neurons respond to action and object interaction, both when the agent acts and when it observes the same action performed by others, hence the name “mirror”. They are based on the principle that perceptual input can be linked with the human action system for predicting future outcomes of actions, i.e., the effect of actions, particularly when the person possesses concrete prior personal experience of the actions being observed in others [22, 23].

In applying the mirror neuron theory in robotics, as we and others do [24, 25], an agent can first acquire knowledge by sensing and self-exploring its surrounding environment. Afterwards, it can apply that learned knowledge to novel observations of another agent (e.g., a human person) who performs physical actions similar to the ones executed during prior training. In particular, when the two interacting agents are a caregiver and an infant, the mechanism is called parental scaffolding, and it has been implemented on robots too [26, 27]. These works tackle the so-called correspondence problem [28], in our case in a simple collaboration scenario, assuming that the two agents are capable of applying actions to objects leading to similar effects, enabling the transfer, and that they operate on a shared space (i.e., a table accessible by both agents’ arms). The morphology and the motor realization of the actions can be different between the two agents.

Some authors have studied the ability to interpret other agents under the deep learning paradigm. In [29], a recurrent neural network is proposed to have an artificial simulated agent infer human intention (as output) from joint input information about objects, their potential affordances or opportunities, and human actions, employing different time scales for different actions. However, in that work a virtual simulation able to produce large quantities of data was used. This is both unrealistic when trying to explain human cognition, and limited, because a simulator cannot model all the physical events and the unpredictability of the real world. In contrast, we use real, noisy data acquired from robots and sensors to validate our model. In addition, deep neural networks trained with large amounts of data can be difficult to inspect in their inner layers and activations [30], whereas our Bayesian model is focused on exhibiting emerging patterns of causality, choices, explanations from relatively few data points.

DeepMind and Google published a method [31] to perform relational reasoning on images, i.e., a system that learns to reflect about entities and their mutual relations, with the ability of providing answers to questions such as “Are there any rubber things that have the same size as the yellow metallic cylinder?”. That work is very powerful from the point of view of cognitive systems, vision and language. Our approach is different because (i) we focus on robotic cognitive systems, including manipulation and the uncertainties inherent to robot vision and control, and (ii) we follow the developmental paradigm and the embodiment hypothesis [8], meaning that, leveraging the fact that a human and a humanoid produce actions with similar effects, we relate words with the robot’s sensorimotor experience, rather than sensory only (purely images-to-text).

In robotics and cognitive systems research, both object-directed action recognition in external agents [32] and the incorporation of language in human–robot systems [33, 34] have received ample attention, for example using the concept of intuitive physics [35, 36] to be able to predict outcomes from real or simulated interactions with objects. A growing interest is devoted to robots that learn new cognitive skills and improve their capabilities by interacting autonomously with the surrounding environment. Robots operating in the real, unstructured world may understand available opportunities conditioned on their body, perception and sensorimotor experiences: the intersection of these elements gives rise to object affordances (action possibilities), as they are called in psychology [37]. The advantage of robot affordances lies in the ability to capture essential functional properties of environment objects in terms of the actions that the agent is able to perform with them, allowing to reason with prior knowledge about never-before-seen scenarios, thus exhibiting learning [38, 39] and some degree of online adaptation [40].

Zech et al. published a systematic taxonomy of robot affordance models [41]. According to their criteria (we refer the reader to the taxonomy for the precise definitions), in terms of perception our work classifies as using an agent perspective, meso-level features, 1st order, stable temporality; in terms of development: acquisition by exploration, prediction by inference, generalization exploitation by action selection and language, offline learning.

Several works have studied the potential coupling between learning robot affordances and language grounding. The union of these two elements can give new skills to cognitive robots, such as: creation of categorical concepts from multimodal association obtained by grasping and observing objects, while listening to partial verbal descriptions [42, 43]; associating spoken words with sensorimotor experience [16, 44]; linking language with sensorimotor representations [45]; or carrying out complex tasks (which require planning of a sequence of actions) expressed in natural language instructions to a robot [46].

In particular, Salvi et al. [16], which this paper extends, propose a joint model to learn robot affordances (i.e., relationships between actions, objects and resulting effects) together with word meanings. The data used for learning such a model is from robot manipulation experiments, acquired from an ego-centric perspective. Each experiment is associated with a number of alternative verbal descriptions uttered by two human speakers, for a total of $1270$ recordings. That framework assumes that the robot action is known a priori during the training phase (e.g., during a grasping action the robot knows with certainty that it is performing a grasp), and the resulting model can be used at testing to make inferences about the environment. In a recent work [15] we relaxed the assumption of knowing the action. We did this by merging the action estimation obtained from an external gesture recognizer [17] into the full model as hard evidence (i.e., certain evidence), meaning that the action was deterministic. By contrast, in this paper we propose a theoretical way to fuse the two sources of information (about the self and about others) in a fully probabilistic manner, therefore introducing soft evidence. This addition allows us to perform more fine-grained types of inference and reasoning than before: first, predictions over affordances and words when observing another agent with uncertainty; second, the generation of verbal descriptions from the estimated word probabilities, for easier human interpretation of the model’s explanations.

III Method

The purpose of our work is to model the development of language learning from self-centered, individualistic learning to socially aware learning. This transition happens gradually in subsequent phases. In the first phase, the system engages in manipulation activities with objects in its environment [38]. The robot learns object affordances by associating object properties, actions and the corresponding effects. In a second phase, the robot interacts with a human who uses spoken language to describe the robot’s activities [16]. Here, the robot interprets the meaning of the words, grounding them in the action–perception experience acquired so far. Although this phase can already be considered social for the presence of a human narrator, it is still self-centered, because the robot is still learning how to interpret its own actions. In the last phase, which is the contribution of this work, the system turns to observing human actions of a similar nature as the ones explored in the first phases (see examples in Fig. 1). The robot reuses the experience acquired in the first phases to interpret the new observations and to address the correspondence problem [28] between its own actions and the actions performed by the human. In this phase, human movements are interpreted using the experience acquired so far, and they are incorporated into the model using a statistical gesture recognizer [17].

Fig. 2 illustrates the probabilistic dependencies in the complete model and will be detailed in the following subsections.

For the transfer from robot self-centered knowledge to human knowledge to work, we assume that the same actions, performed on objects with the same properties, cause the same effects and are described by the same words. In other terms, all of the variables under consideration (which will be described in Sec. IV) are the link between robot and human.

In our theoretical formulation and in our implementation, we will hinge on the existence of the discrete Action variable, the value of which is known to the robot in the ego-centric phase of learning, but must be inferred when observing human actions. This variable connects all the other observable variables in the model: human gesture features, object properties, effect variables and words. This allows the robot to:

• use language in order to determine the mapping between human and own actions, and learn the corresponding perceptual models;

• in many cases, use the affordance variables to infer the above mapping even in the absence of verbal descriptions;

• once the perceptual models for human actions are acquired, use the complete model to do inference on any variable given some evidence.

In the remainder of this section, first we provide details, in Sec. III-A, about the probabilistic models enclosed in the Affordance–Words model box of Fig. 2. Then, in Sec. III-B we describe the gesture recognition method. Finally, in Sec. III-C we describe the way in which we combine evidence from the two models.

III-A Affordance–Words Model

We use a Bayesian probabilistic framework to allow a robot to ground the basic world behavior and the verbal descriptions associated with it. All variables in the model are discrete or are discretized from continuous sensory variables through clustering in a preliminary learning phase. The variables can be divided according to their use: action variable $A=\{a\}$, object feature variables $F=\{f_{1},f_{2},\dots\}$, effect variables $E=\{e_{1},e_{2},\dots\}$ and word variables $W=\{w_{1},w_{2},\dots\}$. Details on the specific variables used in this study are given in Sec. IV.

The Bayesian Network (BN) model [47] relates all these variables by means of the joint probability distribution $P_{\text{BN}}(A,F,E,W)$ , sketched by the Affordance–Words model box in Fig. 2. The dependency structure and the model parameters are estimated by the robot in an ego-centric way through interaction with the environment. As a consequence, during learning, the robot knows what action it is performing with certainty, and the variable $A$ assumes a deterministic value. During inference, the probability distribution of the variable $A$ can be inferred from evidence on the other variables. For example, if the robot is asked to make a spherical object roll, it will be able to select the action tap as most likely to obtain the desired effect, based on previous experience.

III-B Gesture Recognition

When observing a human performing an action, the value of the variable $A$ is known to the robot neither during learning nor during inference. During learning, we assume that the robot has not yet acquired a perceptual model of the gestures associated with the human actions. However, the value of $A$ can be inferred, either from a verbal description of the scene, or from the other affordance variables through the Affordance–Words model described earlier.

For example, suppose that the Affordance–Words model predicts that performing a tap action on a spherical object will result in a high velocity of the object. If the human performs an unknown action on a spherical object and obtains a high velocity, the robot will be able to infer that the action is most probably a tap, although it was not able to recognize the gesture associated with this action.

This information can be used to train our statistical gesture recognition system [17]. The system recognizes actions (from gesture features) and corresponds to the Gesture/Action recognition block in Fig. 2. It is based on Hidden Markov Models (HMMs) with Gaussian mixture models as emission probability distributions. Our input features are the 3D coordinates of the tracked human hand, indicated by the $g_{i}$ variables in Fig. 2. The coordinates are transformed to be centered on the person’s torso (to be invariant to the distance between the user and the sensor) and normalized to account for variability in amplitude (to be invariant to wide/emphatic vs narrow/subtle executions of the same action).
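The feature preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `normalize_hand_trajectory` is hypothetical, and the amplitude rule (dividing by the largest hand-to-torso distance in the recording) is an assumption, since the text does not state which normalization is used.

```python
import math

def normalize_hand_trajectory(hand_xyz, torso_xyz):
    """Center hand positions on the torso and scale to unit maximum amplitude.

    hand_xyz, torso_xyz: equal-length lists of (x, y, z) tuples, one per frame.
    The scaling rule (divide by the largest hand-torso distance) is an
    assumption made for illustration, not taken from the paper.
    """
    # Torso-centering removes the dependence on where the person stands.
    centered = [
        tuple(h - t for h, t in zip(hand, torso))
        for hand, torso in zip(hand_xyz, torso_xyz)
    ]
    # Amplitude scaling makes wide and narrow executions comparable.
    amplitude = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if amplitude == 0.0:
        return centered  # hand never left the torso origin; nothing to scale
    return [tuple(c / amplitude for c in p) for p in centered]
```

With this choice, translating the whole scene or uniformly scaling the torso-centered trajectory leaves the output features unchanged.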

The model for each action is a left-to-right HMM, where the transition model between the $Q$ discrete states $\mathcal{S}=\{s_{1},\dots,s_{Q}\}$ is structured so that states with a lower index represent events that occur earlier in time.
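To make the left-to-right constraint concrete, the sketch below builds a transition matrix in which each state can only persist or advance to the next one. The restriction to single-step advances and the initial value `stay_prob` are assumptions for illustration; the text only states that lower-index states represent earlier events.

```python
def left_to_right_transitions(num_states, stay_prob=0.6):
    """Build a Q x Q left-to-right transition matrix as nested lists.

    Each state may only stay put or advance to the next state, so all
    entries below the diagonal are zero. stay_prob is a hypothetical
    initial value; in practice it would be re-estimated from data.
    """
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        if i == num_states - 1:
            matrix[i][i] = 1.0  # the last state has nowhere to advance to
        else:
            matrix[i][i] = stay_prob
            matrix[i][i + 1] = 1.0 - stay_prob
    return matrix
```

Because transitions to lower-index states have zero probability, re-estimation procedures that scale existing probabilities keep them at zero, so the temporal ordering of the states is preserved during training.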

Although not expressed so far in the notation, the continuous variables $g_{i}$ are measured at regular time intervals. At a certain time step $t$, the $D$-dimensional feature vector can be expressed as $\bm{g}[t]=\{g_{1}[t],\dots,g_{D}[t]\}$. The input to the model is a sequence of $T$ such feature vectors $\bm{g}[1],\dots,\bm{g}[T]$, which we call $G_{1}^{T}$ for simplicity, where $T$ can vary for every recording.

At recognition (testing) time, we can use the models to estimate the likelihood of a new sequence of observations $G_{1}^{T}$ given each possible action, by means of the Forward–Backward inference algorithm. We can express this likelihood as $\mathcal{L}_{\text{HMM}}(G_{1}^{T}\mid A=a_{k})$ , where $a_{k}$ is one of the possible actions. By normalizing the likelihoods, assuming that the gestures are equally likely a priori, we can obtain the posterior probability of the action given the sequence of observations as

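$$P_{\text{HMM}}(A=a_{k}\mid G_{1}^{T})=\frac{\mathcal{L}_{\text{HMM}}(G_{1}^{T}\mid A=a_{k})}{\sum_{h}\mathcal{L}_{\text{HMM}}(G_{1}^{T}\mid A=a_{h})}.$$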
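The normalization of per-action likelihoods into a posterior, as described above, can be sketched in a few lines. The sketch assumes that each HMM reports a log-likelihood (a common choice for numerical reasons, not stated in the text); the function name `action_posterior` and the example values are hypothetical.

```python
import math

def action_posterior(log_likelihoods):
    """Turn per-action HMM log-likelihoods into a posterior over actions.

    log_likelihoods: dict mapping action name -> log L_HMM(G | A = action).
    A uniform prior over actions is assumed, as stated in the text.
    """
    peak = max(log_likelihoods.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    unnormalized = {a: math.exp(ll - peak) for a, ll in log_likelihoods.items()}
    total = sum(unnormalized.values())
    return {a: v / total for a, v in unnormalized.items()}

# Hypothetical log-likelihoods for one observed gesture sequence.
posterior = action_posterior({"grasp": -310.0, "tap": -300.0, "touch": -305.0})
```

Working with differences of log-likelihoods matters in practice: sequence likelihoods over many frames are typically far too small to exponentiate directly.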

III-C Combining the BN with Gesture HMMs

Once learned, the two models described above define two probability distributions over the relevant variables for the problem: $P_{\text{BN}}(A,F,E,W)$ and $P_{\text{HMM}}(A\mid G_{1}^{T})$ . The goal during inference is to merge the information provided by both models and estimate $P_{\text{comb}}(A,F,E,W\mid G_{1}^{T})$ , that is, the joint probability of all the affordance and word variables, given that we observe a certain action performed by the human.

To simplify the notation, we call $X=\{A,F,E,W\}$ the set of affordance and word variables $\{a,f_{1},f_{2},\dots,e_{1},e_{2},\dots,w_{1},w_{2},\dots\}$ . During inference, we have a (possibly empty) set of observed variables $X_{\text{obs}}\subseteq X$ , and a set of variables $X_{\text{inf}}\subseteq X$ on which we wish to perform the inference. In order for the inference to be non-trivial, it must be $X_{\text{obs}}\cap X_{\text{inf}}=\varnothing$ , that is, we should not observe any inference variable. According to the BN alone, the inference will compute the probability distribution of the inference variables $X_{\text{inf}}$ given the observed variables $X_{\text{obs}}$ by marginalizing over all the other (latent) variables $X_{\text{lat}}=X\setminus(X_{\text{obs}}\cup X_{\text{inf}})$ , where $\setminus$ is the set difference operation:

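$$P_{\text{BN}}(X_{\text{inf}}\mid X_{\text{obs}})=\sum_{X_{\text{lat}}}P_{\text{BN}}(X_{\text{inf}},X_{\text{lat}}\mid X_{\text{obs}}).$$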
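As an illustration of this marginalization, the sketch below performs the inference by brute force on an explicit joint table. This is for exposition only: a real Bayesian Network exploits its factorization instead of enumerating the joint, and the three-variable table `TOY_JOINT` uses invented probabilities, not values from the paper.

```python
from collections import defaultdict

def infer(joint, names, inf_names, observed):
    """Compute P(X_inf | X_obs) from a full joint table by enumeration.

    joint: dict mapping a tuple of values (ordered as in names) to a probability.
    names: tuple of variable names.
    inf_names: names of the variables to infer.
    observed: dict mapping variable name -> observed value.
    Variables that are neither inferred nor observed are summed out.
    """
    index = {name: i for i, name in enumerate(names)}
    totals = defaultdict(float)
    for assignment, prob in joint.items():
        if all(assignment[index[n]] == v for n, v in observed.items()):
            key = tuple(assignment[index[n]] for n in inf_names)
            totals[key] += prob  # marginalize over the latent variables
    evidence = sum(totals.values())
    if evidence == 0.0:
        raise ValueError("observed evidence has zero probability")
    return {key: p / evidence for key, p in totals.items()}

# Invented toy joint over (Action, Shape, ObjVel); numbers are illustrative only.
TOY_NAMES = ("Action", "Shape", "ObjVel")
TOY_JOINT = {
    ("tap", "sphere", "fast"): 0.20, ("tap", "sphere", "slow"): 0.05,
    ("tap", "box", "fast"): 0.05, ("tap", "box", "slow"): 0.20,
    ("grasp", "sphere", "fast"): 0.02, ("grasp", "sphere", "slow"): 0.23,
    ("grasp", "box", "fast"): 0.02, ("grasp", "box", "slow"): 0.23,
}

# Predict the effect given the shape; the action is latent and summed out.
effect_given_shape = infer(TOY_JOINT, TOY_NAMES, ("ObjVel",), {"Shape": "sphere"})
```

The same function answers the reverse query, for example the distribution over `Action` given an observed shape and velocity, by changing `inf_names` and `observed`.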

If we want to combine the evidence brought by the BN with the evidence brought by the HMM, there are two cases that can occur:

1) the action variable is included among the inference variables: $A\in X_{\text{inf}}$, or

2) the action variable is not included among the inference variables: $A\in X_{\text{lat}}$.

Here, we are excluding the case where we observe the action directly ( $A\in X_{\text{obs}}$ ) for two reasons. First, this would correspond to the robot performing it by itself, whereas we are interested in interpreting other people’s actions, which is a necessary skill to engage in social collaboration with humans. Second, this would make the evidence on the gesture features $G_{1}^{T}$ irrelevant, because in the model of Fig. 2 there is a tail-to-tail connection [47] from $G_{1}^{T}$ to the rest of the variables through the action variable, which means that, given the action, all dependencies to the gesture features are dropped.

The two cases 1), 2) enumerated above can be addressed separately when we do inference. In the first case, we call $X_{\text{inf}}^{\prime}$ the set of inference variables excluding the action $A$ , that is, $X_{\text{inf}}=\{X_{\text{inf}}^{\prime},A\}$ . We can write:

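$$\begin{aligned}
P_{\text{comb}}(X_{\text{inf}}\mid X_{\text{obs}},G_{1}^{T}) &= P_{\text{comb}}(A,X_{\text{inf}}^{\prime}\mid X_{\text{obs}},G_{1}^{T}) \\
&= \sum_{X_{\text{lat}}}P_{\text{comb}}(A,X_{\text{inf}}^{\prime},X_{\text{lat}}\mid X_{\text{obs}},G_{1}^{T}) \\
&= \sum_{X_{\text{lat}}}\left[P_{\text{BN}}(A,X_{\text{inf}}^{\prime},X_{\text{lat}}\mid X_{\text{obs}},G_{1}^{T})\,P_{\text{HMM}}(A,X_{\text{inf}}^{\prime},X_{\text{lat}}\mid X_{\text{obs}},G_{1}^{T})\right] \\
&= \left[\sum_{X_{\text{lat}}}P_{\text{BN}}(A,X_{\text{inf}}^{\prime},X_{\text{lat}}\mid X_{\text{obs}})\right]P_{\text{HMM}}(A\mid G_{1}^{T}) \\
&= P_{\text{BN}}(X_{\text{inf}}\mid X_{\text{obs}})\,P_{\text{HMM}}(A\mid G_{1}^{T}).
\end{aligned}$$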

This means that we can evaluate the two models independently, then multiply the distribution that we obtain from the BN (over all the possible values of the inference variables) by the HMM posterior for the corresponding value of the action.
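A minimal sketch of this first case follows. The function name `combine_action_inferred` and the dictionary layout are hypothetical; the optional renormalization is our addition so that the output sums to one, and is not part of the derivation above.

```python
def combine_action_inferred(bn_posterior, hmm_posterior, normalize=True):
    """Case 1: the action A is among the inference variables.

    bn_posterior: dict mapping (action, other_values) -> P_BN(A, X'_inf | X_obs).
    hmm_posterior: dict mapping action -> P_HMM(A | G).
    Each BN entry is multiplied by the HMM posterior of its action.
    normalize=True rescales the products to sum to one (an assumption
    added here for readability of the result).
    """
    combined = {
        (action, rest): prob * hmm_posterior[action]
        for (action, rest), prob in bn_posterior.items()
    }
    if normalize:
        total = sum(combined.values())
        combined = {key: p / total for key, p in combined.items()}
    return combined
```

Passing `normalize=False` returns the plain product of the two model outputs, which is enough when only the ranking of the joint assignments matters.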

In the second case, where the action is among the latent variables, we define, similarly, $X_{\text{lat}}=\{A,X_{\text{lat}}^{\prime}\}$ , and we have:

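$$\begin{aligned}
P_{\text{comb}}(X_{\text{inf}}\mid X_{\text{obs}},G_{1}^{T}) &= \sum_{A,X_{\text{lat}}^{\prime}}P_{\text{comb}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}},G_{1}^{T}) \\
&= \sum_{A,X_{\text{lat}}^{\prime}}\left[P_{\text{BN}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}},G_{1}^{T})\,P_{\text{HMM}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}},G_{1}^{T})\right] \\
&= \sum_{A,X_{\text{lat}}^{\prime}}\left[P_{\text{BN}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}})\,P_{\text{HMM}}(A\mid G_{1}^{T})\right] \\
&= \sum_{A}P_{\text{HMM}}(A\mid G_{1}^{T})\sum_{X_{\text{lat}}^{\prime}}P_{\text{BN}}(X_{\text{inf}},A,X_{\text{lat}}^{\prime}\mid X_{\text{obs}}) \\
&= \sum_{A}\left[P_{\text{HMM}}(A\mid G_{1}^{T})\,P_{\text{BN}}(X_{\text{inf}},A\mid X_{\text{obs}})\right].
\end{aligned}$$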

This time, we first need to use the BN to do inference on the variables $X_{\text{inf}}$ and $A$ , and then we marginalize out the action variable $A$ after having multiplied the probabilities by the HMM posterior.
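The second case can be sketched in the same style. Again, the function name `combine_action_latent` and the data layout are hypothetical, and the final renormalization is our addition rather than a step of the derivation.

```python
def combine_action_latent(bn_joint_posterior, hmm_posterior, normalize=True):
    """Case 2: the action A is latent and is marginalized out.

    bn_joint_posterior: dict mapping (action, value) -> P_BN(X_inf, A | X_obs).
    hmm_posterior: dict mapping action -> P_HMM(A | G).
    Each BN entry is weighted by the HMM posterior of its action, then
    entries sharing the same inferred value are summed over the action.
    normalize=True rescales the result to sum to one (an assumption
    added here, not part of the derivation).
    """
    totals = {}
    for (action, value), prob in bn_joint_posterior.items():
        totals[value] = totals.get(value, 0.0) + hmm_posterior[action] * prob
    if normalize:
        norm = sum(totals.values())
        totals = {value: p / norm for value, p in totals.items()}
    return totals
```

When the gesture recognizer is confident, the sum is dominated by a single action and the result approaches the BN prediction conditioned on that action; when it is uncertain, the predictions for the competing actions are blended.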

III-D Generation and Scoring of Verbal Descriptions

In order to illustrate the language capabilities of the model, rather than displaying the probability distribution of the words inferred by the model, we use the context-free grammar (CFG) described in Appendix A to generate written descriptions of the robot observations, on the basis of those probabilities. Note that this grammar is defined here with the only purpose of interpreting the probability distributions over the words. In the Affordance–Words model that we use, the speech recognizer is based on a free loop of words with uniform prior, and the Bayesian model relies on a bag-of-words assumption. No grammatical (syntactic) information about the spoken descriptions was, therefore, used during learning.

In the current study, by merging the Affordance–Words model and the gesture recognition model, we allow the robot to reinterpret the concepts it has learned in the self-centered phase, but we do not add any new words to the model. Consequently, the descriptions that the model generates when observing humans use the same words to describe the agent (see also Sec. V-E).

The textual descriptions are generated as follows: given some evidence $X_{\text{obs}}$ that we provide to the model and some human observation features $G_{1}^{t}$ extracted from frames $1$ to $t$ , we extract the generated word probabilities $P(w_{i}\mid X_{\text{obs}},G_{1}^{t})$ . We generate $N$ sentences randomly from the CFG using the HSGen tool from HTK [48]. Then, the sentences are re-scored according to the log-likelihood of each word in the sentence, normalized by the length of the sentence:

$$\text{score}(s_{j})=\frac{1}{L_{j}}\sum_{k=1}^{L_{j}}\log P(w_{jk}\mid X_{\text{obs}},G_{1}^{t})$$

where $s_{j}$ is the $j$-th sentence, $L_{j}$ is the number of words in the sentence $s_{j}$, and $w_{jk}$ is the $k$-th word in the sentence $s_{j}$. Finally, an $N$-best list of possible descriptions is produced by sorting the scores.
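The re-scoring step can be sketched in a few lines. The word probabilities below are hypothetical, and the probability floor that guards against taking the logarithm of zero is our own assumption for this sketch; the function names are illustrative as well.

```python
import math

def score_sentence(sentence, word_probs, floor=1e-12):
    """Length-normalized log-likelihood of the words in a sentence.

    word_probs maps each word to its inferred probability; words that are
    missing (or have probability 0) are clipped to `floor` (assumption).
    """
    words = sentence.split()
    total = sum(math.log(max(word_probs.get(w, 0.0), floor)) for w in words)
    return total / len(words)

def n_best(sentences, word_probs, n):
    """Return the n highest-scoring sentences, best first."""
    ranked = sorted(sentences, key=lambda s: score_sentence(s, word_probs), reverse=True)
    return ranked[:n]

# Hypothetical word probabilities given some evidence.
probs = {"the": 0.9, "robot": 0.9, "taps": 0.7, "touches": 0.1, "ball": 0.8}
candidates = ["the robot touches the ball", "the robot taps the ball"]
print(n_best(candidates, probs, 2))
```

Dividing by the sentence length keeps longer candidate sentences from being penalized merely for containing more words.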

IV Experimental Settings

Our experiments consist of testing our method on a number of example scenarios that will be described in Sec. V. In this section we provide experimental details and key assumptions of the method.

IV-A Affordance–Words Model

Table I presents a list of variables and the corresponding values used in the Affordance–Words model. Note that the names of the values of the affordance variables have been assigned by us arbitrarily to the clusters, for the sake of making the results more human-interpretable. However, the robot has no prior knowledge about the meaning of these clusters nor about their order, in case they correspond to ordered quantities. For extracting object features and effects from the sensory data, we assume that the robot possesses visual segmentation and geometric reasoning capabilities, meaning that it is able to segment the (potentially multiple) regions of interest corresponding to the physical objects of the world from the background (e.g., a planar surface such as a table) and to determine their positions.

We use the following notation in order to distinguish between the values of the affordance variables (all but the last row in Table I) and the words (last row in the table). Words and sentences are always enclosed in quotation marks. For example, “sphere” will refer to the spoken word, whereas sphere will refer to the value of the Shape variable corresponding to the specific cluster. Similarly, “grasp” will correspond to a spoken word, whereas grasp corresponds to a value of the Action variable.

There is no one-to-one correspondence between the values of the affordance variables and words. This partly emerged from the natural variability that is inherent in the way humans describe situations in spoken words. It was also a design choice, because we wanted to prove that the model was not merely able to recover simple word–meaning associations, but was able to cope with more natural spoken utterances. Consequently, in the spoken descriptions: (i) there are many synonyms for the same concept: for instance, cubic objects are called “box”, “square” or “cube”. Also, actions and effects are described using different tenses (“is grasping”, “grasped”, “has (just) grasped”); (ii) different affordance variable values may have the same associated verbal description, e.g., two color clusters corresponding to different shades of green are both referred to as “green”; (iii) finally, many affordance variable values have no direct description: for example, the object velocity and object–hand velocity (slow, medium, fast), or the object–hand contact (short, long) are never described directly, and need to be inferred from the situation.

The Affordance–Words model does not account for the concepts of parts of speech, verb tenses or temporal aspects explicitly. For example, the words “is”, “grasping”, “has”, “grasped”, “just”, and so on, are initially completely distinct and unrelated to the model, which has no prior information about what verbs, adjectives or nouns are, nor about similarity between words. It is only through the association with the other robot observations that the model realizes that “grasping” has the same meaning as “grasped”. The following three phrases, which were used interchangeably in the experiments, are mapped to exactly the same meaning, after learning: (i) “is grasping”, (ii) “has grasped”, (iii) “grasped”. Note that the model per se would be fully capable of distinguishing between those phrases, provided that they were used in different situations, which however was not the case in our experimental data.
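To make the bag-of-words assumption concrete, the following minimal sketch encodes an utterance as one boolean per vocabulary word. The vocabulary subset and the function name are hypothetical; the point is only that word order is discarded and that “grasping” and “grasped” start out as unrelated symbols.

```python
def bag_of_words(utterance, vocabulary):
    """Map an utterance to booleans marking which vocabulary words occur in it."""
    present = set(utterance.lower().split())
    return {word: (word in present) for word in vocabulary}

# Hypothetical subset of the vocabulary.
VOCABULARY = ["is", "has", "just", "grasping", "grasped", "the", "robot", "ball"]

a = bag_of_words("the robot is grasping the ball", VOCABULARY)
b = bag_of_words("the ball the robot is grasping", VOCABULARY)
c = bag_of_words("the robot has grasped the ball", VOCABULARY)

print(a == b)  # word order plays no role in the encoding
print([w for w in VOCABULARY if a[w] != c[w]])  # words on which the two phrasings differ
```

Under this encoding, any link between the two phrasings has to come from their co-occurrence with the same affordance observations, not from the words themselves.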

IV-B Gesture Recognition

In this work, we consider three independent, multiple-state Hidden Markov Models, each of them trained to recognize one of the considered manipulation gestures of Fig. 1. The 3D coordinates of the human limbs and torso used to extract the input to the gesture recognizer are obtained with a commodity depth sensor (Kinect).

Footnote 4: Currently, our gesture recognition algorithm relies on human skeleton tracking software from a depth stream. In our experience, the hand tracking is not reliable in the presence of a tabletop (i.e., partially occluded human) as in Fig. 1, so we record the same gestures twice, with and without the table: the recording without the table is used to ensure the robustness of the estimated hand coordinates, whereas the recording with the table is used throughout the rest of our model and experiments. We plan to overcome this limitation in future work.

V Results

In this section, we report the experimental findings obtained with our proposed model. Because it is based on Bayesian Networks, the model can make inferences over any set of its variables $X_{\text{inf}}$ , given any other set of observed variables $X_{\text{obs}}$ . In particular, the model can do reasoning on the elements that constitute our computational concept of affordances, i.e., Action, Object Features, Effects in Fig. 2. Furthermore, it can do reasoning over Words. We present the following types of results:

• inferences over affordance variables (see Table I) in Sec. V-A, V-B, V-C;

• predictions of word probabilities in Sec. V-D;

• verbal descriptions generated from the word probabilities of the previous point, according to a formal grammar. The descriptions, in turn, can be interpreted to observe the emergence of certain language phenomena: Sec. V-E, V-F, V-G.

V-A Action Recognition

In this experiment, we test the ability of our approach to recognize actions. Both the Affordance–Words model and the gesture recognition model can each perform inference of the Action variable individually: the former by using the variables of Tab. I, the latter by using human gesture features. We show how our method performs the inference over Action in a joint way. This includes dealing with information with different degrees of confidence, or conflicting information.

The scene of Fig. 3 contains a small ball which, after the manipulative action, exhibits a low velocity. Based on the evidence, the affordance model gives the highest probability $P_{\text{BN}}(A\mid X_{\text{obs}})$ to the action touch, which usually does not result in any movement of the object. However, in this particular simulated situation, we assume that the action performed by the human was an (unsuccessful) tap, that is, a tap that does not result in any movement of the object. In the simulation we show the effect of augmenting the inference with information from a gesture recognizer, that is, computing $P_{\text{comb}}(A\mid X_{\text{obs}},G_{1}^{T})$. We analyze the effect of varying the degree of confidence of the classifier. We start from a uniform posterior $P_{\text{HMM}}(A\mid G_{1}^{T})$, corresponding to a poor classifier, and gradually increase the probability of the correct action until it reaches $1$. In this particular example, in order to override the belief of the affordance model, the action recognition needs to be very confident ($P_{\text{HMM}}(A=\text{tap}\mid G_{1}^{T})>0.81$).
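The confidence sweep described above can be mimicked with a small simulation. The BN posterior below is a hypothetical stand-in (it is not the distribution of Fig. 3, so the crossover it yields differs from the $0.81$ reported above), and spreading the remaining HMM mass evenly over the other actions is an assumption of this sketch.

```python
import numpy as np

ACTIONS = ["grasp", "tap", "touch"]

def hmm_posterior(p_correct, correct_index, n_actions):
    """Posterior giving p_correct to one action and splitting the rest evenly."""
    post = np.full(n_actions, (1.0 - p_correct) / (n_actions - 1))
    post[correct_index] = p_correct
    return post

def combined_posterior(p_bn, p_hmm):
    """Product of the two action posteriors, renormalized."""
    prod = p_bn * p_hmm
    return prod / prod.sum()

def crossover_confidence(p_bn, correct_index, steps=1001):
    """Smallest swept HMM confidence at which the correct action wins."""
    n = len(p_bn)
    for p in np.linspace(1.0 / n, 1.0, steps):
        comb = combined_posterior(p_bn, hmm_posterior(p, correct_index, n))
        if np.argmax(comb) == correct_index:
            return float(p)
    return None

# Hypothetical affordance-model belief: touch is strongly preferred.
p_bn = np.array([0.05, 0.15, 0.80])
threshold = crossover_confidence(p_bn, ACTIONS.index("tap"))
print(f"tap overtakes touch once the HMM confidence reaches about {threshold:.3f}")
```

The more strongly the affordance model favors the competing action, the higher the confidence the gesture recognizer needs before the combined belief switches.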

V-B Effect Prediction

We now show how our approach performs inference over a variable other than Action (the one shared by the Affordance–Words model and the gesture model), namely how it predicts the value of the object velocity effect variable. We will do this by using different degrees of probabilistic confidence about the action, and analyzing the outcome in terms of velocity prediction. This experiment shows that all the variables of Tab. I jointly link robot and human, not only the Action variable, for the reasons expressed in Sec. III.

Fig. 4 shows the considered inference in two cases: when the prior information says that the shape is spherical (see Fig. 4a), and when it is cubic (see Fig. 4b).

The leftmost distribution in both figures shows the prediction of object velocity from the Affordance–Words model alone, without any additional information. When the shape is spherical, the model is not sure about the velocity, whereas if the shape is cubic, the model does not expect high velocities. If we add clear evidence on the action touch from the action recognition model, suddenly the combined model predicts slow velocities in both cases, as expected. However, if the action recognition evidence is gradually changed from touch to tap, the predictions of the model depend on the shape of the object. Higher velocities are expected for spherical objects that can roll, compared to cubic objects.

V-C Effect Anticipation

Since the gesture recognition method interprets sequences of human motions, we can test the predictive ability of the complete model when we observe an incomplete action. Fig. 5 shows an example of this where we reason about the expected object velocity caused by a tap action. Fig. 5a shows the action performed on a spherical object, whereas Fig. 5b on a cubic one. The graphs on the left side show the time evolution of the evidence $P_{\text{HMM}}(A\mid G_{1}^{t})$ from the gesture recognition model. In order to make the variations emerge more clearly, instead of the posterior, we show $\frac{1}{t}\log\mathcal{L}_{\text{HMM}}(G_{1}^{t}\mid A)$: the log-likelihood normalized by the length of the sequence. Note how, in both cases, the correct action is recognized by the model given enough evidence, although the observation sequence is not complete. The right side of the plot shows the prediction of the object velocity, given the incomplete observation of the action and the object properties. The model correctly predicts that the sphere will probably move but the box is unlikely to do so. Finally, the captions in the figure also show the verbal description (see Sec. III-D) generated by feeding the probability distribution of the words estimated by the model given the evidence into the context-free grammar.
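The length-normalized log-likelihood over growing prefixes of an observation sequence can be computed with the scaled forward algorithm. The sketch below is a simplification: it uses discrete emission symbols and two-state left-to-right models with hypothetical parameters, and it does not reproduce the gesture features or the trained models of our experiments.

```python
import numpy as np

def prefix_log_likelihoods(obs, start, trans, emit):
    """Return (1/t) * log L(obs[:t]) for every prefix length t = 1..len(obs).

    Scaled forward algorithm for a discrete-emission HMM:
    start[i] = P(state_1 = i), trans[i, j] = P(j | i), emit[i, o] = P(o | i).
    """
    alpha = None
    log_l = 0.0
    normalized = []
    for t, symbol in enumerate(obs, start=1):
        if alpha is None:
            alpha = start * emit[:, symbol]
        else:
            alpha = (alpha @ trans) * emit[:, symbol]
        scale = alpha.sum()  # probability of this symbol given the earlier ones
        log_l += np.log(scale)
        alpha = alpha / scale
        normalized.append(log_l / t)
    return np.array(normalized)

# Hypothetical models: both start alike, then differ in the second state.
START = np.array([1.0, 0.0])
TRANS = np.array([[0.7, 0.3], [0.0, 1.0]])
EMIT = {
    "tap": np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "touch": np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]),
}

observed = [0, 0, 1, 1, 1]  # hypothetical quantized gesture features
curves = {a: prefix_log_likelihoods(observed, START, TRANS, e) for a, e in EMIT.items()}
for t in range(len(observed)):
    print(t + 1, {a: round(float(c[t]), 3) for a, c in curves.items()})
```

In this toy example the two curves coincide for the first two frames and separate from the third frame on, which mirrors the idea that the action can be identified before the sequence is complete.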

V-D Prediction of Word Probabilities

Our model allows us to make predictions over the word variables associated with affordance evidence. In Fig. 6 we show the variation in word occurrence probabilities between two cases:

1. when the robot’s prior knowledge evidence consists of information about object features and effects only: {Size=big, Shape=sphere, ObjVel=fast};

2. when the evidence corresponds to the one of the previous point, with the addition of the tap action observed from the gesture recognizer (hard evidence).

In this result, we notice two facts. First, the probabilities of words related to tapping and pushing increase when a tapping action evidence from gesture recognition is introduced; conversely, the probabilities of other action words (touching and poking) decrease. Second, the probability of the word “rolling” (which is an effect of an action onto an object) also increases when the tap action evidence is entered.

V-E Verbal Descriptions and Choice of Synonyms

By generating and scoring natural language descriptions of what the robot observes (see Sec. III-D), we can provide evidence to the model and interpret the verbal results. Recall that, with our method, we do not add new words to the model when we observe the human performing actions. Rather, the human-readable descriptions that we generate are based on the same words that were present in the self-centered learning phase. In that phase, the agent of the observed actions was described as either “the robot”, “he”, or “Baltazar” (the name of the robot). Consequently, the Affordance–Words model learned by the robot includes those words as the subject of the action.

As an example, by providing the evidence {Color=yellow, Size=big, Shape=sphere, ObjVel=fast} to the model, we obtain the sentences reported in Table II. The higher the score, the better. In many of these sentences, we note that (i) the correct verb related to the tap action is generated (in the initial evidence, no action information was present, only object features and effects information were), and (ii) the object term “ball” or synonyms thereof (e.g., “sphere”) are used coherently, both in the first part of the sentence describing the action and in the second part describing the effect. The fact that different synonyms may be used in the same sentence is simply a consequence of the random generation of sentences, described in Sec. III-D, and of the fact that usually synonyms are assigned similar (but not necessarily equal) probabilities by the model, given the same evidence.

V-F Language Phenomenon: Choice of Correct Conjunction

The manipulation experiments that we consider have the following structure: an agent (human or robot) performs a physical action onto an object with certain properties, and this object will produce a certain physical effect as a result. For example, a touch action on an object yields no physical movement, but a tap does (especially if the object is spherical). In the language description associated to an experiment, it makes sense to measure the conjunction chosen by the model given specific evidence. In particular, it would be desirable to separate two kinds of behaviors: one in which the action and effect are coherent (expected conjunction: “and”), and the other one in which they are contradictory (“but”).

Fig. 7 shows an example of this behavior of the model. We give the same action value grasp to the model as evidence, but two different values for the final object velocity. When the object velocity is medium (Fig. 7a), the model interprets this as a successful grasp and uses the conjunction “and” to separate the description of the action from the description of the effect. When the object velocity is slow (in the clustering procedure, the velocity was most often zero in those cases), the model predicts that this is an unsuccessful grasp and uses the conjunction “but”, instead.

V-G Language Phenomenon: Description of Object Features

In Fig. 8, we show examples of verbal descriptions generated by the model given different values of observed evidence:

• $X_{\text{obs}}=\{\text{Action=grasp, Color=green1, Shape=box}\}$ (8a);

• $X_{\text{obs}}=\{\text{Action=touch, Color=green1, Shape=box}\}$ (8b);

• $X_{\text{obs}}=\{\text{Action=grasp, Color=green2, Shape=sphere}\}$ (8c);

• $X_{\text{obs}}=\{\text{Action=touch, Color=green2, Shape=sphere}\}$ (8d).

Note that the box object in the two first examples has a dark shade of green (value of Color affordance variable of Table I clustered as: green1), whereas the spherical one in the two last examples has a lighter shade (Color value: green2). However, the verbal descriptions reported in Fig. 8 all use the adjective “green”. This behavior emerges from the fact that the robot develops its perceptual symbols (clusters) in an early phase, and only subsequently associates them with the human vocabulary. We believe that this phenomenon is practical and potentially useful (i.e., the possibility that a low-level fine-grained robot representation can be abstracted into a high-level language description, which bundles the two shades of green under the same word).

VI Conclusions and Future Work

We presented a model that allows a robot to interpret and describe the actions of external agents, by reusing the knowledge previously acquired in an ego-centric manner. In a developmental setting, the robot first learns the link between words and object affordances by exploring its environment. Then, it uses this information to learn to classify the gestures and actions of another agent. Finally, by fusing the information from the two probabilistic models, in our experiments we show that the robot can reason over affordances and words when observing the other agent; this can also be leveraged to do early action recognition (see Sec. V-C). Although the complete model only estimates probabilities of single words given the evidence, we showed that feeding these probabilities into a pre-defined grammar produces human-interpretable sentences that correctly describe the situation. We also highlighted some interesting language-related properties of the model, such as: congruent/incongruent conjunctions, choice of appropriate synonym words, describing object features with general words.

Our demonstrations are based on a restricted scenario (see Sec. IV), i.e., one human and one robot manipulating simple objects on a shared table, a pre-defined number of motor actions and effects, and a vocabulary of approximately $50$ words to describe the experiments verbally. However, one of the main strengths of our study is that it spans different fields such as robot learning, language grounding, and object affordances. We also work with real robotic data, as opposed to learning images-to-text mappings (as in many works in computer vision) or using robot simulations (as in many works in robotics).

In terms of scalability, note that our BN model can learn both the dependency structure and the parameters of the model from observations. The method that estimates the dependency structure, in particular, is sensitive to biases in the data. Consequently, in order to avoid misconceptions, the robot needs to explore any possible situation that may occur. For example, if the robot only observes blue spheres rolling, it might infer that it is the color that makes the object roll, rather than its shape. In order to scale the method to a larger number of concepts, it would be necessary to scale the amount of data considerably, similarly to what is typically done in deep learning. In models of developmental robotics, where this is neither practically feasible, nor desirable, we would need to devise methods that can generalize more efficiently from very few observations.
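The blue-sphere example can be illustrated numerically. The sketch below uses empirical mutual information as a simple stand-in for a dependency score; it is not the structure learning method used for our BN, and the datasets are hypothetical. When color and shape always co-occur, the data cannot tell them apart as predictors of rolling.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in nats) between two discrete variables."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log((c * n) / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )

def dependence_on_rolling(data):
    """Return (MI(color; rolls), MI(shape; rolls)) for (color, shape, rolls) triples."""
    color = mutual_information([(c, r) for c, _, r in data])
    shape = mutual_information([(s, r) for _, s, r in data])
    return color, shape

# Biased exploration: every sphere is blue and every box is green.
biased = [("blue", "sphere", True)] * 10 + [("green", "box", False)] * 10
# Broader exploration: colors and shapes vary independently.
balanced = biased + [("blue", "box", False)] * 10 + [("green", "sphere", True)] * 10

print(dependence_on_rolling(biased))
print(dependence_on_rolling(balanced))
```

On the biased data both scores are identical, whereas on the balanced data only shape remains informative about rolling.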

As future work, it would be useful to investigate how the model can extract syntactic information from the observed data autonomously, thus relaxing the bag-of-words assumption in the current model. Another line of research would be to study how the model can guide the discovery of new acoustic patterns (e.g., [49, 50, 51]), and how to incorporate the newly discovered symbols into our Affordance–Words model. This would lift our current assumption of a pre-defined set of words.

Appendix A Grammar Definition

Below, we provide the grammar definition used to generate verbal descriptions from the probability distribution over words estimated by the model. Note, however, that no grammar was used during the learning phase: the speech recognizer used as a frontend to the spoken descriptions is based on a loop of words with no grammar, and the Affordance–Words model is based on a bag-of-words assumption, where only the presence or absence of each word in the description is considered. The symbol .|. represents alternative items, while the symbol [.] optional items. Non-terminal symbols are given between <.> in italics, while words (terminal symbols) are given in plain text and font: thus, the full set of words is given by all the plain text words below.

<agent> ::= the robot | he | baltazar

<touch> ::= touches | [has] [just] touched | is touching

<poke> ::= pokes | [has] [just] poked | is poking

<tap> ::= taps | [has] [just] tapped | is tapping

<push> ::= pushes | [has] [just] pushed | is pushing

<grasp> ::= grasps | [has] [just] grasped | is grasping

<pick> ::= picks | [has] [just] picked | is picking

<size> ::= big | small

<color> ::= green | yellow | blue

<shape> ::= sphere | ball | cube | box | square

<conjunction> ::= and | but

<inertmove> ::= is inert | is still | moves | is moving

<slideroll> ::= slides | is sliding | rolls | is rolling

<fallrise> ::= rises | is rising | falls | is falling
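Random sentence generation from rules of this kind can be sketched as follows. This is a stand-in for illustration, not the HSGen tool; only a subset of the rules above is included, and the top-level `<sentence>` rule (including the article "the" before the object) is a hypothetical template introduced for this example. Optional items are kept with probability 0.5, which is also an assumption.

```python
import random

GRAMMAR = {
    # Hypothetical top-level template, introduced only for this sketch.
    "<sentence>": [["<agent>", "<tap>", "the", "<size>", "<color>", "<shape>",
                    "<conjunction>", "the", "<shape>", "<slideroll>"]],
    "<agent>": [["the", "robot"], ["he"], ["baltazar"]],
    "<tap>": [["taps"], ["[has]", "[just]", "tapped"], ["is", "tapping"]],
    "<size>": [["big"], ["small"]],
    "<color>": [["green"], ["yellow"], ["blue"]],
    "<shape>": [["sphere"], ["ball"], ["cube"], ["box"], ["square"]],
    "<conjunction>": [["and"], ["but"]],
    "<slideroll>": [["slides"], ["is", "sliding"], ["rolls"], ["is", "rolling"]],
}

def expand(symbol, rng):
    """Expand a symbol into a list of words, choosing alternatives at random."""
    if symbol.startswith("[") and symbol.endswith("]"):
        # Optional item: keep it with probability 0.5 (assumption).
        return [symbol[1:-1]] if rng.random() < 0.5 else []
    if symbol not in GRAMMAR:
        return [symbol]  # terminal word
    words = []
    for item in rng.choice(GRAMMAR[symbol]):
        words.extend(expand(item, rng))
    return words

def generate(n, seed=0):
    """Generate n random sentences from the <sentence> rule."""
    rng = random.Random(seed)
    return [" ".join(expand("<sentence>", rng)) for _ in range(n)]

for sentence in generate(5):
    print(sentence)
```

Because each `<shape>` occurrence is expanded independently, a sentence may name the same object with two different synonyms, which is the behavior discussed in Sec. V-E.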

Bibliography

[1] Victor Witter Turner “Dramas, Fields, and Metaphors: Symbolic Action in Human Society” Cornell University Press, 1975
[2] Celia A. Brownell, Geetha B. Ramani and Stephanie Zerwas “Becoming a social partner with peers: Cooperation and social understanding in one- and two-year-olds” In Child Development 77.4, 2006, pp. 803–821 DOI: 10.1111/j.1467-8624.2006.t01-1-.x-i1
[3] Alicia P. Melis and Dirk Semmann “How is human cooperation different?” In Philosophical Transactions of the Royal Society of London B: Biological Sciences 365.1553, 2010, pp. 2663–2674 DOI: 10.1098/rstb.2010.0157
[4] Narender Ramnani and R. Miall “A system in the human brain for predicting the actions of others” In Nature Neuroscience 7.1, 2004, pp. 85–90 DOI: 10.1038/nn1168
[5] Cynthia Breazeal “Designing Sociable Robots” MIT Press, 2002
[6] Naoto Iwahashi “Robots that learn language: A developmental approach to situated human–robot conversations” In Human Robot Interaction InTech, 2007 DOI: 10.5772/5188
[7] Luc Steels “Evolving grounded communication for robots” In Trends in Cognitive Sciences 7.7, 2003, pp. 308–312 DOI: 10.1016/S1364-6613(03)00129-3
[8] Max Lungarella, Giorgio Metta, Rolf Pfeifer and Giulio Sandini “Developmental robotics: a survey” In Connection Science 15.4, 2003, pp. 151–190 DOI: 10.1080/09540090310001655110