Deep learning control of artificial avatars in group coordination tasks
Maria Lombardi, Davide Liuzza, Mario di Bernardo

TL;DR
This paper presents a deep reinforcement learning-based control architecture for artificial agents to achieve human-like coordination within groups during joint tasks, addressing a less explored multi-agent scenario.
Contribution
It introduces a novel deep learning control method enabling artificial agents to synchronize with human groups in multi-agent coordination tasks.
Findings
Successful synthesis of artificial agents that coordinate with human groups
Agents exhibit human-like kinematic features during group tasks
Demonstrated effectiveness on the mirror-game benchmark
1 M. Lombardi is with the Department of Engineering Mathematics, University of Bristol, UK, and with the Department of Electrical Engineering and Information Technology, University of Naples “Federico II”, Italy.
2 D. Liuzza is with the Department of Engineering, University of Sannio, Benevento, Italy.
3 M. di Bernardo is with the Department of Engineering Mathematics, University of Bristol, UK, and with the Department of Electrical Engineering and Information Technology, University of Naples “Federico II”, Italy.
Abstract
In many joint-action scenarios, humans and robots have to coordinate their movements to accomplish a given shared task. Lifting an object together, sawing a wood log, transferring objects from a point to another are all examples where motor coordination between humans and machines is a crucial requirement. While the dyadic coordination between a human and a robot has been studied in previous investigations, the multi-agent scenario in which a robot has to be integrated into a human group still remains a less explored field of research. In this paper we discuss how to synthesise an artificial agent able to coordinate its motion in human ensembles. Driven by a control architecture based on deep reinforcement learning, such an artificial agent will be able to autonomously move itself in order to synchronise its motion with that of the group while exhibiting human-like kinematic features. As a paradigmatic coordination task we take a group version of the so-called mirror-game which is highlighted as a good benchmark in the human movement literature.
I INTRODUCTION
Interest in using robots and artificial avatars in joint tasks with humans to reach a common goal is growing rapidly. Indeed, it is possible to find numerous applications that span from industrial tasks to entertainment, from navigation to orientation and so on, in which artificial agents are required to interact cooperatively with people. Examples of human-robot interaction include the problem of jointly handling an object [1], sawing a log of wood [2], managing a common work-piece in a production system [3], or performing a “pick and place” coordination task [4]. While dyadic coordination between humans and robots is the subject of much ongoing research, the problem of having robots or avatars interacting with a human team remains a seldom investigated field. This is probably due to the complex mechanisms underlying interpersonal coordination, the different ways in which coordination can emerge in human groups, and the potentially large amount of data to be collected and processed in real-time.
From a control viewpoint, the emergence of multi-agent synchronisation while performing a joint task is a phenomenon characterised by non-linear dynamics, in which each individual has to predict what the others are going to do and adjust his/her movements to complement theirs, so as to achieve precise and accurate temporal correspondence [5].
In this context, an open question is whether it is possible to influence the emergent group behaviour by introducing artificial agents that are accepted in a natural way by the group and help it achieve a collective control goal. To reach such a goal, the artificial agent has to integrate its motion with that of the others while exhibiting typical human-like kinematic features. In this way, the artificial agent can merge with the rest of the group and enhance, rather than disrupt, social attachment between its members and group cohesiveness.
The problem is crucial in social robotics, where new advancements in human-robot interaction can promote novel diagnostic and rehabilitation strategies for patients suffering from social and motor disorders [6].
In this work we define a “human-like” motion by using the concept of “Individual Motor Signature” (IMS), proposed in [7] as a valid biomarker able to capture the peculiarity of the human motion. Specifically, the IMS has been defined in terms of the probability density function (PDF) of the velocity profiles characterising a specific joint task.
The aim of this paper is to present a control architecture based on deep learning to drive an artificial agent able to perform a joint task in a multi-agent scenario while exhibiting a desired IMS. As a scenario of interest, we take a multiplayer version of the mirror game proposed in [8] as a paradigmatic task of interpersonal motor coordination. In our version of the mirror game, first proposed in [7], a group of players is asked to oscillate a finger sideways, performing some interesting motion and synchronising it with that of the others (see [9] for further details).
The approach we follow is an extension to groups of the strategy we presented in [10, 11, 12] for dyadic interactions. Specifically, in [12] we designed an autonomous cyber player able to play dyadic leader-follower sessions of the mirror game with different human players. Such a cyber player was driven by a Q-learning algorithm aiming at exhibiting the kinematic features of a target human player, in order to emulate her/his way of moving when engaged in a dyadic interaction.
Extending the Q-learning approach to multi-agent systems is cumbersome, as the approach does not scale with the growth of the system state space caused by the addition of other players. To overcome this limitation, we use “deep reinforcement learning” [13, 14, 15], combining the reinforcement learning strategy with the powerful generalization capabilities of neural networks. To design the control architecture of the cyber player (CP), we collect motor measurement signals of four different players involved in a joint oscillatory task and then train the CP to mimic the way of moving of one of them. The validation is done by replacing the target player with the cyber player and comparing the group performance, in order to prove the effectiveness of the proposed control approach.
II PRELIMINARIES
A group of people interacting with each other can be described as a complex network system in which each individual is represented as a node (or agent) with its own dynamics, while the visual couplings with the other members of the group are represented as edges in a graph describing the network of their interactions. The structure of the interconnections established among the group's members is formalised by the adjacency matrix $A = [a_{ij}]$, in which the element $a_{ij} = 1$ only if node $i$ is linked with node $j$, or in other terms if individual $i$ is visually coupled with individual $j$.
Four different topologies are considered in this work as shown in Fig. 1. As described in [9] these different topologies can be implemented experimentally by changing the way in which participants sit with respect to each other and by asking them to wear appropriate goggles restricting their field-of-view.
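As an illustration, the adjacency matrices of the four topologies named in this paper (complete, ring, path and star) can be built as in the minimal Python sketch below. The function names, the node ordering and the assumption of undirected (symmetric) visual coupling are ours and are not taken from the experimental setup of [9].

```python
def adjacency(topology, n, center=0):
    """Return an n x n symmetric 0/1 adjacency matrix without self-loops.

    Assumption: visual coupling is mutual, so the graph is undirected.
    """
    a = [[0] * n for _ in range(n)]

    def link(i, j):
        a[i][j] = a[j][i] = 1

    if topology == "complete":
        for i in range(n):
            for j in range(i + 1, n):
                link(i, j)
    elif topology == "ring":
        for i in range(n):
            link(i, (i + 1) % n)
    elif topology == "path":
        for i in range(n - 1):
            link(i, i + 1)
    elif topology == "star":
        for i in range(n):
            if i != center:
                link(i, center)
    else:
        raise ValueError(f"unknown topology: {topology}")
    return a


def neighbours(a, i):
    """Indices of the agents that agent i is coupled with."""
    return [j for j, linked in enumerate(a[i]) if linked]
```

For a group of four players, `neighbours(adjacency("ring", 4), 0)` lists the two players seen by player 0, whereas in the star graph only the central node sees everybody else.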
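Let $\theta_k(t)$ be the phase of the $k$th agent estimated by taking the Hilbert transform of its position signal, say $x_k(t)$. The cluster phase of $N$ agents is defined as $q(t) := \operatorname{atan2}\left(\Im(q'(t)), \Re(q'(t))\right)$, with $q'(t) := \frac{1}{N}\sum_{k=1}^{N} e^{j\theta_k(t)}$, which represents the average phase of the group at time $t$. The term $\phi_k(t) := \theta_k(t) - q(t)$ is the relative phase between the $k$th agent and the group phase at time $t$.
The level of coordination reached by a human group performing an oscillatory task can be investigated by evaluating the group synchronisation index introduced in [9, 16] and defined as follows:
$$\rho_g(t) := \frac{1}{N}\left|\sum_{k=1}^{N} e^{j\left(\phi_k(t) - \bar{\phi}_k\right)}\right| \in [0, 1] \qquad (1)$$
where $\bar{\phi}_k$ is $\phi_k$ averaged over time. The closer the synchronisation index is to $1$, the higher the level of synchronisation in the group.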
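A minimal Python sketch of the group synchronisation index is given below, assuming the cluster-phase form used in [9, 16]: the group phase is the angle of the mean unit phasor, each agent's relative phase is its phase minus the group phase, and the index is the modulus of the mean phasor of the relative phases once their time averages are removed. The phases are assumed to have been extracted beforehand (for example via the Hilbert transform); the function name and the use of a circular mean for the time average are our assumptions.

```python
import cmath


def group_sync_index(phases):
    """phases[k][t]: phase (radians) of agent k at sample t.

    Returns the synchronisation index for every sample, each value in [0, 1].
    """
    n_agents = len(phases)
    n_samples = len(phases[0])

    # Cluster (group) phase: angle of the mean unit phasor at each sample.
    q = [cmath.phase(sum(cmath.exp(1j * phases[k][t]) for k in range(n_agents)))
         for t in range(n_samples)]

    # Relative phase of each agent w.r.t. the group, stored as a unit phasor.
    rel = [[cmath.exp(1j * (phases[k][t] - q[t])) for t in range(n_samples)]
           for k in range(n_agents)]

    # Time-averaged relative phase of each agent (circular mean, an assumption).
    rel_mean = [cmath.phase(sum(row)) for row in rel]

    return [abs(sum(rel[k][t] * cmath.exp(-1j * rel_mean[k])
                    for k in range(n_agents))) / n_agents
            for t in range(n_samples)]
```

Agents oscillating at the same frequency with constant phase offsets give an index equal to one at every sample, since their relative phases never drift from their time averages.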
III DEEP REINFORCEMENT LEARNING APPROACH
III-A Brief overview
Reinforcement learning is a machine learning technique in which an agent tries to learn how to behave in an unknown environment by taking, in any situation, the best action that it can perform. The problem is formalized by considering a set of all possible states in which the environment can be (state-space), a set of all possible actions that the agent can take (action-space) and an auxiliary function, named action-value function (or Q-function), that quantifies the expected return (reward) starting from a specific state and taking a specific action. Through the action-value function, the goal of the learning agent is to iteratively refine its policy in order to maximise the expected reward. Solving a problem with the classical Q-learning approach [17] means iteratively exploring all possible combinations of states and actions in order to evaluate them in terms of action-value functions in a tabular form. As this is unfeasible in our group scenario, we use the deep learning control approach shown in Fig. 2 where:
- the state space collects the position and velocity of the CP, together with the mean position and mean velocity of the players connected to the target player;
- the action space is made up of a finite set of acceleration values within a bounded range, empirically chosen by looking at the typical human accelerations while performing the same joint tasks;
- the reward function is a negative weighted sum of the squared position and velocity errors between the CP and the target player and of a control-effort term, whose weight is a constant parameter. Maximising a reward function so designed means minimising the squared error both in position and in velocity between the CP and the target player;
- the policy according to which the CP chooses the action to take in a specific state is an $\epsilon$-greedy policy [13]. Following that policy, the CP takes the best known action with probability $1 - \epsilon$ (exploitation), whereas with probability $\epsilon$ it takes a random action (exploration). The value of $\epsilon$ follows a monotonic decreasing function, since as time increases the exploration phase is replaced by the exploitation phase.
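The reward and the exploration policy described in the list above can be sketched in Python as follows. The unit weights on the tracking errors, the value of `eta`, the exponential decay schedule and all function names are hypothetical choices made only for illustration; the paper states only that the exploration probability decreases monotonically.

```python
import math
import random


def epsilon(step, eps_start=1.0, eps_end=0.05, decay=1e-3):
    """Monotonically decreasing exploration rate (exponential decay, assumed)."""
    return eps_end + (eps_start - eps_end) * math.exp(-decay * step)


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action index with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def reward(x, v, x_target, v_target, u, eta=0.01):
    """Negative squared tracking errors minus a control-effort penalty.

    The weights (1, 1, eta) are hypothetical placeholders.
    """
    return -((x - x_target) ** 2) - (v - v_target) ** 2 - eta * u ** 2
```

With `eps = 0` the policy is purely greedy, so `epsilon_greedy([0.1, 0.9, 0.3], 0.0)` always selects the action with the largest estimated value.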
In particular, we exploit the Deep Q-network (DQN) strategy, where an artificial neural network (ANN) is used to approximate the optimal action-value function defined as:
$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s,\ a_t = a,\ \pi \,\right] \qquad (2)$$
which maximises the expected value of the sum of the rewards $r_t$ discounted by a positive factor $\gamma$, obtained taking the action $a$ in the state $s$ following the policy $\pi$ at any time instant $t$.
Training an ANN in order to approximate a desired function (Q-function) means finding the vector of network weights $\theta$ of the connections between the neurons, iteratively evaluated by back-propagation algorithms in order to minimise a loss function. The loss function is used to measure the error between the actual and the predicted output of the neural network (e.g. mean squared error) (see [18] for further details).
Contrary to what is done in traditional supervised learning with ANNs, where the predicted output is well defined, in the Deep Q-network approach the loss function is iteratively changed because the predicted output itself depends on the network parameters $\theta_i$ at every instant $i$. Namely, the loss function is chosen as:
$$L_i(\theta_i) = \mathbb{E}\left[\left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2\right] \qquad (3)$$
where $s'$ is the next state and $\theta_i^-$ denotes the weights of the target network. The loss represents the mean squared error between the current estimated function $Q$ and the approximate optimal action-value function.
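As a hedged illustration, the loss can be evaluated on a batch of transitions as in the Python sketch below. It assumes the standard DQN form from the literature cited as [13, 14], namely the squared difference between the bootstrapped target (reward plus discounted maximum of the target network's action values in the next state) and the current estimate. The callables `q_online` and `q_target` and the tuple layout of a transition are hypothetical.

```python
def td_targets(batch, q_target, gamma):
    """Bootstrapped targets for a batch of transitions.

    batch: list of (state, action_index, reward, next_state) tuples.
    q_target(state): list of action values from the slowly updated network copy.
    """
    return [r + gamma * max(q_target(s_next)) for (_, _, r, s_next) in batch]


def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared error between the targets and the current estimates."""
    targets = td_targets(batch, q_target, gamma)
    errors = [y - q_online(s)[a] for y, (s, a, _, _) in zip(targets, batch)]
    return sum(e * e for e in errors) / len(errors)
```

No terminal-state handling is included, on the assumption that the oscillatory task is treated as a continuing one.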
It has been proved that an ANN with a single hidden layer containing a large enough number of sigmoid units can approximate any continuous function, while a second layer is added to improve accuracy [19]. Relying on that, the neural network we considered to approximate the action-value function in (2) is designed as a feedforward network with (Fig. 2):
- an input layer with one node for each state variable;
- two hidden layers, whose sizes were found empirically, each implementing a sigmoidal activation function;
- an output layer with one node for each action available in the action set. In the DQN, the network output returns the estimated action-value for each possible action in a single shot, reducing in this way the time needed for the training. Then, the action corresponding to the maximum q-value (neuron’s output) is chosen as the next control input.
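The structure of such a network can be sketched in plain Python as follows. The layer sizes (4 inputs, 16 and 16 hidden units, 9 outputs), the weight initialisation range and the linear output layer are hypothetical choices for illustration only and do not reproduce the values used in the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, layer, activation=None):
    """Affine map followed by an optional element-wise activation."""
    weights, biases = layer
    out = [sum(w * x for w, x in zip(row, inputs)) + b
           for row, b in zip(weights, biases)]
    return [activation(z) for z in out] if activation else out


def q_values(state, layers):
    """Two sigmoidal hidden layers followed by a linear output layer (assumed)."""
    h1 = dense(state, layers[0], sigmoid)
    h2 = dense(h1, layers[1], sigmoid)
    return dense(h2, layers[2])


rng = random.Random(0)
sizes = [4, 16, 16, 9]  # hypothetical: 4 state variables, 9 candidate accelerations
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
q = q_values([0.1, -0.2, 0.0, 0.3], layers)
best_action = max(range(len(q)), key=q.__getitem__)  # greedy control input index
```

A single forward pass yields one estimated action value per candidate acceleration, so the greedy action is found with one `max` over the output vector.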
Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator, such as an ANN, is used to estimate the Q-function [13, 14]. According to the existing literature, this instability is caused by: (i) the presence of correlation in the sequence of observed states and (ii) the presence of correlation between the current estimated Q-function and the target one, resulting in the loss of the Markov property. The correlation in the observation sequence is removed by introducing the experience replay mechanism, where the observed states used to train the ANN are not taken sequentially but are sampled randomly from a circular buffer. Also, the correlation between the current estimate of the Q-function and the target optimal one is reduced by updating the latter at a slower rate.
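The two stabilising mechanisms can be sketched in Python as below: a circular buffer with uniform random sampling, and a target network refreshed only every fixed number of steps. The class and function names, the tuple layout and the representation of weights as nested lists are our assumptions; buffer capacity, batch size and update period are left as parameters.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size circular buffer: the oldest sample is dropped when full."""

    def __init__(self, capacity, seed=None):
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniform random batch, which breaks the temporal correlation."""
        k = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), k)

    def __len__(self):
        return len(self._data)


def maybe_sync_target(step, online_weights, target_weights, period):
    """Copy the online weights into the target network every `period` steps."""
    if step % period == 0:
        return [row[:] for row in online_weights]
    return target_weights
```

Using `deque(maxlen=capacity)` gives the circular behaviour for free, since appending to a full deque silently discards the oldest element.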
III-B CP Implementation
According to the reinforcement learning strategy with the Deep Q-network described above, the CP refines its policy according to the system states and the reward received, so as to take the best action it can to mimic the target player(s). As a first step in implementing the DQN, a feedforward neural network is initialized with random weights. The experience replay mechanism is implemented by instantiating an empty circular buffer in order to store the system’s state at each iteration.
Then at each iteration we have (Fig. 2):
1. the CP observes the process’s state at the current time instant and performs an action according to the policy, that is an $\epsilon$-greedy policy;
2. the process evolves to a new state and the CP receives the reward that measures how good taking that action in that state has been;
3. the new sample (state, action, reward and new state) is added to the circular buffer and a random batch taken from it is used to train the NN. The training is done through the gradient descent back-propagation algorithm with momentum [13], so as to tune the network’s weights in order to minimize the loss function (3). The network weights connecting one layer to the next are updated in this way at every time instant.
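The weight update used in the training step above can be illustrated with a minimal Python sketch of gradient descent with momentum. The learning rate, the momentum coefficient and the toy quadratic objective are hypothetical; in the actual training the gradients would come from back-propagating the DQN loss through the network.

```python
def momentum_step(weights, grads, velocity, lr=0.01, mu=0.9):
    """One gradient-descent-with-momentum update on flat lists of floats."""
    new_velocity = [mu * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy usage: minimise f(w) = w^2, whose gradient is 2w.
w, v = [1.0], [0.0]
for _ in range(200):
    w, v = momentum_step(w, [2.0 * w[0]], v)
```

The velocity term accumulates past gradients, which smooths the noisy updates produced by randomly sampled batches.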
The steps above are repeated until convergence is achieved according to the “termination criterion”:
[TABLE]
where is the root mean square error between the position of the player and the mean position of the group, while is a non-negative parameter.
IV TRAINING AND VALIDATION
IV-A Training
Ideally, data used to train the CP are extracted from real human players playing the mirror game. In our case, due to the lack of a large enough dataset, the data needed to feed the CP during the training are generated synthetically making artificial agents modelling human players perform sessions of the mirror game against each other. We refer to these other artificial agents as Virtual Players (VP) to distinguish them from the CP since they are driven by a completely different architecture which is not based on AI and was presented in [10] and improved in [11]. The use of virtual players for training AI based CPs was first proposed in [12] for dyadic interaction and is applied here for the first time to the multi-player version of the game.
Specifically, the motion of the virtual player used to generate synthetic data is that of a controlled nonlinear HKB oscillator [20] of the form:
$$\ddot{x} + \left(\alpha x^2 + \beta \dot{x}^2 - \gamma\right)\dot{x} + \omega^2 x = u \qquad (5)$$
where $x$, $\dot{x}$ and $\ddot{x}$ are position, velocity and acceleration of the VP end effector, respectively, $\alpha, \beta, \gamma$ are positive empirically tuned damping parameters while $\omega$ is the oscillation frequency. The control input $u$ is chosen following an optimal control strategy aiming at minimising the following cost function [21]:
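$$J = \frac{\theta_p}{2}\left(x(t_{k+1}) - \hat{r}_p(t_{k+1})\right)^2 + \frac{1}{2}\int_{t_k}^{t_{k+1}} \left[\theta_\sigma\left(\dot{x}(\tau) - \dot{\sigma}(\tau)\right)^2 + \theta_v\left(\dot{x}(\tau) - \hat{r}_v(\tau)\right)^2 + \eta\, u(\tau)^2\right] d\tau \qquad (6)$$
where $\hat{r}_p$ is the position and $\hat{r}_v$ the velocity time series of the partner player, $\sigma$ is the reference signal modelling the desired human motor signature, $\eta$ tunes the control effort, and $t_k$ and $t_{k+1}$ represent the current and the next optimization time instant. $\theta_p, \theta_\sigma, \theta_v$ are positive control parameters satisfying the constraint $\theta_p + \theta_\sigma + \theta_v = 1$. By appropriately tuning these parameters, it is possible to change the VP configuration, making it act as a leader, follower or joint improviser (more details are in [10, 21]). In the case of a multi-player scenario, $\hat{r}_p$ and $\hat{r}_v$ are taken as the mean value of the position and the velocity of the target player’s neighbours, that is:
$$\hat{r}_p = \frac{1}{M}\sum_{i=1}^{M} x_i, \qquad \hat{r}_v = \frac{1}{M}\sum_{i=1}^{M} \dot{x}_i \qquad (7)$$
where $M$ is the number of neighbours and $x_i$ and $\dot{x}_i$ are the position and the velocity of the $i$th neighbour.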
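To make the VP dynamics concrete, the Python sketch below integrates a controlled HKB oscillator, assuming the usual form in which the acceleration equals the control input minus a nonlinear damping term (a van der Pol plus Rayleigh coefficient multiplying the velocity) and a linear restoring term. The parameter values, the time step and the semi-implicit Euler scheme are hypothetical, and the optimal control input is replaced by zero for simplicity. The helper `neighbour_mean` computes the mean of the neighbours' positions or velocities used as partner signal in the multi-player case.

```python
def hkb_step(x, v, u, dt, alpha=1.0, beta=1.0, gamma=1.0, omega=1.0):
    """One semi-implicit Euler step of the controlled HKB oscillator.

    Assumed form: acc = u - (alpha*x^2 + beta*v^2 - gamma)*v - omega^2 * x.
    """
    acc = u - (alpha * x * x + beta * v * v - gamma) * v - omega ** 2 * x
    v_next = v + dt * acc
    x_next = x + dt * v_next
    return x_next, v_next


def neighbour_mean(values):
    """Mean position (or velocity) over the target player's neighbours."""
    return sum(values) / len(values)


# Uncontrolled oscillator (u = 0) started close to rest.
x, v = 0.1, 0.0
trajectory = []
for _ in range(5000):
    x, v = hkb_step(x, v, u=0.0, dt=0.01)
    trajectory.append(x)
```

With these placeholder parameters the negative damping at small amplitude drives the state away from rest and onto a bounded periodic orbit, which is the oscillatory behaviour the VP model relies on.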
The reference signal captures the desired human kinematic features that the VP has to exhibit during the game. In [11] we developed a methodology based on the theory of stochastic processes and observational learning to generate human-like trajectories in real time. In particular, a Markov Chain (MC) was derived to capture the peculiar internal description model of the motion of a human player simply by observing him/her playing sessions of the mirror game in isolation.
To train the CP to coordinate its movements in the group like the target virtual player does, a group of four virtual players interconnected in a complete graph was used (Fig. 3). In particular, we selected four Markov models (one for each player) of different human players, which were parametrized in [11]. Without loss of generality, one of the VPs was taken as the target player that the deep-learning-driven CP has to mimic.
The parameters proposed for the control architecture of the VPs were tuned experimentally as follows: for the inner dynamics, and for the control law. The experience replay in the CP algorithm was implemented with a buffer of elements, batches of sampled states were used to train the feedforward neural network at each iteration. A target network updated every time steps was considered in the Q-function, with a discount factor .
The training stage was carried out on a Desktop computer having an Intel Core i7-6700 CPU, 16 GB of RAM and 64-bit Windows operative system. It took trials of observations each to converge (around hours) according to the criterion (4). In Fig. 4 the training curve is reported showing for each trial the RMS between the VP and the group (in blue), and between the CP and the group (in red). Convergence is reached in about trials on.
IV-B Validation
To show that the CP is effectively able to emulate the target VP when engaged in a group, we trained it on a group described by a complete graph and then validated its performance when interacting over different topologies. Specifically, we used the ring graph, the path graph and the star graph described in Sec. II (the latter with one designated node as center). For the sake of brevity, only the validation for the complete and the path graph is reported in Fig. 5. The performance of the CP has been evaluated by comparing its behaviour with that of the target virtual player it was trained to mimic. The CP (in red) successfully tracks the mean position of the group (dashed black line) and mimics the target player it has been trained to imitate (in blue) [panel (a)]. The relative position error (RPE) defined as
[TABLE]
has been also evaluated between the VP target and the mean position of the neighbours and compared with the relative position error between the CP and the same mean position [panel (c)]. Both the errors are very small and with comparable mean values.
Similar considerations hold for the behaviour of the CP when the group topology is a path graph, as shown in panels (b) and (d).
The key features of the motion of the CP and the VP it has been trained upon are captured by the following metrics: 1) relative phase error defined as , 2) the RMS error between the position of the CP (or VP) and that of the group mean position, and 3) the time lag which describes the amount of time shift that achieves the maximum cross-covariance between the two position time series. This can be interpreted as the average reaction time of the player in the mirror game [9].
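Two of the metrics above, the RMS error and the time lag, can be computed as in the following Python sketch. The function names, the sign convention (a positive lag means the first series follows the second) and the biased normalisation of the cross-covariance are our assumptions.

```python
import math


def rms_error(a, b):
    """Root mean square of the sample-wise difference between two series."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def time_lag(a, b, dt, max_shift):
    """Shift (in seconds) of `a` relative to `b` maximising the cross-covariance.

    A positive result means that `a` lags behind `b`.
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    a0 = [p - mean_a for p in a]
    b0 = [q - mean_b for q in b]
    n = len(a0)

    def cross_cov(shift):
        # Pair a[t + shift] with b[t] over the overlapping samples.
        lo, hi = max(0, -shift), min(n, n - shift)
        return sum(a0[t + shift] * b0[t] for t in range(lo, hi)) / n

    best = max(range(-max_shift, max_shift + 1), key=cross_cov)
    return best * dt
```

For periodic signals, `max_shift` should stay below half an oscillation period, otherwise several shifts can yield nearly identical cross-covariance values.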
The metrics described above were evaluated performing trials and reporting both the mean value and the standard deviation for both the complete graph (Tab. I(a)) and the path graph (Tab. I(b)). It is possible to notice that all indexes show a remarkable degree of similarity between the motion of the CP and that of the target VP.
For further evidence, the group level synchronization is reported in Fig. 6 for each tested topology. Despite the different topologies, the presence of the CP does not alter the group dynamics when the CP replaces the VP it was trained upon. We notice that the level of coordination varies with the topology, confirming what was found in [9]. Specifically, as found in [9] and confirmed by Fig. 6, the complete and the star graphs are associated with the highest level of synchronization.
V CONCLUSIONS
In this work we addressed the problem of synthesizing an autonomous artificial agent able to coordinate its movements and perform a joint motor task in a group setting. In particular, a multiplayer version of the mirror game was used as a paradigmatic task where different individuals have to synchronize their oscillatory motion. To achieve our goal and overcome the limitations of previous approaches, we introduced a deep reinforcement learning control algorithm in which a feedforward neural network is used to approximate the nonlinear action-value function. The DQN allowed us to overcome the limitations of the Q-learning approach presented in [12] which is impractical when the state space becomes too large, as in the case of multiplayer coordination tasks. The effectiveness of the cyber player trained upon a target group member was shown by comparing its performance when playing in groups with different interconnection topologies. The numerical validations show the effectiveness of our approach. Ongoing work is being carried out to validate the behaviour of the CP when interacting with a real group of people through the experimental platform Chronos we presented in [22].
REFERENCES
[1] A. Edsinger and C. C. Kemp, “Human-Robot Interaction for Cooperative Manipulation: Handing Objects to One Another,” in IEEE International Symposium on Robot and Human Interactive Communication, 2007, p. e 9825497.
[2] L. Peternel, T. Petrič, E. Oztop, and J. Babič, “Teaching robots to cooperate with humans in dynamic manipulation tasks based on multi-modal human-in-the-loop approach,” Autonomous Robots, vol. 36, no. 1-2, pp. 123–136, 2014.
[3] M. Faber, J. Bützler, and C. M. Schlick, “Human-robot Cooperation in Future Production Systems: Analysis of Requirements for Designing an Ergonomic Work System,” Procedia Manufacturing, vol. 3, pp. 510–517, 2015.
[4] M. Lamb, T. Lorenz, S. J. Harrison, R. W. Kallen, A. Minai, and M. J. Richardson, “PAPAc: A Pick and Place Agent Based on Human Behavioral Dynamics,” Proceedings of the 5th International Conference on Human Agent Interaction - HAI ’17, pp. 131–141, 2017.
[5] C. Vesper, S. Butterfill, G. Knoblich, and N. Sebanz, “A minimal architecture for joint action,” Neural Networks, vol. 23, no. 8-9, pp. 998–1003, 2010.
[6] I. Amado, L. Brénugat-Herné, E. Orriols, C. Desombre, M. Dos Santos, Z. Prost, M. O. Krebs, and P. Piolino, “A serious game to improve cognitive functions in schizophrenia: A pilot study,” Frontiers in Psychiatry, vol. 7, pp. 1–11, 2016.
[7] P. Słowiński, C. Zhai, F. Alderisio, R. Salesse, M. Gueugnon, L. Marin, B. G. Bardy, M. di Bernardo, and K. Tsaneva-Atanasova, “Dynamic similarity promotes interpersonal coordination in joint action,” Journal of The Royal Society Interface, vol. 13, no. 116, p. 20151093, 2016.
[8] L. Noy, E. Dekel, and U. Alon, “The mirror game as a paradigm for studying the dynamics of two people improvising motion together,” Proceedings of the National Academy of Sciences, vol. 108, no. 52, pp. 20947–20952, 2011.
