Miss Tools and Mr Fruit: Emergent communication in agents learning about object affordances

Diane Bouchacourt; Marco Baroni

arXiv:1905.11871·cs.CL·May 29, 2019


TL;DR

This paper investigates how deep network agents develop communication protocols about object affordances, revealing that symmetry alone does not ensure the emergence of a shared language, with agents creating multiple idiolects.

Contribution

It introduces a new task simulating human-like object affordance understanding and analyzes the conditions for genuine bilateral communication among agents.

Findings

01

Agents solve the task through referential communication.

02

Multiple idiolects emerge among agents.

03

Full symmetry does not guarantee a common language.

Abstract

Recent research studies communication emergence in communities of deep network agents assigned a joint task, hoping to gain insights on human language evolution. We propose here a new task capturing crucial aspects of the human environment, such as natural object affordances, and of human conversation, such as full symmetry among the participants. By conducting a thorough pragmatic and semantic analysis of the emergent protocol, we show that the agents solve the shared task through genuine bilateral, referential communication. However, the agents develop multiple idiolects, which makes us conclude that full symmetry is not a sufficient condition for a common language to emerge.

Tables

Table 1: Test performance and pragmatic measures (mean and SEM) in different settings. "Av. perf." (average performance) denotes the % of samples where the best tool was chosen; "Bi. comm." denotes the % of games with bilateral communication taking place. "Av. conv. length" is the average conversation length in turns. "T chooses" denotes the % of games ended by Tool Player. ME values marked with an asterisk ($^{*}$) are statistically significantly higher than their reverse (e.g., $\text{ME}^{F\rightarrow T}>\text{ME}^{T\rightarrow F}$). Best "Av. perf." and "Bi. comm." in bold.

		No communication		With communication
	Metric	In	Transfer	In	Transfer
No memory	Av. perf. (%)	$84.83 \pm 0.09$	$84.0 \pm 0.11$	$96.9 \pm 0.32$	$94.5 \pm 0.37$
	$M E^{F \to T}$	${0.133}^{*} \pm 0.01$	${0.14}^{*} \pm 0.01$	${5.0}^{*} \pm 0.39$	${5.0}^{*} \pm 0.36$
	$M E^{T \to F}$	$0.05 \pm 0.02$	$0.030 \pm 0.01$	$3.9 \pm 0.38$	$3.3 \pm 0.30$
	$M E^{1 \to 2}$	$0.066 \pm 0.00$	$0.067 \pm 0.01$	$3.9 \pm 0.29$	$3.7 \pm 0.26$
	$M E^{2 \to 1}$	${0.12}^{*} \pm 0.02$	${0.10}^{*} \pm 0.01$	${5.0}^{*} \pm 0.38$	${4.7}^{*} \pm 0.33$
	Bi. comm. (%)	$1.4 \pm 0.31$	$1.3 \pm 0.40$	$\mathbf{88} \pm 2.49$	$\mathbf{89} \pm 2.24$
	Av. conv. length	$0.508 \pm 0.01$	$0.52 \pm 0.01$	$2.16 \pm 0.08$	$2.21 \pm 0.10$
	T chooses (%)	$99.4 \pm 0.63$	$99.6 \pm 0.56$	$85 \pm 2.09$	$83 \pm 2.56$
With memory	Av. perf. (%)	$88.5 \pm 0.11$	$87.7 \pm 0.16$	$97.4 \pm 0.12$	$95.3 \pm 0.16$
	$M E^{F \to T}$	${0.11}^{*} \pm 0.01$	${0.13}^{*} \pm 0.01$	${3.0}^{*} \pm 0.29$	${2.8}^{*} \pm 0.24$
	$M E^{T \to F}$	$0.064 \pm 0.01$	$0.071 \pm 0.01$	$1.8 \pm 0.22$	$1.8 \pm 0.21$
	$M E^{1 \to 2}$	$0.085 \pm 0.01$	$0.10 \pm 0.01$	$2.4 \pm 0.29$	$2.3 \pm 0.22$
	$M E^{2 \to 1}$	$0.093 \pm 0.01$	$0.103 \pm 0.01$	$2.4 \pm 0.22$	$2.4 \pm 0.21$
	Bi. comm. (%)	$3.8 \pm 0.61$	$4.6 \pm 0.68$	$78 \pm 2.55$	$78 \pm 2.65$
	Av. conv. length	$1.50 \pm 0.06$	$1.46 \pm 0.06$	$2.7 \pm 0.11$	$2.7 \pm 0.11$
	T chooses (%)	$87.3 \pm 1.34$	$85.8 \pm 1.48$	$81 \pm 2.94$	$81 \pm 3.00$

Table 2: Semantic classifier % accuracy (mean and SEM over successful training seeds).

Messages	Fruit	Tool 1	Tool 2
Both	$37 \pm 1.70$	$31 \pm 1.21$	$24 \pm 1.07$
F	$37 \pm 1.75$	$23.3 \pm 0.66$	$16.7 \pm 0.51$
T	$14.1 \pm 0.79$	$32 \pm 1.17$	$25 \pm 1.04$
Stats. (%)	$5.786 \pm 0.00$	$8.76 \pm 0.01$	$7.682 \pm 0.01$

Table A1: $M_{F}$. Rows are dataset fruit features, columns are functional fruit features.

$$\tilde{p}(z^{B}_{t+1}) = \sum_{m^{\prime A}_{t}} p(z^{B}_{t+1} \mid m^{\prime A}_{t})\, \tilde{p}(m^{\prime A}_{t}). \tag{3}$$

Table A3: Detailed ME values (compare to Table 1 in the main paper). $1T/2F$ denotes games where the Tool Player is in first position and the Fruit Player is in second position, and $1F/2T$ denotes games where the Fruit Player is in first position and the Tool Player is in second.

		No communication		With communication
	Metric	In	Transfer	In	Transfer
No memory	Av. perf. (%)	$84.83 \pm 0.09$	$84.0 \pm 0.11$	$96.9 \pm 0.32$	$94.5 \pm 0.37$
	$M E^{F \to T}$	${0.133}^{*} \pm 0.01$	${0.14}^{*} \pm 0.01$	${5.0}^{*} \pm 0.39$	${5.0}^{*} \pm 0.36$
	$M E^{T \to F}$	$0.05 \pm 0.02$	$0.030 \pm 0.01$	$3.9 \pm 0.38$	$3.3 \pm 0.30$
	$M E^{1 \to 2}$	$0.066 \pm 0.00$	$0.067 \pm 0.01$	$3.9 \pm 0.29$	$3.7 \pm 0.26$
	$M E^{2 \to 1}$	${0.12}^{*} \pm 0.02$	${0.10}^{*} \pm 0.01$	${5.0}^{*} \pm 0.38$	${4.7}^{*} \pm 0.33$
	$M E^{1 \to 2} 1 T / 2 F$	$0.000001 \pm 0.00$	$0.000001 \pm 0.00$	$3.7 \pm 0.46$	$3.0 \pm 0.34$
	$M E^{2 \to 1} 1 T / 2 F$	${0.13}^{*} \pm 0.01$	${0.15}^{*} \pm 0.01$	${5.8}^{*} \pm 0.50$	${5.7}^{*} \pm 0.47$
	$M E^{1 \to 2} 1 F / 2 T$	$0.133 \pm 0.01$	${0.13}^{*} \pm 0.01$	$4.2 \pm 0.34$	${4.4}^{*} \pm 0.34$
	$M E^{2 \to 1} 1 F / 2 T$	$0.10 \pm 0.03$	$0.06 \pm 0.02$	$4.2 \pm 0.39$	$3.6 \pm 0.29$
With memory	Av. perf. (%)	$88.5 \pm 0.11$	$87.7 \pm 0.16$	$97.4 \pm 0.12$	$95.3 \pm 0.16$
	$M E^{F \to T}$	${0.11}^{*} \pm 0.01$	${0.13}^{*} \pm 0.01$	${3.0}^{*} \pm 0.29$	${2.8}^{*} \pm 0.24$
	$M E^{T \to F}$	$0.064 \pm 0.01$	$0.071 \pm 0.01$	$1.8 \pm 0.22$	$1.8 \pm 0.21$
	$M E^{1 \to 2}$	$0.085 \pm 0.01$	$0.10 \pm 0.01$	$2.4 \pm 0.29$	$2.3 \pm 0.22$
	$M E^{2 \to 1}$	$0.093 \pm 0.01$	$0.103 \pm 0.01$	$2.4 \pm 0.22$	$2.4 \pm 0.21$
	$M E^{1 \to 2} 1 T / 2 F$	$0.063 \pm 0.01$	$0.064 \pm 0.01$	$1.8 \pm 0.24$	$1.8 \pm 0.23$
	$M E^{2 \to 1} 1 T / 2 F$	${0.12}^{*} \pm 0.02$	${0.13}^{*} \pm 0.02$	${2.9}^{*} \pm 0.25$	${2.9}^{*} \pm 0.25$
	$M E^{1 \to 2} 1 F / 2 T$	${0.106}^{*} \pm 0.01$	${0.13}^{*} \pm 0.02$	${3.1}^{*} \pm 0.35$	${2.8}^{*} \pm 0.25$
	$M E^{2 \to 1} 1 F / 2 T$	$0.065 \pm 0.01$	$0.077 \pm 0.01$	$1.8 \pm 0.21$	$1.9 \pm 0.20$

Table A4: Semantic classifier % accuracy in the inverted-roles setup.

Utterances	Fruit	Tool 1	Tool 2
Both, A is F	$42 \pm 2.21$	$32 \pm 2.04$	$27 \pm 1.33$
Both, B is F	$44 \pm 2.00$	$28 \pm 1.58$	$28 \pm 1.69$
Both, Train A is F / Test B is F	$6.8 \pm 0.61$	$11 \pm 1.16$	$8.8 \pm 0.68$
Both, Train B is F / Test A is F	$5.9 \pm 0.53$	$10 \pm 1.05$	$8.8 \pm 0.62$
Stats A is F	$6.4 \pm 0.27$	$8.9 \pm 0.42$	$8.2 \pm 0.38$
Stats B is F	$6.4 \pm 0.15$	$9.1 \pm 0.61$	$9.0 \pm 0.75$

Equations

$$\tilde{p}(z^{B}_{t+1}) = \sum_{m^{\prime A}_{t}} p(z^{B}_{t+1} \mid m^{\prime A}_{t})\, \tilde{p}(m^{\prime A}_{t})$$

$$\text{ME}^{A\rightarrow B}_{t} = \text{KL}\Big(p(z^{B}_{t+1} \mid m^{A}_{t}) \,\Big\|\, \tilde{p}(z^{B}_{t+1})\Big)$$

$$\tilde{p}(z^{B}_{t+1,k}) = \sum_{j=1}^{J} p(z^{B}_{t+1,k} \mid m^{\prime A}_{t,j})\, \tilde{p}(m^{\prime A}_{t,j}).$$

$$\text{ME}^{A\rightarrow B}_{t} = \frac{1}{K} \sum_{k=1}^{K} \log \frac{p(z^{B}_{t+1,k} \mid m^{A}_{t})}{\tilde{p}(z^{B}_{t+1,k})}.$$


Code & Models

Repositories

facebookresearch/fruit-tools-game
pytorch


Full text

Affiliations: (1) Facebook A.I. Research; (2) ICREA. Contact: {dianeb,mbaroni}@fb.com


1 Introduction

The advent of powerful deep learning architectures has revived research in simulations of language emergence among computational agents that must communicate to accomplish a task (e.g., Jorge et al., 2016; Havrylov and Titov, 2017; Kottur et al., 2017; Lazaridou et al., 2017; Lee et al., 2017; Choi et al., 2018; Evtimova et al., 2018; Lazaridou et al., 2018). The nature of the emergent communication code should provide insights on questions such as to what extent comparable functional pressures could have shaped human language, and whether deep learning models can develop human-like linguistic skills. For such inquiries to be meaningful, the designed setup should reflect as many aspects of human communication as possible. Moreover, appropriate tools should be applied to the analysis of emergent communication, since, as several recent studies have shown, agents might succeed at a task without truly relying on their communicative channel, or by means of ad-hoc communication techniques overfitting their environment (Kottur et al., 2017; Bouchacourt and Baroni, 2018; Lowe et al., 2019).

We contribute on both fronts. We introduce a game meeting many desiderata for a natural communication environment. We further propose a two-pronged analysis of emerging communication, at the pragmatic and semantic levels. At the pragmatic level, we study communicative acts from a functional perspective, measuring whether the messages produced by an agent have an impact on the subsequent behaviour of the other. At the semantic level, we decode which aspects of the extra-linguistic context the agents refer to, and how such reference acts differ between agents. Some of our conclusions are positive. Not only do the agents solve the shared task, but genuine bilateral communication helps them to reach higher reward. Moreover, their referential acts are meaningful given the task, carrying the semantics of their input. However, we also find that even perfectly symmetric agents converge to distinct idiolects instead of developing a single, shared code.

2 The fruit and tools game

Our game, inspired by Tomasello's (2014) conjecture that the unique cognitive abilities of humans arose from the requirements of cooperative interaction, is schematically illustrated in Fig. 1. In each episode, a randomly selected agent is presented with instances of two tools (knife, fork, axe…), the other with a fruit instance (apple, pear, plum…). Tools and fruits are represented by property vectors (e.g., has a blade, is small), with each instance characterized by values randomly varying around the category mean (e.g., an apple instance might be smaller than another). An agent is randomly selected to be the first to perform an action. The game then proceeds for an arbitrary number of turns. At each turn, one of the agents must decide whether to pick one of the two tools and stop, or to continue, in which case the message it utters is passed to the other agent, and the game proceeds. Currently, for ease of analysis, messages are single discrete symbols selected from a vocabulary of size $10$, but extension to symbol sequences is trivial (although it would of course complicate the analysis). As soon as an agent picks a tool, the game ends. The agents receive a binary reward of 1 if they picked the better tool for the fruit at hand, 0 otherwise. The best choice is computed by a utility function that takes into account the interaction between tool and fruit instance properties (e.g., as in Fig. 1, a tool with a round edge might be particularly valuable if the fruit has a pit). Utility is relative: given a peach, the axe is worse than the spoon, but it would be the better tool when the alternative is a hammer.

Here are some desirable properties of our setup, as a simplified simulation of human interactions. The agents are fully symmetric and cannot specialize to a fixed role or turn-taking scheme. The number of turns is open and determined by the agents. In pure signaling/referential games (Lewis, 1969), the aim is successful communication itself. In our game, reward depends instead on tool and fruit affordances. Optimal performance can only be achieved by jointly reasoning about the properties of the tools and how they relate to the fruit. Humans are rewarded when they use language to solve problems of this sort, and not for successful acts of reference per se. Finally, as we use commonsense descriptions of everyday objects to build our dataset (see below), the distribution of their properties possesses the highly skewed characteristics encountered everywhere in the human environment (Li, 2002). For example, if the majority of fruits need to be cut, a knife is intrinsically more useful than a spoon. Note that the agents do not have any a priori knowledge of the tools' utility. Yet, baseline agents are able to discover context-independent tool affordances and already reach high performance. We believe that this scenario, in which communication-transmitted information complements knowledge that can be directly inferred by observing the world, is more interesting than typical games in which language is the only information carrier.

Game ingredients and utility

We picked 16 tool and 31 fruit categories from McRae et al. (2005) and Silberer et al. (2013), who provide subject-elicited property-based commonsense descriptions of objects, with some extensions. We used $11$ fruit and $15$ tool features from these databases to represent the categories. We rescaled the elicitation-frequency-based property values provided in the norms to lie in the $[0,1]$ range, and manually changed some counter-intuitive values. An object instance is a property vector sampled from the corresponding category as follows. For binary properties such as has a pit, we use Bernoulli sampling with $p$ equaling the category value. For continuous properties such as is small, we sample uniformly from $[\mu-0.1,\mu+0.1]$, where $\mu$ is the category value. We then devised a function returning a utility score for any fruit-tool property vector pair. The function maps properties to a reduced space of abstract functional features (such as break for tools, and hard for fruits). Details are in Supplementary Section 7. For example, an apple with is crunchy=0.7 value gets a high hard functional feature score. A knife with has a blade=1 gets a high cut score, and therefore high utility for the hard apple. Some features, e.g., has a handle for tools, have no impact on utility. They only represent realistic aspects of objects and act as noise. Our dataset with full category property vectors will be publicly released along with code.
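The instance sampling described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `sample_instance`, the feature names, and the category values are hypothetical, and whether sampled continuous values are clipped to $[0,1]$ is not stated in the text, so no clipping is applied here.

```python
import random

def sample_instance(category, binary_features, rng):
    """Sample one object instance from a category's property vector.

    `category` maps feature names to category values in [0, 1].
    `binary_features` is the set of feature names treated as binary.
    """
    instance = {}
    for name, value in category.items():
        if name in binary_features:
            # Bernoulli sampling with p equal to the category value.
            instance[name] = 1.0 if rng.random() < value else 0.0
        else:
            # Uniform sampling in [mu - 0.1, mu + 0.1] (no clipping: an assumption).
            instance[name] = rng.uniform(value - 0.1, value + 0.1)
    return instance

# Hypothetical category values, for illustration only.
apple = {"has_a_pit": 0.0, "is_small": 0.5, "is_crunchy": 0.7}
rng = random.Random(0)
inst = sample_instance(apple, binary_features={"has_a_pit"}, rng=rng)
```

Two instances drawn from the same category then differ slightly in their continuous properties, which is what lets one apple be smaller than another.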

Datasets

We separate the $31$ fruit categories into three sets: in-domain ($21$ categories), validation and transfer ($5$ categories each). The in-domain set is further split into train and test partitions. We train agents on the in-domain train partition and monitor convergence on the validation set. We report test performance on the in-domain test partition and on the transfer set. For example, the peach category is in-domain, meaning that distinct peach instances will be seen at training and in-domain testing time. The nectarine category is in the transfer set, so nectarine instances will only be seen at test time. This scheme tests the generalization abilities of the agents (which can generalize to new fruits since they are all described in the same feature space). We generate $210,000$ in-domain training samples and $25,000$ samples for the other sets, balanced across fruits and tools (which are common across the sets).

Game dynamics and agent architecture

At the beginning of a game episode, two neural network agents A and B receive, randomly, either a pair of tools $(tool_{1},tool_{2})$ (always sampled from different categories) or a $fruit$. The agent receiving the tools (respectively the fruit) will be Tool Player (respectively Fruit Player) for the episode. [Footnote 1: Agents must learn to recognize the assigned role.] The agents are also randomly assigned positions, and the one in position 1 starts the game. Figure 2 shows two turns of the game in which A (blue/left) is Tool Player and in position 1. The first turn is indexed $t=0$, therefore A will act on even turns, B on odd turns. At game opening, each agent passes its input (tool pair or fruit) through a linear layer followed by a tanh nonlinearity, resulting in embedding $i_{A}$ (resp. $i_{B}$). Then, at each turn $t$, an agent, for example A, receives the message $m^{B}_{t-1}$ from agent B, and accesses its own previous internal state $s^{A}_{t-2}$ (we refer to the addition of the agent's previous state as "memory"). The message $m^{B}_{t-1}$ is processed by an RNN, and the resulting hidden state $h^{A}_{t}(m^{B}_{t-1})$ is concatenated with the agent's previous internal state $s^{A}_{t-2}$ and the input embedding $i_{A}$. The concatenated vector is fed to the Body module, composed of a linear layer followed by tanh. The output of the Body module is the new A state, $s^{A}_{t}$, fed to Message and Choice decoders.

The Message decoder is an RNN with hidden state initialized as $s^{A}_{t}$, and outputting a probability distribution $p(m^{A}_{t}|s^{A}_{t})$ over possible A messages. At training time, we sample a message $m^{A}_{t}$; at test time we take the most probable one. The Choice decoder is a linear layer processing $s^{A}_{t}$ and outputting a softmax-normalized vector of size $3$. The latter represents the probabilities $p(c^{A}_{t}|s^{A}_{t})$ over A's possible choices: (i) $c^{A}_{t}=0$ to continue playing, (ii) $c^{A}_{t}=1$ to choose tool $tool_{1}$ and stop, (iii) $c^{A}_{t}=2$ to choose tool $tool_{2}$ and stop. Again, we sample at training and take the argmax at test time. If $c^{A}_{t}=0$, the game continues. B receives message $m^{A}_{t}$, its previous state $s^{B}_{t-1}$ and input embedding $i_{B}$, and it outputs the tuple ($m^{B}_{t+1}$, $c^{B}_{t+1}$, $s^{B}_{t+1}$), and so on, until an agent stops the game, or the maximum number of turns $T_{\text{max}}=20$ is reached.

When an agent stops by choosing a tool, for example $tool_{1}$, we compute the two utilities $U(tool_{1},fruit)$ and $U(tool_{2},fruit)$. If $U(tool_{1},fruit)\geq U(tool_{2},fruit)$, that is, the best tool was chosen, the shared reward is $R=1$. If $U(tool_{1},fruit)<U(tool_{2},fruit)$ or if the agents reach $T_{\text{max}}$ turns without choosing, $R=0$. [Footnote 2: We also tried directly using raw or normalized scalar utilities as rewards, with similar performances.] During learning, the reward is back-propagated with Reinforce (Williams, 1992). When the game starts at $t=0$, we feed the agent in position $1$ a fixed dummy message $m^{0}$, and the previous states of the agents $s^{A}_{t-2}$ and $s^{B}_{t-1}$ are initialized with fixed dummy $s^{0}$. In the no-memory ablation, previous internal states are always replaced by $s^{0}$. When we block communication, agent messages are replaced by $m^{0}$. Supplementary Section 8 provides hyperparameter and training details.
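The turn-taking and reward rules above can be summarized in a short sketch. It abstracts away the neural networks entirely: each agent is a hypothetical step function that maps the incoming message to a `(choice, message)` pair, `None` stands in for the dummy message $m^{0}$, and the toy `utility` lookup is invented for illustration (the actual utility function is described in the supplementary material).

```python
T_MAX = 20  # maximum number of turns, as stated in the text

def play_episode(agents, utility, tools, fruit, start):
    """Run one episode and return (reward, stopping turn, stopping agent).

    `agents` maps "A"/"B" to hypothetical step functions taking the incoming
    message and returning (choice, message), where choice is 0 (continue),
    1 (pick tool 1 and stop) or 2 (pick tool 2 and stop).
    """
    order = [start, "B" if start == "A" else "A"]
    message = None  # stands in for the fixed dummy message m^0
    for t in range(T_MAX):
        speaker = order[t % 2]
        choice, message = agents[speaker](message)
        if choice != 0:
            picked, other = tools[choice - 1], tools[2 - choice]
            # Shared binary reward: 1 if the picked tool is at least as good.
            reward = 1 if utility(picked, fruit) >= utility(other, fruit) else 0
            return reward, t, speaker
    return 0, T_MAX, None  # nobody picked a tool within T_MAX turns

# Toy example: A (Fruit Player) always continues, B (Tool Player) picks tool 1.
toy_utility = lambda tool, fruit: {"knife": 0.9, "spoon": 0.2}[tool]
toy_agents = {"A": lambda m: (0, 3), "B": lambda m: (1, 0)}
result = play_episode(toy_agents, toy_utility, ("knife", "spoon"), "apple", start="A")
```

In this toy run the Fruit Player passes its turn once and the Tool Player stops at turn 1 with the better tool, so the shared reward is 1.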

3 Measuring communication impact

Message Effect

Message Effect (ME) is computed on single turns and uses causal theory (Pearl et al., 2016) to quantify how much what an agent utters impacts the other, compared to the counterfactual scenario in which the speaking agent said something else.

Consider message $m^{A}_{t}$ uttered by A at turn $t$. If $c^{A}_{t}=0$ (that is, A continues the game), $m^{A}_{t}$ is processed by B, along with $s^{B}_{t-1}$ and $i^{B}$. At the next turn, B outputs a choice $c^{B}_{t+1}$ and a message $m^{B}_{t+1}$ drawn from $p(c^{B}_{t+1},m^{B}_{t+1}|s^{B}_{t+1})$. B's state $s^{B}_{t+1}$ is fully determined by $m^{A}_{t},c^{A}_{t},s^{B}_{t-1},i^{B}$, so we can equivalently write that $c^{B}_{t+1}$ and $m^{B}_{t+1}$ are drawn from $p(c^{B}_{t+1},m^{B}_{t+1}|m^{A}_{t},c^{A}_{t},s^{B}_{t-1},i^{B})$. Conditioning on $c^{A}_{t},s^{B}_{t-1},i^{B}$ ensures there are no confounders when we analyze the influence from $m^{A}_{t}$ (Pearl et al., 2016). Supplementary Figure A1 shows the causal graph supporting our assumptions. From here onwards, we no longer write the conditioning on $c^{A}_{t},s^{B}_{t-1},i^{B}$ explicitly.

We define $z^{B}_{t+1}=(c^{B}_{t+1},m^{B}_{t+1})$. We want to estimate how much the message from A, $m^{A}_{t}$, influences the next-turn behaviour (choice and message) of B, $z^{B}_{t+1}$. We thus measure the discrepancy between the conditional distribution $p(z^{B}_{t+1}|m^{A}_{t})$ and the marginal distribution $p(z^{B}_{t+1})$ not taking $m^{A}_{t}$ into account. However, we want to assess agent B's behaviour under other possible received messages $m^{A}_{t}$. To do so, when we compute the marginal of agent B's $p(z^{B}_{t+1})$, we intervene on $m^{A}_{t}$ and draw the messages from the intervention distribution. We define $\tilde{p}(z^{B}_{t+1})$, the marginal computed with counterfactual messages $m^{\prime A}_{t}$, as:

$$\tilde{p}(z^{B}_{t+1}) = \sum_{m^{\prime A}_{t}} p(z^{B}_{t+1} \mid m^{\prime A}_{t})\, \tilde{p}(m^{\prime A}_{t}) \tag{1}$$

where $\tilde{p}(m^{\prime A}_{t})$ is the intervention distribution, different from $p(m^{A}_{t}|s^{A}_{t})$. If at turn $t$, A continues the game, we define the Message Effect (ME) from agent A's message $m^{A}_{t}$ on agent B's choice and message pair, $z^{B}_{t+1}$, as:

$$\text{ME}^{A\rightarrow B}_{t} = \text{KL}\Big(p(z^{B}_{t+1} \mid m^{A}_{t}) \,\Big\|\, \tilde{p}(z^{B}_{t+1})\Big) \tag{2}$$

where KL is the Kullback-Leibler divergence, and $\tilde{p}(z^{B}_{t+1})$ is computed as in Eq. 1. This allows us to measure how much the conditional distribution differs from the marginal. Algorithm 1 shows how we estimate $\text{ME}^{A\rightarrow B}_{t}$. In our experiments, we draw $K=10$ samples $z^{B}_{t+1,k}$, and use a uniform intervention distribution $\tilde{p}(m^{\prime A}_{t})$ with $J=10$. This kind of counterfactual reasoning is explored in depth by Bottou et al. (2013). Jaques et al. (2018) and Lowe et al. (2019) present related measures of causal impact based on the Mutual Information (MI) between influencing and influenced agents. We discuss in Supplementary Section 9 possible issues with the MI-based approach.
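Since Algorithm 1 is not reproduced here, the following is only a sketch of one way the sampled estimate could be computed: draw $K$ outcomes $z^{B}_{t+1,k}$ from the conditional distribution, compute the interventional marginal of each over $J$ counterfactual messages, and average the log-ratios. The callable `cond_dist`, the function name, and the choice to enumerate the whole vocabulary as counterfactuals are assumptions made for illustration.

```python
import math
import random

def message_effect(cond_dist, message, counterfactuals, weights, K, rng):
    """Monte Carlo estimate of ME = KL(p(z | m) || p~(z)).

    cond_dist(m) is a hypothetical callable returning a dict {z: p(z | m)},
    the listener's next-turn distribution over (choice, message) pairs.
    `counterfactuals` and `weights` define the intervention distribution.
    """
    p_given_m = cond_dist(message)
    outcomes = list(p_given_m)
    probs = [p_given_m[z] for z in outcomes]
    total = 0.0
    for z in rng.choices(outcomes, weights=probs, k=K):
        # Marginal of z under the intervention distribution over messages.
        # It is positive as long as `message` is among the counterfactuals.
        p_tilde = sum(w * cond_dist(m_cf).get(z, 0.0)
                      for m_cf, w in zip(counterfactuals, weights))
        total += math.log(p_given_m[z] / p_tilde)
    return total / K

# Toy listener that deterministically echoes the message it hears.
vocab = list(range(10))
uniform = [1.0 / len(vocab)] * len(vocab)
echo = lambda m: {(0, m): 1.0}
me = message_effect(echo, 4, vocab, uniform, K=10, rng=random.Random(0))
```

For the echoing listener the estimate is roughly $\log 10$, whereas a listener whose output distribution ignores the message gets an estimate of approximately zero.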

Bilateral communication

Intuitively, there has been a proper dialogue if, in the course of a conversation, each agent has said at least one thing that influenced the other. We operationalize this through our bilateral communication measure. This is a binary, per-game score that is positive only if in the game there has been at least one turn with significant message effect in both directions, i.e., $\exists\, t, t^{\prime}$ s.t. $\text{ME}^{A\rightarrow B}_{t}>\theta$ and $\text{ME}^{B\rightarrow A}_{t^{\prime}}>\theta$. We set $\theta=0.1$. [Footnote 3: We considered setting $\theta$ to (i) the average ME returned by untrained agents, but this led to a threshold extremely close to $0$, and (ii) the average of the agents' ME values, but this counterintuitively penalized pairs of agents with high overall communication influence.]
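The per-game score is simple enough to state as code. The sketch below assumes the per-turn ME values of one game have already been collected into two lists, one per direction; the function and argument names are hypothetical.

```python
def bilateral_communication(me_a_to_b, me_b_to_a, theta=0.1):
    """Return True if some turn in each direction has ME above theta."""
    return (any(v > theta for v in me_a_to_b)
            and any(v > theta for v in me_b_to_a))

# One game where both agents influence each other, one where only A does.
two_way = bilateral_communication([0.02, 3.1], [0.4])
one_way = bilateral_communication([2.5, 1.7], [0.01, 0.05])
```

A game in which one agent never sends an influential message (or never speaks at all, giving an empty list) scores negative, however large the effect in the other direction.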

4 Results

We first confirm that the agents succeed at the task, and that communication improves their performance. Second, we study their pragmatics, looking at how ablating communication and memory affects their interaction. Finally, we try to interpret the semantics of the agents' messages.

4.1 Performance and pragmatics

We report mean and standard error of the mean (SEM) over successful training seeds. [Footnote 4: That is, training seeds leading to final validation performance above $85\%$.] Each agent A or B can be either $F$ (Fruit Player) or $T$ (Tool Player) and in position $1$ or $2$, depending on the test game. We measure to what extent Tool Player influences Fruit Player ($\text{ME}^{T\rightarrow F}$) and vice versa ($\text{ME}^{F\rightarrow T}$). Similarly, we evaluate position impact by computing $\text{ME}^{1\rightarrow 2}$ and $\text{ME}^{2\rightarrow 1}$. We average ME values over messages sent during each test game, and report averages over test games. Note that we also intervene on the dummy initialization message used at $t=0$, which is received by the agent in position $1$. This impacts the value of $\text{ME}^{2\rightarrow 1}$. If the agent in position $1$ has learned to rely on the initialization message to understand that the game is beginning, an intervention on this message will have an influence we want to take into account. [Footnote 5: Conversely, we ignore the messages agents send when stopping the game, as they are never heard.] Similarly, in the no-communication ablation, when computing ME values, we replace the dummy fixed message the agents receive with a counterfactual. Finally, we emphasize that the computation of ME values does not interfere with game dynamics and does not affect performance.
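As a small illustration of the reporting convention, the sketch below filters runs by the validation threshold from Footnote 4 and computes mean and SEM over the remaining seeds. The numbers are made up, and the use of the sample standard deviation (with $n-1$ in the denominator) is an assumption, since the text does not say which estimator is used.

```python
import math
import statistics

def mean_and_sem(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Hypothetical (validation accuracy, test accuracy in %) pairs, one per seed.
runs = [(0.90, 96.0), (0.80, 50.0), (0.88, 97.0), (0.91, 98.0)]
successful = [test for val, test in runs if val > 0.85]
mean, sem = mean_and_sem(successful)
```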

Both communication and memory help

Table 1 shows that enabling the agents to communicate greatly increases performance compared to the no-communication ablation, both with and without memory, despite the high baseline set by agents that learn about tool usefulness without communicating (see discussion below). Agents equipped with memory perform better than their no-memory counterparts, but the gain in performance is smaller than the gain attained from communication. The overall best performance is achieved with communication and memory. We also see that the agents generalize well, with transfer-fruit performance almost matching that on in-domain fruits. Next, we analyze in detail the impact of each factor (communication, memory) on agent performance and strategies.

No-communication, no-memory

We start by looking at how the game unfolds when communication and agent memory are ablated (top left quadrant of Table 1). Performance is well above chance ($\approx 50\%$), because, as discussed in Section 2, some tools are intrinsically better on average across fruits than others. Without communication, the agents exploit this bias and learn a strategy where (i) Fruit Player never picks the tool but always continues the game and (ii) Tool Player picks the tool according to average tool usefulness. Indeed, Tool Player makes the choice in more than $99\%$ of the games. Conversation length is $0$ if Tool Player starts and $1$ if it is the second agent, requiring the starting Fruit Player to pass its turn. Reassuringly, ME values are low, confirming the reliability of this communication score, and indicating that communication-deprived agents did not learn to rely on the fixed dummy message (e.g., by using it as a constant bias). Still, we observe that, across the consistently low values, Fruit Player appears to affect Tool Player significantly more than the reverse ($\text{ME}^{F\rightarrow T}>\text{ME}^{T\rightarrow F}$). This is generally observed in all configurations, and we believe it is due to the fact that Tool Player takes charge of most of the reasoning in the game. We come back to this later in our analysis. We also observe that the second player impacts the first more than the reverse ($\text{ME}^{2\rightarrow 1}>\text{ME}^{1\rightarrow 2}$). We found this to be an artifact of the strategy adopted by the agents. In the games in which Tool Player starts and immediately stops the game, we can only compute ME for the Tool/position-1 agent, by intervening on the initialization. The resulting value, while tiny, is unlikely to be exactly $0$. In the games where Fruit Player starts and Tool Player stops at the second turn, we compute instead two tiny MEs, one per agent. Hence the observed asymmetry. We verified this hypothesis by removing single-turn games: the influence of the second player on the first indeed disappears.

Impact of communication

The top quadrants of Table 1 show that communication helps performance, despite the high baseline set by the "average tool usefulness" strategy. Importantly, when communication is added, we see a dramatic increase in the proportion of games with bilateral communication, confirming that improved performance is not due to an accidental effect of adding a new channel (Lowe et al., 2019). ME and average number of turns also increase. Fruit Player is the more influential agent. This effect is not due to the artifact we found in the no-communication ablation, because almost all conversations, including those started by Tool Player, are longer than one turn, so we can compute both $\text{ME}^{F\rightarrow T}$ and $\text{ME}^{T\rightarrow F}$. We believe the asymmetry to be due to the fact that Tool Player is the agent that demands more information from the other, as it is the one that sees the tools, and that in the large majority of cases makes the final choice. Supplementary Table A3 shows that the gap between the influence of the Fruit Player on the Tool Player and its reverse is greater when the Fruit Player is in position $2$. This, then, explains $\text{ME}^{2\rightarrow 1}>\text{ME}^{1\rightarrow 2}$ as an epiphenomenon of Fruit Player being more influential.

Is memory ablation necessary for communication to matter?

An important observation from previous research is that depriving at least one agent of memory might be necessary to develop successful multi-turn communication (Kottur et al., 2017; Cao et al., 2018; Evtimova et al., 2018). This is undesirable, as language should obviously not emerge simply as a surrogate for the memories of amnesiac agents. The performance and communicative behaviour results in the bottom right quadrant of Table 1 show that, in our game, genuine linguistic interaction (as cued by ME and bilateral communication scores) is present even when both agents are equipped with memory. It is interesting however to study how adding memory affects the game dynamics independently of communication. In the bottom left quadrant, we see that memory leads to some task performance improvement for communication-less agents. Manual inspection of example games reveals that such agents are developing turn-based strategies. For example, Tool Player learns to continue the game at turn $t$ if $tool_{1}$ has a round end. At $t+1$, Fruit Player can use the fact that Tool Player continues at $t$ as information about relative tool roundness, and either pick the appropriate one based on the fruit or continue to gather more information. In a sense, agents learn to use the possibility to stop or continue at each turn as a rudimentary communication channel. Indeed, exchanges are on average longer when memory is involved, and turn-based strategies appear even with communication. In the latter case, agents rely on communication but also on turn-based schemes, resulting in lower ME values and bilateral communication compared to the no-memory ablation. Finally, the respective positions of the agents in the conversation no longer impact ME ($\text{ME}^{1\rightarrow 2}\approx\text{ME}^{2\rightarrow 1}$). This might be because, with memory, the starting agent can identify whether it is at turn $t=0$, where it almost always chooses to continue the game to send and receive more information via communication. Intervening on the dummy initialization message has a lower influence, resulting in lower $\text{ME}^{2\rightarrow 1}$.

4.2 Conversation semantics

Having ascertained that our agents are conducting bidirectional conversations, we next try to decode the contents of such conversations. To do this, we train separate classifiers to predict, from the message exchanges in successful in-domain test games, what Fruit, Tool 1 and Tool 2 are in the game. [Footnote 6: We focus on the in-domain set as there are just 5 transfer fruit categories. We also tried predicting triples at once with a single classifier, which consistently reached above-baseline but very low accuracies.] Consider for example a game in which the fruit is apple and tools 1 and 2 are knife and spoon, respectively. If the message-based classifiers are, say, able to successfully decode apple but not knife/spoon, this suggests that the messages are about the fruit but not the tools. For each prediction task, we train classifiers (i) on the whole conversation, i.e., both agents’ utterances (Both), and (ii) on either Player’s utterances alone: Fruit only (F) or Tool only (T). For comparison, we also report the accuracy of a baseline that makes guesses based on the train category distribution (Stats), which is stronger than chance. We report mean accuracy and SEM across successful training seeds. Supplementary Section 11 provides further details on classifier implementation and training.
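To make the Stats baseline concrete, the sketch below shows two possible readings of "guesses based on the train category distribution". Both readings are assumptions, since the text does not specify whether guesses are sampled from the distribution or always set to the most frequent category; the function names and the toy labels are hypothetical and are not the paper's data.

```python
from collections import Counter


def category_distribution(labels):
    """Empirical probability of each category in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def sampling_baseline_accuracy(train_labels, test_labels):
    """Expected accuracy when each guess is drawn from the train distribution."""
    p_train = category_distribution(train_labels)
    p_test = category_distribution(test_labels)
    return sum(p_train.get(c, 0.0) * p for c, p in p_test.items())


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy when always guessing the most frequent train category."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


# Toy, made-up labels with a skewed category distribution.
train = ["apple", "apple", "apple", "banana"]
test = ["apple", "apple", "apple", "banana"]

uniform_chance = 1 / len(set(train))                        # 0.5
sampled = sampling_baseline_accuracy(train, test)           # 0.625
majority = majority_baseline_accuracy(train, test)          # 0.75
```

On this skewed toy distribution both variants exceed uniform chance, which illustrates why a distribution-aware baseline is a stricter reference point than chance whenever categories are not equally frequent.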

The first row of Table 2 shows that the conversation as a whole carries information about any object. The second and third rows show that the agents are mostly conveying information about their respective objects (which is very reasonable), but also, to a lesser extent though still well above baseline level, about the other agent’s input. This latter observation is intriguing. Further work should ascertain whether it is an artifact of fruit-tool correlations, or whether it points to more interesting linguistic phenomena (e.g., asking ‘‘questions’’). The asymmetry between Tool 1 and Tool 2 would also deserve further study, but importantly the agents are clearly referring to both tools, showing that they are not adopting entirely degenerate strategies. [Footnote 7: We experiment with single-symbol messages (and multi-turn conversation), but with longer messages we could potentially witness interesting phenomena such as the emergence of compositionality. We leave this exploration for future work.]

We tentatively conclude that the agents did develop the expected semantics, both being able to refer to all objects in the games. Did they, however, develop shared conventions to refer to them, as in human language? This would not be an unreasonable expectation, since the agents are symmetric and learn to play both roles and in both positions. Following up on the idea of ‘‘self-play’’ of Graesser et al. (2019), after a pair of agents A and B are trained, we replace at test time agent B’s embedders and modules with those of A, that is, we let one agent play with a copy of itself. If A and B are speaking the same language, this should not affect test performance. Instead, we find that with self-play average game performance drops to $67\%$ and $65\%$ on the in-domain and transfer test sets, respectively. This suggests that the agents developed their own idiolects. The fact that performance is still above chance could be because the idiolects are at least partially exchangeable, or simply because agents can still do reasonably well by relying on knowledge of average tool usefulness (self-play performance is below that of the communication-less agents in Table 1). To decide between these interpretations, we trained the semantic classifier on conversations where A is the Fruit Player and B the Tool Player, and tested it on conversations about the same inputs but with the roles inverted. Performance drops to the level of the Stats baseline (Supplementary Table A4), supporting the conclusion that non-random performance is due to knowledge acquired by the agents independently of communication, and not to partial similarity among their codes.

5 Related work

Games

Within the long history of early work modeling language evolution between agents (e.g. Steels, 2003; Brighton et al., 2003), Reitter and Lebiere (2011) simulate human language evolution with a Pictionary-type task. More recently, with the advent of neural network architectures, the literature has focused on simple referential games in which a sender sends a single message to a receiver, and reward depends directly on communication success (e.g., Lazaridou et al., 2017; Havrylov and Titov, 2017; Lazaridou et al., 2018). Evtimova et al. (2018) extend the referential game by presenting the sender and receiver with referent views in different modalities, and by allowing multiple message rounds. Still, reward is given directly for referential success, and the roles and turns of the agents are fixed. Das et al. (2017) generalize Lewis’ signaling game (Lewis, 1969) and propose a cooperative image guessing game between two agents, a question bot and an answer bot. They find that grounded language emerges without supervision. Cao et al. (2018) (expanding on Lewis et al., 2017) propose a setup where two agents see the same set of items, and each is provided with arbitrary, episode-specific utility functions for the objects. The agents must converge in multi-turn conversation to a decision about how to split the items. The fundamental novelty of our game with respect to theirs is that our rewards depend on consistent, realistic commonsense knowledge that is stable across episodes (hammers are good for breaking hard-shell fruits, etc.). Mordatch and Abbeel (2018) (see also Lowe et al., 2017) study emergent communication among multiple ($>2$) agents pursuing their respective goals in a maze. In their setup, fully symmetric agents are encouraged to use flexible, multi-turn communication as a problem-solving tool. However, the independent complexities of navigation make the environment somewhat cumbersome if the aim is to study emergent communication.

Communication analysis

Relatively few papers have focused specifically on the analysis of the emergent communication protocol. Among the ones most closely related to our line of inquiry, Kottur et al. (2017) analyze a multi-turn signaling game. One important result is that, in their game, the agents only develop a sensible code if the sender is deprived of memory across turns. Evtimova et al. (2018) study the dynamics of agent confidence and informativeness as a conversation progresses. Cao et al. (2018) train probe classifiers to predict, from the messages, each agent’s utility function and the decided split of items. Most directly related to our pragmatic analysis, Lowe et al. (2019), who focus on simple matrix communication games, introduce the notions of positive signaling (an agent sends messages that are related to its state) and positive listening (an agent’s behaviour is influenced by the message it receives). They show that positive signaling does not entail positive listening, and that commonly used metrics might not detect the presence of one or the other. We build on their work by focusing on the importance of mutual positive listening in communication (our ‘‘bilateral communication’’ measure). We further refine the causal approach to measuring influence that they introduce. Jaques et al. (2018) also use the notion of causal influence, both directly as a term in the agent cost function and to analyze the agents’ behaviour.

6 Discussion

We introduced a more challenging and arguably more natural game to study emergent communication in deep network agents. Our experiments show that these agents do develop genuine communication even when

(i) successful communication per se is not directly rewarded;

(ii) the observable environment already contains stable, reliable information helping to solve the task (object affordances); and

(iii) the agents are not artificially forced to rely on communication by erasing their memory.

The linguistic exchanges of the agents are not only leading to significantly better task performance, but can be properly pragmatically characterized as dialogues, in the sense that the behaviour of each agent is affected by what the other agent says. Moreover, they use language, at least in part, to denote the objects in their environment, showing primitive hallmarks of a referential semantics.

We also find, however, that agent pairs trained together in fully symmetrical conditions develop their own idiolects, such that an agent won’t (fully) understand itself in self-play. As convergence to a shared code is another basic property of human language, in future research we will explore ways to make it emerge. First, we note that Graesser et al. (2019), who study a simple signaling game, similarly conclude that training single pairs of agents does not lead to the emergence of a common language, which requires diffusion in larger communities. We intend to verify whether a similar trend emerges when we extend our game to larger agent groups. Alternatively, equipping the agents with a feedback loop in which they also receive their own messages as input might encourage shared codes across speaker and listener roles.

In the current paper, we limited ourselves to one-symbol messages, facilitating analysis but greatly reducing the spectrum of potentially emergent linguistic phenomena to study. Another important direction for future work is thus to endow agents with the possibility of producing, at each turn, a sequence of symbols, and analyze how this affects conversation dynamics and the communication protocol. Finally, having shown that agents succeed in our setup, we intend to test them with larger, more challenging datasets, possibly involving more realistic perceptual input.

Acknowledgments

We thank Rahma Chaabouni, Evgeny Kharitonov, Emmanuel Dupoux, Maxime Oquab and Jean-Rémi King for their useful discussions and insights. We thank David Lopez-Paz and Christina Heinze-Deml for their feedback on the causal influence of communication. We also thank Francisco Massa for his help on setting up the experiments.

7 Data and utility computation

This section provides additional details on the dataset we use and the utility function we employ to compute the utilities between fruits and tools. Note that we refer to fruits for conciseness, but some vegetables, such as carrot and potato, are included.

There are $11$ fruit features: is crunchy, has skin, has peel, is small, has rough skin, has a pit, has milk, has a shell, has hair, is prickly, has seeds; and $15$ tool features: has a handle, is sharp, has a blade, has a head, is small, has a sheath, has prongs, is loud, is serrated, has handles, has blades, has a round end, is adorned with feathers, is heavy, has jaws. Note that, when we sample instances of each category as explained in Section 2 of the main paper, features are sampled independently. We filter out, however, nonsensical combinations. For example, the features has prongs, has a blade and has blades are treated as pairwise mutually exclusive.
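The combination of independent sampling and filtering described above can be sketched as rejection sampling. This is only one possible way to implement the filter and is an assumption, as are the per-feature probabilities, the reduced feature set and all function names below; only the mutually exclusive group is taken from the text.

```python
import random

# Hypothetical per-feature probabilities for one tool category (made-up values).
FEATURE_PROBS = {
    "has a handle": 0.9,
    "has a blade": 0.6,
    "has blades": 0.3,
    "has prongs": 0.2,
    "is heavy": 0.5,
}

# Features named in the text as pairwise mutually exclusive.
EXCLUSIVE_GROUP = {"has prongs", "has a blade", "has blades"}


def is_sensible(features):
    """Keep a sample only if it has at most one feature from the exclusive group."""
    return len(features & EXCLUSIVE_GROUP) <= 1


def sample_instance(rng, feature_probs=FEATURE_PROBS, max_tries=1000):
    """Draw each feature independently, rejecting nonsensical combinations."""
    for _ in range(max_tries):
        features = {name for name, p in feature_probs.items() if rng.random() < p}
        if is_sensible(features):
            return features
    raise RuntimeError("no sensible combination found; check the probabilities")


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_instance(rng) for _ in range(100)]
```

Rejection keeps the retained features independent apart from the exclusion constraint; other filtering schemes are possible and the text does not say which one was used.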

In order to compute the utility for a pair ($tool$, $fruit$), we use three mapping matrices. The mapping matrix $M_{T}\in\mathbb{R}^{15\times 6}$ (see the corresponding supplementary table) maps from the space of tool features to a space of more general functional features: (cut, spear, lift, break, peel, pit remover). Similarly, $M_{F}\in\mathbb{R}^{11\times 6}$ (Table 7) maps from the space of fruit features to a space of functional features: (hard, pit, shell, pick, peel, empty inside). Finally, the matrix $M\in\mathbb{R}^{6\times 6}$ (Table A2) maps the two abstract functional feature spaces to each other. For example, if an axe sample is described by the vector $t_{a}\in\mathbb{R}^{1\times 15}$ and a nectarine sample by the vector $f_{n}\in\mathbb{R}^{1\times 11}$, the utility is computed as $U(t_{a},f_{n})=(f_{n}M_{F})M^{\prime}(t_{a}M_{T})^{\prime}$, where $^{\prime}$ denotes transpose. We always add a value of $0.01$ to avoid zero utilities. We can therefore compute the utility of any combination of (possibly new) fruits and tools, as long as each can be described in the corresponding functional representational space. Note that in our case we have the same number of abstract functional features for fruits and tools ($6$), but the two numbers need not be the same. In other words, $M$ need not be a square matrix.
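The utility formula can be illustrated with a small worked sketch. The matrices below are made-up toy values with 3 tool features, 2 fruit features, 2 tool functions and 3 fruit functions; they are not the paper's $M_T$, $M_F$ or $M$, and the function names are hypothetical. The toy $M$ is deliberately non-square ($2\times 3$), which the formula $(f M_F) M^{\prime} (t M_T)^{\prime}$ accommodates when $M$ has one row per tool function and one column per fruit function.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


EPSILON = 0.01  # constant added to avoid zero utilities, as stated in the text


def utility(tool, fruit, m_t, m_f, m):
    """U(t, f) = (f M_F) M' (t M_T)' + 0.01, where ' denotes transpose."""
    f_func = matmul([fruit], m_f)  # 1 x k_F fruit functional features
    t_func = matmul([tool], m_t)   # 1 x k_T tool functional features
    score = matmul(matmul(f_func, transpose(m)), transpose(t_func))  # 1 x 1
    return score[0][0] + EPSILON


# Toy tool features: (has a blade, is heavy, has a handle) -> functions (cut, break).
M_T = [[1, 0],
       [0, 1],
       [0, 0]]
# Toy fruit features: (has a shell, has peel) -> functions (hard, shell, peel).
M_F = [[1, 1, 0],
       [0, 0, 1]]
# Toy M: rows are tool functions (cut, break), columns are fruit functions.
M = [[0, 0, 1],
     [1, 1, 0]]

knife, hammer = [1, 0, 1], [0, 1, 1]
coconut, orange = [1, 0], [0, 1]

u_hammer_coconut = utility(hammer, coconut, M_T, M_F, M)  # 2.01
u_knife_coconut = utility(knife, coconut, M_T, M_F, M)    # 0.01
u_knife_orange = utility(knife, orange, M_T, M_F, M)      # 1.01
```

In this toy setting the hammer is the better tool for the shelled fruit and the knife for the peelable one, while mismatched pairs fall back to the $0.01$ floor.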

Given the values in the mapping matrices, $6$ of the tool features have no impact on the utility computation, since they do not affect the scores of the functional tool features (their rows in the mapping matrix $M_{T}$ contain only zeros). These are: has a handle, is sharp, has a sheath, is loud, has handles, is adorned with feathers. Such features only represent realistic aspects of objects and act as noise.
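The reason an all-zero row makes a feature irrelevant can be checked directly: toggling that feature leaves the functional projection $t M_T$ unchanged. The sketch below uses a made-up three-feature toy matrix, not the paper's $M_T$, and hypothetical function names.

```python
# Toy mapping from tool features to functions (cut, break); made-up values.
TOOL_FEATURES = ["has a blade", "is heavy", "has a handle"]
M_T = [[1, 0],
       [0, 1],
       [0, 0]]  # all-zero row: this feature cannot affect any functional score


def inert_features(feature_names, mapping):
    """Names of features whose row in the mapping matrix is all zeros."""
    return [name for name, row in zip(feature_names, mapping)
            if all(value == 0 for value in row)]


def project(features, mapping):
    """Compute the row vector features x mapping (the functional features)."""
    n_functions = len(mapping[0])
    return [sum(f * row[j] for f, row in zip(features, mapping))
            for j in range(n_functions)]


inert = inert_features(TOOL_FEATURES, M_T)   # ["has a handle"]
with_handle = project([1, 0, 1], M_T)        # [1, 0]
without_handle = project([1, 0, 0], M_T)     # [1, 0]
```

Because the two projections coincide, any utility computed from them is identical, so such features only add variation to the agents' inputs.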

Bibliography

1. Bottou et al. (2013). Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207--3260.
2. Bouchacourt and Baroni (2018). Diane Bouchacourt and Marco Baroni. 2018. How agents see things: On visual representations in an emergent language game. In Proceedings of EMNLP, pages 981--985, Brussels, Belgium.
3. Brighton et al. (2003). Henry Brighton, Simon Kirby, and Kenneth Smith. 2003. Situated cognition and the role of multi-agent models in explaining language structure. Volume 2636, pages 88--109. Springer-Verlag GmbH.
4. Cao et al. (2018). Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z. Leibo, Karl Tuyls, and Stephen Clark. 2018. Emergent communication through negotiation. In Proceedings of ICLR.
5. Choi et al. (2018). Edward Choi, Angeliki Lazaridou, and Nando de Freitas. 2018. Compositional obverter communication learning from raw visual input. In Proceedings of ICLR Conference Track, Vancouver, Canada. Published online: https://openreview.net/group?id=ICLR.cc/2018/Conference.
6. Das et al. (2017). Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. 2017. Learning cooperative visual dialog agents with deep reinforcement learning. In 2017 IEEE International Conference on Computer Vision (ICCV).
7. Evtimova et al. (2018). Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. 2018. Emergent communication in a multi-modal, multi-step referential game. In Proceedings of ICLR Conference Track, Vancouver, Canada. Published online: https://openreview.net/group?id=ICLR.cc/2018/Conference.
8. Graesser et al. (2019). Laura Graesser, Kyunghyun Cho, and Douwe Kiela. 2019. Emergent linguistic phenomena in multi-agent communication games. CoRR.