Set Cross Entropy: Likelihood-based Permutation Invariant Loss Function for Probability Distributions

Masataro Asai

arXiv:1812.01217·cs.LG·December 6, 2018

TL;DR

This paper introduces Set Cross Entropy, a permutation-invariant loss function for neural networks that reconstruct sets without relying on specific network architectures or sequential algorithms, with applications in object reconstruction and rule learning.

Contribution

The paper presents a novel likelihood-based loss function for set reconstruction that is permutation-invariant and does not depend on network topology or sequential processing.

Findings

01 Effective in object reconstruction tasks

02 Applicable to rule learning tasks

03 Has a natural information-theoretic interpretation

Abstract

We propose a permutation-invariant loss function designed for neural networks that reconstruct a set of elements without considering the order within its vector representation. Unlike popular approaches for encoding and decoding a set, our work relies on neither a carefully engineered network topology nor any additional sequential algorithm. The proposed method, Set Cross Entropy, has a natural information-theoretic interpretation and is related to the metrics defined for sets. We evaluate the proposed approach in two object reconstruction tasks and a rule learning task.

Tables

Table 1: The best reconstruction success ratio in the 10 runs of 16 training scenarios. Among the four loss functions, best results are shown in bold. $\mathrm{SH}$ and $A_{1\mathrm{H}}$ both succeeded in reconstructing the binary vectors in the 8-puzzle. The traditional cross entropy $\mathrm{H}$ and the Hausdorff distance $\mathcal{H}_{1\mathrm{H}}$ both failed to reconstruct the binary vectors.

Target ordering	Fixed		Random
Input ordering	(1)Fixed	(2)Random	(3)Fixed	(4)Random
(a) $H$	1.00	1.00	0.00	0.00
(b) $SH$	1.00	1.00	1.00	1.00
(c) $A_{1 H}$	1.00	1.00	1.00	1.00
(d) $ℋ_{1 H}$	0.00	0.00	0.00	0.00

Table 2: The best reconstruction success ratio in the 10 runs of 16 training scenarios, where the size of the set randomly varies from 4 to 9 in the dataset. Among the four loss functions, best results are shown in bold. The proposed $\mathrm{SH}$ achieved the best success rate overall, $A_{1\mathrm{H}}$ comes next, and the traditional cross entropy $\mathrm{H}$ and the Hausdorff distance $\mathcal{H}_{1\mathrm{H}}$ both failed in most cases.

Target ordering	Fixed		Random
Input ordering	(1)Fixed	(2)Random	(3)Fixed	(4)Random
(a) $H$	0.76	0.85	0.12	0.00
(b) $SH$	0.60	0.65	0.62	0.62
(c) $A_{1 H}$	0.57	0.56	0.58	0.56
(d) $ℋ_{1 H}$	0.00	0.00	0.00	0.00

Table 3: RMSE between the visualized images, the best results of 10 runs. Both $\mathrm{SH}$ and $A_{1\mathrm{H}}$ successfully converged below the sufficient accuracy.

Target ordering	Fixed		Random
Input ordering	(1)Fixed	(2)Random	(3)Fixed	(4)Random
(a) $H$	0.10	0.10	0.15	0.15
(b) $SH$	0.08	0.08	0.07	0.08
(c) $A_{1 H}$	0.08	0.10	0.08	0.08
(d) $ℋ_{1 H}$	0.14	0.15	0.16	0.15

Table 4: The rate of correct answers on the test set, best results of 10 runs. “Random” indicates that the target output is shuffled.

	$n = 2$ , $neighbor2 (a, b, c) :- \dots$		$n = 3$ , $neighbor3 (a, b, c, d) :- \dots$
	Dataset: 2858 ground clauses		Dataset: 11000 ground clauses
	Training: 2250 clauses; Test: 250 clauses.		Training: 2250 clauses; Test: 250 clauses.
	Fixed	Random	Fixed	Random
(a) $H$	0.32	0.36	0.10	0.10
(b) $SH$	0.96	0.94	0.72	0.70
(c) $A_{1 H}$	0.91	0.94	0.61	0.60
(d) $ℋ_{1 H}$	0.87	0.85	0.55	0.57
	$n = 4$ , $neighbor4 (a, b, c, d, e) :- \dots$		$n = 4$ , $neighbor4 (a, b, c, d, e) :- \dots$
	Dataset: 39878 ground clauses		Dataset: 39878 ground clauses
	Training: 2250 clauses; Test: 250 clauses.		Training: 9000 clauses; Test: 1000 clauses.
	Fixed	Random	Fixed	Random
(a) $H$	0.02	0.03	0.04	0.04
(b) $SH$	0.38	0.36	0.87	0.86
(c) $A_{1 H}$	0.33	0.34	0.81	0.79
(d) $ℋ_{1 H}$	0.22	0.24	0.50	0.53
	$n = 5$ , $neighbor5 (a, b, c, d, e, f) :- \dots$		$n = 5$ , $neighbor5 (a, b, c, d, e, f) :- \dots$
	Dataset: 137738 ground clauses		Dataset: 137738 ground clauses
	Training: 2250 clauses; Test: 250 clauses.		Training: 9000 clauses; Test: 1000 clauses.
	Fixed	Random	Fixed	Random
(a) $H$	0.00	0.01	0.01	0.01
(b) $SH$	0.17	0.15	0.65	0.60
(c) $A_{1 H}$	0.13	0.13	0.53	0.56
(d) $ℋ_{1 H}$	0.06	0.06	0.20	0.18

Table 5: The number of instances solved by Latplan using a VAE trained by Set Cross Entropy ( $\mathrm{SH}$ ) and Set Average ( $A_{1\mathrm{H}}$ ) of the cross entropy.

Random walk steps	The number of solved instances
used for generating	(out of 10 instances each)
the problem instances	$SH$	$A_{1 H}$
3	7	7
7	5	3
14	2	1

Equations

$$f(X)=\rho\left(\sum_{x\in X}\phi(x)\right)$$

$$\mathcal{H}_{1d}(X,Y)=\max_{x\in X}\min_{y\in Y}d(x,y)$$

$$A_{1d}(X,Y)=\frac{1}{|X|}\sum_{x\in X}\min_{y\in Y}d(x,y).$$

$$\mathrm{H}({\bm{x}},{\bm{y}})=\mathbb{E}_{{\mathbf{v}}\sim p_{\rm{data}}}\left\langle-\log p_{\rm{model}}({\mathbf{v}}={\bm{v}})\right\rangle=\mathbb{E}_{{\mathbf{v}}\sim p_{\rm{data}}}\left\langle-\log p_{\rm{model}}\left(\bigwedge_{i=1}^{F}{\textnormal{v}}_{i}=v_{i}\right)\right\rangle=\sum_{i=1}^{F}-x_{i}\log y_{i}-(1-x_{i})\log(1-y_{i}).$$

$${\mathbf{v}}={\bm{v}}\iff\bigwedge_{i=1}^{F}{\textnormal{v}}_{i}=v_{i}$$

$$S=T\Leftrightarrow(S\subseteq T\land S\supseteq T)\Leftrightarrow(\forall s\in S;s\in T)\land(\forall t\in T;t\in S).$$

$$\begin{aligned}S=T&\Leftrightarrow S\supseteq T\\&\Leftrightarrow\forall t\in T;\ t\in S\\&\Leftrightarrow\forall t\in T;\ \exists s\in S;\ s=t\\&\Leftrightarrow\bigwedge_{t\in T}\bigvee_{s\in S}s=t.\end{aligned}$$

$$\mathbb{E}_{{\mathbf{V}}\sim p_{\rm{data}}}\left\langle-\log p_{\rm{model}}({\mathbf{V}}={\bm{V}})\right\rangle$$

$$\begin{aligned}\log p_{\rm{model}}({\mathbf{V}}={\bm{V}})&=\sum_{i=1}^{N}\log p_{\rm{model}}\left(\bigvee_{j=1}^{N}{\mathbf{v}}_{j}={\bm{v}}_{i}\right)\\&\geq\sum_{i=1}^{N}\log\sum_{j=1}^{N}p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\\&=\sum_{i=1}^{N}\log\sum_{j=1}^{N}\exp\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\\&=\sum_{i=1}^{N}\operatorname{logsumexp}_{j=1}^{N}\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i}).\end{aligned}$$

$$\begin{aligned}\mathbb{E}_{{\mathbf{V}}\sim p_{\rm{data}}}\left\langle-\log p_{\rm{model}}({\mathbf{V}}={\bm{V}})\right\rangle&=-\sum_{i}\mathbb{E}_{{\mathbf{V}}\sim p_{\rm{data}}}\left\langle\operatorname{logsumexp}_{j}\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\right\rangle\\&=-\sum_{i}\mathbb{E}_{{\mathbf{v}}_{i}\sim p_{\rm{data}}}\left\langle\operatorname{logsumexp}_{j}\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\right\rangle\\&\leq-\sum_{i}\operatorname{logsumexp}_{j}\mathbb{E}_{{\mathbf{v}}_{i}\sim p_{\rm{data}}}\left\langle\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\right\rangle\\&=-\sum_{i}\operatorname{logsumexp}_{j}(-\mathrm{H}({\bm{x}}_{i},{\bm{y}}_{j}))\overset{\mathrm{def}}{=}\mathrm{SH}({\bm{X}},{\bm{Y}}).\end{aligned}$$

$$-\sum_{i}\operatorname{logsumexp}_{j}(-\mathrm{H}({\bm{x}}_{i},{\bm{y}}_{j}))$$

$$\therefore\ \mathrm{SH}({\bm{X}},{\bm{Y}})$$

$$\mathrm{SH}({\bm{X}},{\bm{Y}})$$

$$\leq\sum_{i}\mathrm{H}({\bm{x}}_{i},{\bm{y}}_{i})$$

$$A_{1\mathrm{H}}({\bm{X}},{\bm{Y}})=\frac{1}{N}\sum_{i}\min_{j}\mathrm{H}({\bm{x}}_{i},{\bm{y}}_{j})\leq\max_{i}\min_{j}\mathrm{H}({\bm{x}}_{i},{\bm{y}}_{j})=\mathcal{H}_{1\mathrm{H}}({\bm{X}},{\bm{Y}}).$$

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Remote Sensing and LiDAR Applications · Generative Adversarial Networks and Image Synthesis

Full text

Set Cross Entropy: Likelihood-based Permutation Invariant Loss Function for Probability Distributions

Masataro Asai

IBM Research

Abstract

We propose a permutation-invariant loss function designed for neural networks that reconstruct a set of elements without considering the order within its vector representation. Unlike popular approaches for encoding and decoding a set, our work relies on neither a carefully engineered network topology nor any additional sequential algorithm. The proposed method, Set Cross Entropy, has a natural information-theoretic interpretation and is related to the metrics defined for sets. We evaluate the proposed approach in two object reconstruction tasks and a rule learning task.

1 Introduction

Sets are fundamental mathematical objects that appear frequently in real-world datasets. However, there are only a handful of studies on learning a set representation in the machine learning literature. In this study, we propose a new objective function called Set Cross Entropy (SCE) to address permutation-invariant set generation. SCE measures the cross entropy between two sets that consist of multiple elements, where each element is represented as a multi-dimensional probability distribution in $[0,1]\subset\mathbb{R}$ (the closed interval of reals between 0 and 1). SCE is invariant to the permutation of objects and therefore does not distinguish two vector representations of a set that differ only in their ordering. SCE is simple enough to fit in one line and can be naturally interpreted as a formulation of the log-likelihood maximization between two sets derived from a logical statement.

The SCE loss trains a neural network in a permutation-invariant manner, and the network learns to output a set. Importantly, this is not to say that the neural network learns to represent a function that is permutation-invariant with regard to the input. The key difference in our approach is that we allow the network to output a vector representation of a set that may have a different ordering than the examples used during the training. In contrast, previous studies focus on learning a function that returns the same output for different permutations of the input elements. Such scenarios assume that an output value at some index is matched against the target value at the same index. (Footnote 1: For example, the permutation-equivariant/invariant layers in Zaheer et al. (2017) are evaluated on classification tasks, regression tasks and set expansion tasks. In the classification and regression tasks, the network predicts a single value, which has no ordering. In the set expansion tasks, the network predicts the probability $p_{i}$ for each tag $i$ in the output vector. Thus, reordering the output does not make sense.)

This characteristic is crucial in tasks where the objects included in the supervised signals (training examples for the output) do not have any meaningful ordering. For example, in logic rule learning tasks, a first-order logic Horn clause does not care about the ordering inside the rule body since logical conjunctions are invariant to permutations, e.g. $\textsf{father}(\textsf{c},\textsf{f})\leftarrow{\left(\textsf{parent}(\textsf{c},\textsf{f})\land\textsf{male}(\textsf{f})\right)}$ and $\textsf{father}(\textsf{c},\textsf{f})\leftarrow{\left(\textsf{male}(\textsf{f})\land\textsf{parent}(\textsf{c},\textsf{f})\right)}$ are equivalent.

To apply our approach, no special engineering of the network topology is required other than standard hyperparameter tuning. The only requirement is that the target output examples are probability vectors in $[0,1]^{N\times F}$ , which is easily satisfied by appropriate feature engineering, including autoencoders with softmax or sigmoid latent activation.

We demonstrate the effectiveness of our approach in two object-set reconstruction tasks and the supervised theory learning tasks that learn to perform the backward chaining of the horn clauses. In particular, we show that the SCE objective is superior to the training using the other set distance metrics, including Hausdorff and set average (Chamfer) distances.

2 Backgrounds and Related Work

2.1 Learning a Set Representation

Previous studies try to discover the appropriate structure for neural networks that can represent a set. Notable recent work includes permutation-equivariant / invariant layers that address the permutation in the input (Guttenberg et al., 2016; Ravanbakhsh et al., 2016; Zaheer et al., 2017).

Let $X$ be a vector representation of a set ${\left\{x_{1},\ldots,x_{n}\right\}}$ and $\pi$ be an arbitrary permutation function for a sequence. A function $f(X)$ is permutation invariant when $\forall\pi;f(X)=f(\pi(X))$ . Zaheer et al. (2017) showed that a function is permutation-invariant iff it can be decomposed into the form

$$f(X)=\rho\left(\sum_{x\in X}\phi(x)\right)$$

where $\rho,\phi$ are appropriate mapping functions.
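As a small illustration of this sum decomposition, the sketch below builds a permutation-invariant function from a per-element map and a readout. The particular `phi` and `rho` are arbitrary choices made for this example and are not taken from Zaheer et al. (2017).

```python
import itertools

def phi(x):
    # Per-element feature map (arbitrary illustrative choice): x -> (x, x^2).
    return (x, x * x)

def rho(pooled):
    # Readout applied to the pooled features (arbitrary illustrative choice).
    s, s2 = pooled
    return s2 - s

def f(X):
    # f(X) = rho(sum_{x in X} phi(x)); the sum discards the ordering of X.
    feats = [phi(x) for x in X]
    pooled = (sum(a for a, _ in feats), sum(b for _, b in feats))
    return rho(pooled)

X = [1, 2, 3]
# Evaluate f on every ordering of X; a permutation-invariant f yields one value.
values = {f(list(p)) for p in itertools.permutations(X)}
print(values)
```

Because the pooling step is a sum, every ordering of `X` produces the same pooled features and therefore the same output.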

However, as mentioned in the introduction, the aim of these layers is to learn the functions that are permutation-invariant with regard to the input permutation, and not to reconstruct a set in a permutation-invariant manner (i.e. ignoring the ordering). In other words, permutation-equivariant/invariant layers are only capable of encoding a set.

Probst (2018) recently proposed a method dubbed “Set Autoencoder”. It additionally learns a permutation matrix that is applied before the output so that the output matches the target. The target for the permutation matrix is generated by a Gale-Shapley greedy stable matching algorithm, which requires $O(n^{2})$ runtime. The output is compared against the training example with a conventional loss function such as binary cross entropy or mean squared error, which requires the final output to have the same ordering as the target. Therefore, this work tries to learn the set as well as the ordering between the elements, which is conceptually different from learning to reconstruct a set while ignoring the ordering.

Another line of related work utilizes Sinkhorn iterations (Adams & Zemel, 2011; Santa Cruz et al., 2017; Mena et al., 2018) in order to directly learn the permutations. Again, these works assume that the output is generated in a specific order (e.g. a sorting task), which does not align with the concept of set reconstruction.

2.2 Set Distance

Set distances / metrics are binary functions that satisfy the metric axioms. They have been utilized for measuring visual object matching or for feature selection (Huttenlocher et al., 1993; Dubuisson & Jain, 1994; Piramuthu, 1999). Note, however, that in this work we use the terms “distance” and “metric” informally for any non-negative binary function, which may not satisfy the metric axioms.

There are several variants of set distances. Hausdorff distance between sets (Huttenlocher et al., 1993) is a function that satisfies the metric axioms. For two sets $X$ and $Y$ , the directed Hausdorff distance with an element-wise distance $d(x,y)$ is defined as follows:

$$\mathcal{H}_{1d}(X,Y)=\max_{x\in X}\min_{y\in Y}d(x,y)$$

The element-wise distance $d$ is Euclidean distance or Hamming distance, for example, depending on the target domain.

Set average (pseudo) distance (Dubuisson & Jain, 1994, Eq.(6)), also known as Chamfer distance, is a modification of the original Hausdorff distance which aggregates the element-wise distances by summation. The directed version is defined as follows:

$$A_{1d}(X,Y)=\frac{1}{|X|}\sum_{x\in X}\min_{y\in Y}d(x,y).$$

Set average distance has been used for image matching, as well as to autoencode 3D point clouds in Euclidean space for shape matching (Zhu et al., 2016).
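The two directed set distances of this section can be sketched in a few lines. The example below uses the Hamming distance as the element-wise distance $d$; the function names and the sample sets are illustrative only.

```python
def hamming(x, y):
    # Element-wise distance d(x, y): number of differing positions.
    return sum(a != b for a, b in zip(x, y))

def directed_hausdorff(X, Y, d):
    # max over x in X of the distance to its closest y in Y.
    return max(min(d(x, y) for y in Y) for x in X)

def directed_set_average(X, Y, d):
    # Mean over x in X of the distance to its closest y in Y (Chamfer-style).
    return sum(min(d(x, y) for y in Y) for x in X) / len(X)

X = [(0, 0, 0), (1, 1, 1)]
Y = [(1, 1, 1), (0, 0, 1), (1, 0, 0)]

h = directed_hausdorff(X, Y, hamming)
a = directed_set_average(X, Y, hamming)
print(h, a)
```

Both functions only take minima, maxima and sums over the elements, so reordering `X` or `Y` does not change the result.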

3 Set Cross Entropy

Inspired by the various set distances, we propose a straightforward formulation of likelihood maximization between two sets of probability distributions. In what follows, we define the cross entropy between two sets ${\bm{X}},{\bm{Y}}\in[0,1]^{N\times F}$ , where $[0,1]$ is the closed interval of reals between 0 and 1.

Let $\mathcal{X}={\left\{{\bm{X}}^{(1)},{\bm{X}}^{(2)},\ldots\right\}}$ be the training dataset, and ${\bm{Y}}$ be the output matrix of a neural network. Assume that each ${\bm{X}}\in\mathcal{X}$ consists of $N$ elements where each element is represented by $F$ features, i.e. ${\bm{X}}={\left\{{\bm{x}}_{1}\ldots{\bm{x}}_{N}\right\}},{\bm{x}}_{i}\in\mathbb{R}^{F}$ . We further assume that ${\bm{x}}_{i}\in[0,1]^{F}$ by a suitable transformation, e.g., feature learning using an autoencoder with the sigmoid activation added to the latent layer. In practice, the set ${\bm{X}}$ is stored in a vector representation, which essentially makes ${\bm{X}}\in[0,1]^{N\times F}$ . In this paper, we focus on the Bernoulli distribution, where each feature ${x}_{i,f}$ represents the probability that the corresponding binary random variable ${\textnormal{v}}_{i,f}$ is true, i.e. ${x}_{i,f}=P({\textnormal{v}}_{i,f}=1)$ , which also means $P({\textnormal{v}}_{i,f}=0)=1-{x}_{i,f}$ . However, the proposed method naturally extends to the multinomial case where each elementary feature could be represented as a probability vector that sums to 1 instead of a single value in $[0,1]$ .

For simplicity, we assume that the number of elements in the set ${\bm{X}}$ and ${\bm{Y}}$ is known and fixed to $N$ . Therefore ${\bm{Y}}$ is also a matrix in $[0,1]^{N\times F}$ . Furthermore, we assume that ${\bm{X}}$ is preprocessed and contains no duplicated elements.

In practice, if $|{\bm{X}}|$ varies across the dataset, it suffices to take $N^{\max}=\max_{{\bm{X}}\in\mathcal{X}}|{\bm{X}}|$ , the largest number of elements in ${\bm{X}}$ across the dataset $\mathcal{X}$ , and add the dummy, distinct elements $d_{0}\ldots d_{N^{\max}}$ to fill in the blanks. For example, when there are $N$ objects of $F$ features and we want to normalize the size of the set to $M(>N)$ , one way is to add an additional axis to the feature vector ( $F+1$ features) where the additional $F+1$ -th feature is 0 for the real data and 1 for the dummy data, and the additional $M-N$ objects are generated in an arbitrary way (e.g. as a binary sequence 100000, 100001, 100010, 100011, … for $F=5$ ).
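A minimal sketch of this dummy-element padding follows. It assumes binary feature tuples and, to match the example bit strings 100000, 100001, …, places the real/dummy flag as the leading feature; both the flag position and the function name `pad_set` are assumptions of this sketch.

```python
def pad_set(objects, M):
    """Pad a list of F-dimensional binary tuples to M objects of F+1 features.

    The extra flag feature is 0 for real objects and 1 for dummy objects.
    Dummy objects are made distinct by counting in binary over the F features.
    """
    F = len(objects[0])
    n_dummy = M - len(objects)
    if n_dummy < 0 or n_dummy > 2 ** F:
        raise ValueError("cannot pad: M is too small or too many dummies needed")
    real = [(0,) + tuple(o) for o in objects]
    dummy = [
        (1,) + tuple(int(bit) for bit in format(k, "0{}b".format(F)))
        for k in range(n_dummy)
    ]
    return real + dummy

# Two real objects with F = 5 features, padded to M = 4 objects.
padded = pad_set([(0, 1, 1, 0, 1), (1, 1, 0, 0, 0)], 4)
for obj in padded:
    print(obj)
```

Since the dummy vectors are distinct from each other and from all real vectors, the padded set still contains no duplicates.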

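Let ${\mathbf{v}}\in{\left\{0,1\right\}}^{F}$ be an $F$ -dimensional binary random variable, where $p_{\rm{data}}({\textnormal{v}}_{i}=1)=x_{i}$ , $p_{\rm{model}}({\textnormal{v}}_{i}=1)=y_{i}$ , and ${\bm{x}},{\bm{y}}\in[0,1]^{F}$ . For measuring the similarity between two $F$ -dimensional probability distributions $p_{\rm{data}}$ and $p_{\rm{model}}$ , the natural loss function would be the cross entropy or the negative log likelihood, denoted as $\mathrm{H}({\bm{x}},{\bm{y}})$ .

$$\mathrm{H}({\bm{x}},{\bm{y}})=\mathbb{E}_{{\mathbf{v}}\sim p_{\rm{data}}}\left\langle-\log p_{\rm{model}}({\mathbf{v}}={\bm{v}})\right\rangle=\mathbb{E}_{{\mathbf{v}}\sim p_{\rm{data}}}\left\langle-\log p_{\rm{model}}\left(\bigwedge_{i=1}^{F}{\textnormal{v}}_{i}=v_{i}\right)\right\rangle=\sum_{i=1}^{F}-x_{i}\log y_{i}-(1-x_{i})\log(1-y_{i}).$$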

However, applying it directly to the matrices ${\bm{X}},{\bm{Y}}\in[0,1]^{N\times F}$ by flattening them into vectors (e.g. $\text{flatten}({\bm{X}})\in[0,1]^{N\cdot F}$ ) results in an undesired outcome. The cross entropy loss has only a single global optimum because it does not consider the permutations between $N$ objects, e.g., for ${\bm{X}}=[{\bm{o}}_{1},{\bm{o}}_{2},{\bm{o}}_{3}]$ ( ${\bm{o}}_{i}\in[0,1]^{F}$ ), ${\bm{Y}}=[{\bm{o}}_{2},{\bm{o}}_{3},{\bm{o}}_{1}]$ is not a global minimum of $\mathrm{H}(\text{flatten}({\bm{X}}),\text{flatten}({\bm{Y}}))$ . A previous approach (Probst, 2018) tried to solve this problem by learning an additional permutation matrix that “fixes” the order, which essentially requires the network to memorize the ordering.
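The order sensitivity of the flattened cross entropy can be checked numerically. The sketch below implements the element-wise binary cross entropy $\mathrm{H}$ with natural logarithms and compares an output in the target order against a permuted output; the concrete vectors `o1`, `o2`, `o3` and the clipping constant `eps` are assumptions made for the example.

```python
import math

def bce(x, y, eps=1e-12):
    """H(x, y) = sum_f -x_f log y_f - (1 - x_f) log(1 - y_f), natural log."""
    total = 0.0
    for xf, yf in zip(x, y):
        yf = min(max(yf, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -xf * math.log(yf) - (1.0 - xf) * math.log(1.0 - yf)
    return total

def flatten(M):
    # Concatenate the rows of an N x F matrix into one vector of length N * F.
    return [v for row in M for v in row]

o1, o2, o3 = [0.9, 0.1], [0.1, 0.9], [0.9, 0.9]
X = [o1, o2, o3]
Y_same = [o1, o2, o3]   # same elements, same order
Y_perm = [o2, o3, o1]   # same elements as a set, different order

loss_same = bce(flatten(X), flatten(Y_same))
loss_perm = bce(flatten(X), flatten(Y_perm))
print(loss_same, loss_perm)
```

Although `Y_perm` represents the same set as `X`, its flattened cross entropy is strictly larger than that of `Y_same`, so a network trained with this loss is pushed toward one particular ordering.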

We take a different approach of directly fixing this loss function. The target objective is to maximize the probability of two sets being equal, thus ideally, at the global minima, two sets should be equal. The key to understanding our main contribution in the next section is that the above expansion implicitly assumes a certain definition of the equivalence between vectors:

$${\mathbf{v}}={\bm{v}}\iff\bigwedge_{i=1}^{F}{\textnormal{v}}_{i}=v_{i}$$

We argue that, in order to measure the similarity (e.g. likelihood, cross entropy) between sets, we similarly need to start from the very definition of the equivalence between sets.

Equivalence of two sets $S,T$ is defined as:

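$$S=T\Leftrightarrow(S\subseteq T\land S\supseteq T)\Leftrightarrow(\forall s\in S;s\in T)\land(\forall t\in T;t\in S).$$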

However, under the assumption that $|S|=|T|=N$ and $T$ contains $N$ distinct elements (no duplicates), $S\supseteq T$ is a sufficient condition for $S=T$ . (Proof: If $S\supseteq T$ and $S\not\subseteq T$ , there is some $s^{\prime}\in S$ such that $s^{\prime}\not\in T$ . Since $N$ distinct elements in $T$ are also included in $S$ , $s^{\prime}$ becomes $S$ ’s $N+1$ -th element, which contradicts $|S|=N$ . Note that this proof does not depend on the distinctness of $S$ ’s elements.) Under this condition, therefore,

$$\begin{aligned}S=T&\Leftrightarrow S\supseteq T\\&\Leftrightarrow\forall t\in T;\ t\in S\\&\Leftrightarrow\forall t\in T;\ \exists s\in S;\ s=t\\&\Leftrightarrow\bigwedge_{t\in T}\bigvee_{s\in S}s=t.\end{aligned}$$

We now translate this logical formula into the corresponding log likelihood. Assume an $N\times F$ dimensional, binary random variable ${\mathbf{V}}=[{\mathbf{v}}_{1},\ldots{\mathbf{v}}_{N}]$ , and its value ${\bm{V}}=[{\bm{v}}_{1}\ldots{\bm{v}}_{N}]\in{\left\{0,1\right\}}^{N\times F}$ . The target value we aim to compute is the following negative log likelihood

$$\mathbb{E}_{{\mathbf{V}}\sim p_{\rm{data}}}\left\langle-\log p_{\rm{model}}({\mathbf{V}}={\bm{V}})\right\rangle$$

Since $p_{\rm{data}}$ produces the dataset $\mathcal{X}$ , it satisfies $p_{\rm{data}}({\mathbf{V}}={\bm{V}})=0$ when ${\bm{V}}$ contains duplicates. This allows us to ignore those ${\bm{V}}$ s from the summation and also perform the following transformation:

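$$\begin{aligned}\log p_{\rm{model}}({\mathbf{V}}={\bm{V}})&=\sum_{i=1}^{N}\log p_{\rm{model}}\left(\bigvee_{j=1}^{N}{\mathbf{v}}_{j}={\bm{v}}_{i}\right)\\&\geq\sum_{i=1}^{N}\log\sum_{j=1}^{N}p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\\&=\sum_{i=1}^{N}\log\sum_{j=1}^{N}\exp\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\\&=\sum_{i=1}^{N}\operatorname{logsumexp}_{j=1}^{N}\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i}).\end{aligned}$$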

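The inequality comes from ignoring the possibility that two random vectors take the same value. The equality holds at the global minima. Next we compute the expectation $\mathbb{E}_{{\mathbf{V}}\sim p_{\rm{data}}}$ of this sum:

$$\begin{aligned}\mathbb{E}_{{\mathbf{V}}\sim p_{\rm{data}}}\left\langle-\log p_{\rm{model}}({\mathbf{V}}={\bm{V}})\right\rangle&=-\sum_{i}\mathbb{E}_{{\mathbf{V}}\sim p_{\rm{data}}}\left\langle\operatorname{logsumexp}_{j}\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\right\rangle\\&=-\sum_{i}\mathbb{E}_{{\mathbf{v}}_{i}\sim p_{\rm{data}}}\left\langle\operatorname{logsumexp}_{j}\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\right\rangle\\&\leq-\sum_{i}\operatorname{logsumexp}_{j}\mathbb{E}_{{\mathbf{v}}_{i}\sim p_{\rm{data}}}\left\langle\log p_{\rm{model}}({\mathbf{v}}_{j}={\bm{v}}_{i})\right\rangle\\&=-\sum_{i}\operatorname{logsumexp}_{j}(-\mathrm{H}({\bm{x}}_{i},{\bm{y}}_{j}))\overset{\mathrm{def}}{=}\mathrm{SH}({\bm{X}},{\bm{Y}}).\end{aligned}$$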

The inequality is due to Jensen’s inequality applied to logsumexp, which is convex. This gives the upper bound of the negative log likelihood between sets which we call the Set Cross Entropy.

Set Cross Entropy has the following characteristics. First, compared to the original cross entropy loss, which has a single global minimum, SCE has exponentially many global minima, because every permutation of a global minimum is also a global minimum.
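The definition $\mathrm{SH}({\bm{X}},{\bm{Y}})=-\sum_{i}\operatorname{logsumexp}_{j}(-\mathrm{H}({\bm{x}}_{i},{\bm{y}}_{j}))$ and its permutation invariance can be sketched as follows. This is a plain-Python illustration with natural logarithms, not the author's implementation; the helper names and the sample matrices are assumptions of the sketch.

```python
import itertools
import math

def bce(x, y, eps=1e-12):
    # Element-wise binary cross entropy H(x, y) between one target and one output.
    total = 0.0
    for xf, yf in zip(x, y):
        yf = min(max(yf, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -xf * math.log(yf) - (1.0 - xf) * math.log(1.0 - yf)
    return total

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) via the usual max-shift trick.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def set_cross_entropy(X, Y):
    # SH(X, Y) = -sum_i logsumexp_j(-H(x_i, y_j)).
    return -sum(logsumexp([-bce(x, y) for y in Y]) for x in X)

X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Y = [[0.1, 0.8, 0.1], [0.1, 0.2, 0.9], [0.7, 0.2, 0.1]]

# The loss for every ordering of the output rows.
losses = [set_cross_entropy(X, list(p)) for p in itertools.permutations(Y)]
print(min(losses), max(losses))
```

All orderings of the output rows give the same loss up to floating-point rounding, which is why every permutation of a minimizer is again a minimizer.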

Next, notice that logsumexp is a smooth upper approximation of the maximum; therefore the set average distance $A_{1\mathrm{H}}$ is an upper bound of $\mathrm{SH}$ (Footnote 2: The equality holds only for $N=1$ .):

[TABLE]

Intuitively, this is because the set average returns a value which does not account for the possibility that the current closest ${\bm{y}}=\operatorname*{arg\,min}_{\bm{y}}\mathrm{H}({\bm{x}},{\bm{y}})$ of ${\bm{x}}$ may not converge to the ${\bm{x}}$ in the future during the training.

We illustrate this by comparing two examples: Let ${\bm{X}}={\left\{[0,1],[0,0]\right\}}$ , ${\bm{Y}}_{1}={\left\{[0.1,0.5],[0.1,0.5]\right\}}$ and ${\bm{Y}}_{2}={\left\{[0.1,0.5],[0.9,0.5]\right\}}$ . The set cross entropy (Eq.9) reports a smaller loss for $\mathrm{SH}({\bm{X}},{\bm{Y}}_{1})=-\log 0.81\approx 0.09$ than for $\mathrm{SH}({\bm{X}},{\bm{Y}}_{2})=-\log 0.25\approx 0.60$ (the numerical values use base-10 logarithms). This is reasonable because the global minimum is attained when the first axis of both ${\bm{y}}$ s is 0: ${\bm{Y}}_{2}$ should be penalized more than ${\bm{Y}}_{1}$ for the $0.9$ in its second element. In contrast, $A_{1\mathrm{H}}({\bm{X}},{\bm{Y}})$ considers only the closest element ( $\operatorname*{arg\,min}_{{\bm{y}}\in{\bm{Y}}}\mathrm{H}({\bm{x}},{\bm{y}})=[0.1,0.5]$ ) for each ${\bm{x}}$ , and therefore returns the same loss $=-\log 0.2025\approx 0.69$ for both cases, ignoring $[0.9,0.5]$ completely. In fact, $A_{1\mathrm{H}}({\bm{X}},{\bm{Y}})$ has zero gradient at ${\bm{Y}}={\left\{[0,0.5],[y,0.5]\right\}}$ for any $y\in[0,1]$ .
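The numbers in this example can be reproduced with a few lines of Python. The sketch below assumes base-10 logarithms, which is what the quoted values 0.09, 0.60 and 0.69 correspond to, and it computes the closest-element loss as a sum over the elements of ${\bm{X}}$ (that is, without the $1/N$ factor), since that is the quantity matching $-\log 0.2025$. The function names are illustrative.

```python
import math

def likelihood(x, y):
    # p_model(v = x) for a binary target x under independent Bernoulli(y_f).
    p = 1.0
    for xf, yf in zip(x, y):
        p *= yf if xf == 1 else 1.0 - yf
    return p

def sh10(X, Y):
    # Set Cross Entropy with base-10 logs: -sum_i log10 sum_j p(x_i | y_j).
    return -sum(math.log10(sum(likelihood(x, y) for y in Y)) for x in X)

def closest_sum10(X, Y):
    # sum_i min_j H(x_i, y_j) with base-10 logs (set average without 1/N).
    return sum(min(-math.log10(likelihood(x, y)) for y in Y) for x in X)

X = [[0, 1], [0, 0]]
Y1 = [[0.1, 0.5], [0.1, 0.5]]
Y2 = [[0.1, 0.5], [0.9, 0.5]]

print(round(sh10(X, Y1), 2), round(sh10(X, Y2), 2))
print(round(closest_sum10(X, Y1), 2), round(closest_sum10(X, Y2), 2))
```

The Set Cross Entropy distinguishes the two outputs, while the closest-element loss is identical for both because the element $[0.9,0.5]$ is never the nearest one.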

Furthermore, the following inequality suggests that the traditional cross entropy between the matrices ${\bm{X}}$ and ${\bm{Y}}$ is an upper bound of the set average of the cross entropies, therefore is an even looser upper bound of the set cross entropy. Here, ${\bm{x}}_{i},{\bm{y}}_{i}$ are the $i$ -th element of the vector representation of ${\bm{X}}$ and ${\bm{Y}}$ , respectively:

[TABLE]

This gives a natural interpretation that ignoring the permutation reduces the cross entropy.

Finally, the directed Hausdorff distance based on cross entropy ( $\mathcal{H}_{1H}$ ) is also an upper bound of SCE and the set average:

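$$A_{1\mathrm{H}}({\bm{X}},{\bm{Y}})=\frac{1}{N}\sum_{i}\min_{j}\mathrm{H}({\bm{x}}_{i},{\bm{y}}_{j})\leq\max_{i}\min_{j}\mathrm{H}({\bm{x}}_{i},{\bm{y}}_{j})=\mathcal{H}_{1\mathrm{H}}({\bm{X}},{\bm{Y}}).$$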
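The ordering of these losses can be checked numerically on a random instance. The sketch below is only a sanity check under assumptions: it uses natural logarithms, a fixed random seed, and compares the Set Cross Entropy against the unnormalized sum of closest-element cross entropies ($N\cdot A_{1\mathrm{H}}$ under the $1/N$ normalization) and the index-matched cross entropy. It does not replace the derivation.

```python
import math
import random

def bce(x, y):
    # Binary cross entropy H(x, y); outputs y are kept away from 0 and 1 below.
    return sum(-xf * math.log(yf) - (1 - xf) * math.log(1 - yf)
               for xf, yf in zip(x, y))

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

rng = random.Random(0)
N, F = 4, 6
X = [[rng.randint(0, 1) for _ in range(F)] for _ in range(N)]
Y = [[rng.uniform(0.05, 0.95) for _ in range(F)] for _ in range(N)]

# Pairwise cross entropies H[i][j] = H(x_i, y_j).
H = [[bce(x, y) for y in Y] for x in X]

sh = -sum(logsumexp([-h for h in row]) for row in H)   # Set Cross Entropy
sum_min = sum(min(row) for row in H)                   # N * A_1H
flat_ce = sum(H[i][i] for i in range(N))               # index-matched cross entropy
set_avg = sum_min / N                                  # A_1H
hausdorff = max(min(row) for row in H)                 # directed Hausdorff

print(sh, sum_min, flat_ce)
print(set_avg, hausdorff)
```

The first printed line is non-decreasing from left to right, and the set average never exceeds the directed Hausdorff value, in line with the inequalities in this section.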

4 Evaluation

4.1 Object Set Reconstruction

The purpose of the task is to obtain the latent representation of a set of objects and reconstruct them, where each object is represented as a feature vector. We prepared two datasets originating from classical AI domains: Sliding tile puzzle (8-puzzle) and Blocksworld.

Learning to reason about the object-based set representation of the environment is crucial in robotic systems that continuously receive the list of visible objects from the visual perception module (e.g. Redmon et al. (2016, YOLO)). In real-world systems, appropriate handling of the set is necessary because it is unnatural to assume that the objects in the environment are always reported in the same order. In particular, the objects in the same environment state may be reported in various orders if multiple such modules are running in parallel in an asynchronous manner.

In this experiment, we show that the permutation invariant loss function like SCE is necessary for learning to reconstruct a set in such a scenario. In this setting, a network is required to reconstruct a set from a single latent representation, while the objects as the target output may be randomly reordered each time the same set is observed and presented to the neural network.

8 Puzzle

Each feature vector as an object consists of 15 features, 9 of which represent the tile number (object ID) and the remaining 6 represent the coordinates. Each data point has 9 such vectors, corresponding to the 9 objects in a single tile configuration. The entire state space of the puzzle consists of 362880 states. We generated 5000 states and used 4500 of them as the training set.
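One way to picture this object encoding is sketched below. The text only states that 9 features encode the tile number and 6 encode the coordinates; the split of the 6 coordinate features into two one-hot vectors of length 3 (column and row), the use of tile number 0 for the blank, and the function names are assumptions of this sketch.

```python
import math

def one_hot(index, size):
    return [1 if k == index else 0 for k in range(size)]

def encode_state(tiles):
    """Encode a 3x3 puzzle state as 9 object vectors of 15 binary features.

    tiles[pos] is the tile number (0..8) at row-major board position pos.
    """
    objects = []
    for pos, tile in enumerate(tiles):
        row, col = divmod(pos, 3)
        objects.append(one_hot(tile, 9) + one_hot(col, 3) + one_hot(row, 3))
    return objects

state = [1, 2, 3, 4, 5, 6, 7, 8, 0]
objects = encode_state(state)
print(len(objects), len(objects[0]))

# The 362880 states quoted above are the 9! arrangements of 9 tiles.
print(math.factorial(9))
```

Each state is thus a $9\times 15$ binary matrix whose rows can be listed in any order without changing the configuration it describes.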

We prepared an autoencoder with the permutation invariant layers (Zaheer et al., 2017) as the encoder and the fully-connected layers as the decoder. Since it uses a permutation-invariant encoder, the latent space is already guaranteed to learn a representation that is invariant to the input ordering. The key question here is then whether they can be robustly trained against the random permutations in the training examples for the output.

We tested the reconstruction ability in four scenarios: (1) In the first scenario, the dataset is provided in a standard manner. (2) In the second scenario, we augment the input dataset by repeating the elements 5 times and randomly reordering the object vectors in each set. The randomized dataset is used as the input to the network, while the target output is still the original dataset (repeated 5 times, without reordering). The purpose of this experiment is to verify the claim of the Deep Set (Zaheer et al., 2017) that it is able to handle the input in a permutation invariant manner. In order to compensate for the difference in data size, the maximum number of training epochs is reduced to 1/5 of that in the first scenario. (3) In the third scenario, we apply a similar operation to the target output of the network. Essentially we always feed the input in the same fixed order while forcing the network to learn from the randomized target output. Each time the same data is presented, the target output has a different ordering while the input has the fixed ordering. Therefore, the training should be performed in such a way that the ordering in the output is properly ignored. (4) Finally, in the fourth scenario, the ordering of both the input and the output is randomized.
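The four input/target ordering scenarios can be sketched as a small data-preparation routine. The function `make_scenario`, its arguments and the toy dataset are hypothetical and only illustrate the repetition and shuffling described above.

```python
import random

def make_scenario(dataset, shuffle_input, shuffle_target, repeats=5, seed=0):
    """Build (input, target) pairs for one ordering scenario.

    dataset: list of sets, each given as a list of object vectors.
    When either side is shuffled, the dataset is repeated `repeats` times
    so the same set is seen in several different orders.
    """
    rng = random.Random(seed)
    n_rep = repeats if (shuffle_input or shuffle_target) else 1
    pairs = []
    for _ in range(n_rep):
        for objects in dataset:
            inp, tgt = list(objects), list(objects)
            if shuffle_input:
                rng.shuffle(inp)
            if shuffle_target:
                rng.shuffle(tgt)
            pairs.append((inp, tgt))
    return pairs

# Toy dataset: two sets of three objects each.
dataset = [[(1, 0), (0, 1), (1, 1)], [(0, 0), (1, 0), (0, 1)]]

scenario1 = make_scenario(dataset, shuffle_input=False, shuffle_target=False)
scenario2 = make_scenario(dataset, shuffle_input=True, shuffle_target=False)
scenario3 = make_scenario(dataset, shuffle_input=False, shuffle_target=True)
scenario4 = make_scenario(dataset, shuffle_input=True, shuffle_target=True)
print(len(scenario1), len(scenario2), len(scenario3), len(scenario4))
```

In scenarios (3) and (4) the target rows arrive in a different order each time, so only a loss that ignores the output ordering sees a consistent training signal.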

We trained the same network with four different loss functions, (a) the traditional cross entropy $\mathrm{H}$ , (b) Set Cross Entropy $\mathrm{SH}$ , (c) directed set average of the cross entropy $A_{1\mathrm{H}}$ and (d) the directed Hausdorff measure of the cross entropy $\mathcal{H}_{1\mathrm{H}}$ , resulting in 16 training scenarios in total. We performed the same experiment 10 times and took the statistics. The purpose of this is to address the potential concern about the stability of the training. We kept the same set of training/testing data, and the only difference between the runs is the random seed.

We measured the rate of successful reconstruction over the entire dataset in the above 16 scenarios. The “successful reconstruction” is defined as follows: Recall that every data point is a discrete binary vector in the 8-Puzzle dataset while the output of the network is a continuous $N\times F$ matrix of reals between 0 and 1. Therefore, we round the output of the network to $0/1$ and directly compare the result with the input. If every object vector in a set is matched by some output object vector, then it is counted as a success.
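The success criterion can be sketched as follows. Rounding at a threshold of 0.5 and the function name `is_successful` are assumptions of this sketch; the text only states that outputs are rounded to $0/1$.

```python
def is_successful(target, output, threshold=0.5):
    """Return True if every target object vector appears among the rounded outputs.

    target: list of binary object vectors.
    output: list of real-valued object vectors in [0, 1], in any order.
    """
    rounded = {tuple(1 if v >= threshold else 0 for v in row) for row in output}
    return all(tuple(x) in rounded for x in target)

target = [(1, 0), (0, 1)]
good_output = [[0.2, 0.9], [0.8, 0.1]]   # both objects present, swapped order
bad_output = [[0.8, 0.1], [0.9, 0.2]]    # the object (0, 1) is missing

print(is_successful(target, good_output))
print(is_successful(target, bad_output))
```

The check is itself order-independent: an output that lists the correct objects in a different order still counts as a success.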

The training with the standard cross entropy loss ( $\mathrm{H}$ ) succeeds in cases (1,2) while it fails in cases (3,4). The case (2) reproduces the claim in (Zaheer et al., 2017) that it encodes the input in a permutation-invariant manner, while it fails in the latter cases because the training is not permutation-invariant with regard to the output. In contrast, the training with the Set Cross Entropy loss succeeds in all cases. This shows that a permutation-invariant loss function is necessary for training a network with a dataset consisting of sets. In this dataset, the training with set average distance $A_{1\mathrm{H}}$ also succeeded because it is a looser but still effective upper bound of the negative log likelihood. In contrast, the training with Hausdorff distance failed to learn the representation at all.

Next, to address the claim that the network is able to learn from the dataset with the variable set size, we performed an experiment which applies the dummy-vector scheme (Sec. 3). In this experiment, we modified the dataset to model such a scenario by randomly dropping one to five elements out of 9 elements. The maximum number of elements is 9. The dropping scheme is specified as follows: Out of the 5000 states generated in total (including the training / testing dataset), approximately half of the states have 9 tiles, $1/4$ of the states have 8 tiles, … and $1/2^{5}$ of the states have 5 tiles. The elements to drop are selected randomly.

The results in Table 2 show that the training with our proposed $\mathrm{SH}$ loss function achieves the best success ratio for the reconstruction when the output order is randomized. The reconstruction includes the dummy vectors, indicating that the network is able to represent not only the elements in the set but also the number of the missing elements.

Blocksworld

In order to test the reconstruction ability for the more complex feature vectors, we prepared a photo-realistic Blocksworld dataset (Fig. 2) which contains the blocks world states rendered by Blender 3D engine. There are several cylinders or cubes of various colors and sizes and two surface materials (Metal/Rubber) stacked on the floor, just like in the usual STRIPS (McDermott, 2000) Blocksworld domain. In this domain, three actions are performed: move a block onto another stack or on the floor, and polish/unpolish a block i.e. change the surface of a block from Metal to Rubber or vice versa. All actions are applicable only when the block is on top of a stack or on the floor. The latter actions allow changes in the non-coordinate features of the object vectors.

The dataset generator produces a 300x200 RGB image and a state description which contains the bounding boxes (bbox) of the objects. Extracting these bboxes is an object recognition task we do not address in this paper, and ideally should be performed by a system like YOLO (Redmon et al., 2016). We resized the extracted image patches in the bboxes to 32x32 RGB, compressed each of them into a feature vector of 1024 dimensions with a convolutional autoencoder, then concatenated it with the bbox $(x_{1},y_{1},x_{2},y_{2})$ , which is discretized in steps of 5 pixels and encoded as 1-hot vectors (60/40 categories for $x$ / $y$ -axes), resulting in 1224 features per object. The generator is able to enumerate all possible states (80640 states for 5 blocks and 3 stacks). We used 2250 states as the training set and 250 states as the test set.
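The feature count follows from the discretization: a 300x200 image in 5-pixel steps gives 60 bins along $x$ and 40 along $y$, so the four bounding-box coordinates take $2\cdot 60+2\cdot 40=200$ one-hot features, and $1024+200=1224$. The sketch below illustrates one possible encoding; the concatenation order of the four one-hot vectors and the function name are assumptions.

```python
def encode_bbox(x1, y1, x2, y2, width=300, height=200, step=5):
    """One-hot encode a bounding box discretized in `step`-pixel bins."""
    nx, ny = width // step, height // step   # 60 and 40 categories

    def one_hot(value, size):
        index = min(int(value) // step, size - 1)  # clamp the far edge into the last bin
        return [1 if k == index else 0 for k in range(size)]

    return one_hot(x1, nx) + one_hot(y1, ny) + one_hot(x2, nx) + one_hot(y2, ny)

bbox_features = encode_bbox(10, 20, 50, 80)
patch_features = 1024                      # latent size of the image-patch autoencoder
total_features = patch_features + len(bbox_features)
print(len(bbox_features), total_features)
```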

We also verified the results qualitatively. Some reconstruction results are visualized in Fig. 3. These visualizations are generated by pasting the image patches decoded from the first 1024 axes of the reconstructed 1224-D feature vectors in a position specified by the reconstructed bounding box in the last 200 axes.

In Table 3, we measured the difference between the visualization results of the input and the output (as shown in Fig. 3) by the Root Mean Squared Error of the pixel values averaged over RGB, pixels and the dataset. Each pixel is represented in the $[0,1]$ range (closed set of reals between 0 and 1), thus a value of 0.1 in the table means that the pixels differ by 0.1 on average.
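One possible reading of this metric is sketched below, assuming that the squared error is averaged jointly over images, pixels and RGB channels before taking the square root. The function name and array layout are hypothetical; the paper does not specify whether the square root is taken per image or once over the whole dataset.

```python
import numpy as np

def rmse(inputs, outputs):
    """inputs, outputs: arrays of shape (num_images, height, width, 3) with values in [0, 1]."""
    diff = np.asarray(inputs, dtype=np.float64) - np.asarray(outputs, dtype=np.float64)
    # Mean over the dataset, pixels and RGB channels, then the square root.
    return float(np.sqrt(np.mean(diff ** 2)))

# Two synthetic 300x200 images whose pixels all differ by exactly 0.1.
a = np.zeros((2, 200, 300, 3))
b = np.full((2, 200, 300, 3), 0.1)
error = rmse(a, b)
print(round(error, 6))
```

When every pixel differs by the same amount, as in this synthetic example, the RMSE equals that amount, which matches the interpretation given for the table entries.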

4.2 Rule Learning ILP Tasks

The purpose of this task is to learn to generate the prerequisites (body) of first-order-logic Horn clauses from the head of the clause. Unlike the previous tasks, this task is not an autoencoding task. The bodies are considered as a set because the order of the terms inside a body does not matter for the clause to be satisfied.

The main purpose of this experiment is to show the effectiveness of our approach on set generation, not to demonstrate a more general neural theorem proving system. An interesting avenue of future work is to see how our approach can help the existing work on neural theorem proving.

We used a Countries dataset (Bouchard et al., 2015) that contains 163 countries and trained the models for $n$ -hop neighbor relations. For example, for $n=2$ , given a head $\textsf{neighbor2}(\textsf{austria},\textsf{germany},\textsf{belgium})$ as an input, the task is to predict the body ${\left\{\textsf{neighborOf}(\textsf{austria},\textsf{germany}),\textsf{neighborOf}(\textsf{germany},\textsf{belgium})\right\}}$ , which is a set of two terms. This is a weaker form of a more general backward chaining used in Neural Theorem Proving (Rocktäschel & Riedel, 2017) because the output does not contain free variables.

In the $n$ -neighbor scenario, the input is a $2+163(n+1)$ -dimensional vector, which consists of a one-hot label of 2 categories for the predicate of the head, and $n+1$ one-hot labels of 163 categories for the arguments of the head. For example, a head $\textsf{neighbor2}(\textsf{austria},\textsf{germany},\textsf{belgium})$ spends 2 dimensions for identifying the predicate neighbor2, and three 1-hot vectors of 163 categories for representing austria,germany,belgium. The output is a $n\times 328$ matrix, where each row represents a binary predicate $(328=2+2\cdot 163)$ such as ${\left\{\textsf{neighborOf}(\textsf{austria},\textsf{germany}),\textsf{neighborOf}(\textsf{germany},\textsf{belgium})\right\}}$ for $n=2$ . Each element uses 2 dimensions for identifying the predicate head neighborOf and two 1-hot vectors of 163 categories for the arguments (e.g. austria and germany).
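The dimensions above can be checked with a small encoding sketch. The country and predicate indices below are hypothetical placeholders (the actual index assignment is not given in the text), and the helper names are introduced only for illustration.

```python
import numpy as np

NUM_COUNTRIES = 163
NUM_PREDICATES = 2

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def encode_term(predicate_id, country_ids):
    """Concatenate a predicate one-hot with one country one-hot per argument."""
    parts = [one_hot(predicate_id, NUM_PREDICATES)]
    parts += [one_hot(c, NUM_COUNTRIES) for c in country_ids]
    return np.concatenate(parts)

# Hypothetical indices, chosen only for this example.
AUSTRIA, GERMANY, BELGIUM = 0, 1, 2
NEIGHBOR_N, NEIGHBOR_OF = 0, 1

n = 2
# Head: neighbor2(austria, germany, belgium)
head = encode_term(NEIGHBOR_N, [AUSTRIA, GERMANY, BELGIUM])
# Body: {neighborOf(austria, germany), neighborOf(germany, belgium)}
body = np.stack([
    encode_term(NEIGHBOR_OF, [AUSTRIA, GERMANY]),
    encode_term(NEIGHBOR_OF, [GERMANY, BELGIUM]),
])
print(head.shape, body.shape)  # (491,) (2, 328)
```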

We trained the network with the neighbor- $n$ datasets ranging from $n=2$ to $n=5$ (see the result table for the detailed domain characteristics). During the inference, the index of the largest output of the softmax is treated as the answer. We counted the ratio of the clauses across the test set where every body term matches against one of the output terms. The output data (body terms) may have an arbitrary ordering, and we have another variant similar to the previous experiment: In the randomized body order dataset, the dataset is repeated 5 times, while the ordering of the terms inside each body is randomly shuffled.
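The evaluation described above can be sketched as follows, assuming that each 328-dimensional row is laid out as predicate (2), first argument (163), second argument (163) and that a term is decoded by taking the argmax within each segment. The function names and the example rows are hypothetical.

```python
import numpy as np

# Assumed row layout: predicate, first argument, second argument.
SEGMENTS = [(0, 2), (2, 165), (165, 328)]

def decode_row(row):
    """Index of the largest value in each segment, treated as the answer."""
    return tuple(int(np.argmax(row[a:b])) for a, b in SEGMENTS)

def clause_matches(target, output):
    """True if every target body term equals one of the decoded output terms."""
    predicted = {decode_row(r) for r in output}
    return all(decode_row(t) in predicted for t in target)

def success_ratio(targets, outputs):
    hits = [clause_matches(t, o) for t, o in zip(targets, outputs)]
    return sum(hits) / len(hits)

def make_row(pred, arg1, arg2):
    row = np.zeros(328)
    row[pred] = 1.0
    row[2 + arg1] = 1.0
    row[165 + arg2] = 1.0
    return row

target = np.stack([make_row(1, 0, 1), make_row(1, 1, 2)])
swapped = target[::-1]                                     # same set, different order
wrong = np.stack([make_row(1, 0, 1), make_row(1, 0, 1)])   # second term is missing
ratio = success_ratio([target, target], [swapped, wrong])
print(ratio)  # 0.5
```

Because the comparison is made against the set of decoded output terms, an output that lists the correct terms in a different order still counts as a success.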

Table 4 shows that the network with Set Cross Entropy achieved the best accuracy, set average generally came second, and the other losses struggled. This trend was observed in both the randomized and the fixed-ordering datasets. This shows that the Set Cross Entropy relaxes the search space by adding more global minima, making the training easier.

5 Discussion

Vinyals et al. (2016) repeatedly emphasized the advantage of limiting the possible equivalence classes of the outputs by engineering the training data for solving combinatorial problems. For example, they pre-sorted the training examples for the Delaunay triangulation (set of triangles) by lexicographical order and trained an LSTM model with the standard cross entropy (Vinyals et al., 2015). However, this is an ad-hoc method that depends on particular domain knowledge and, as we have shown, the difficulty of learning such an output was caused by the loss function that considers the ordering. Moreover, we showed that the standard cross entropy and the set average metrics are looser upper bounds of the proposed Set Cross Entropy, and also that Set Cross Entropy empirically outperforms the standard cross entropy in the theory learning task, even if a specific ordering is imposed on the output.

One limitation of the current approach is that the set cross entropy contains a double loop and therefore takes $O(N^{2})$ runtime for a set of $N$ objects. However, unlike the algorithm proposed in Probst (2018), which relies on the sequential Gale-Shapley algorithm and also takes $O(N^{2})$ runtime, our loss function can be efficiently implemented on GPUs because it consists of a simple combination of logsumexp and summation. Still, improving the runtime complexity is an important direction for future work because the other set reconstruction tasks, including 3D point clouds datasets like Shapenet (Chang et al., 2015), may contain a much larger number of elements in each set.
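To illustrate why the double loop vectorizes well, the following is a minimal NumPy sketch. It assumes the loss has the form $-\sum_{x\in X}\operatorname{logsumexp}_{y\in Y}(-H(x,y))$, where $H(x,y)$ is the cross entropy between a target element $x$ and a predicted element $y$; constant terms or normalization in the paper's definition may differ. The function names are hypothetical.

```python
import numpy as np

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along the given axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def pairwise_cross_entropy(X, Y, eps=1e-12):
    """H[i, j] = -sum_f X[i, f] * log Y[j, f]; X: (N, F) targets, Y: (N, F) predictions."""
    return -X @ np.log(Y + eps).T

def set_cross_entropy(X, Y):
    # Assumed form of the loss; the O(N^2) double loop becomes one (N, N) matrix.
    H = pairwise_cross_entropy(X, Y)
    return float(-np.sum(logsumexp(-H, axis=1)))

rng = np.random.default_rng(0)
X = np.eye(4)[rng.integers(0, 4, size=5)]   # 5 one-hot target elements
Y = rng.dirichlet(np.ones(4), size=5)       # 5 predicted categorical distributions
perm = rng.permutation(5)

loss = set_cross_entropy(X, Y)
loss_permuted = set_cross_entropy(X, Y[perm])
standard_ce = float(np.trace(pairwise_cross_entropy(X, Y)))
print(loss, loss_permuted, standard_ce)
```

Under this assumed form, reordering the predicted elements leaves the loss unchanged, and the loss never exceeds the standard (order-sensitive) cross entropy, which is the sum of the diagonal of the pairwise matrix.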

Another direction for future work is to use the Long Short Term Memory (Hochreiter & Schmidhuber, 1997) for handling sets without imposing a shared upper bound on the number of elements in a set, which has already been explored in the literature (Vinyals et al., 2015; 2016). Since our approach is agnostic to the type of the neural network, such architectures are orthogonal to our approach.

6 Conclusion

In this paper, we proposed Set Cross Entropy, a measure that models the likelihood between sets of probability distributions. When the output of the neural network model can be naturally regarded as a set, Set Cross Entropy is able to relax the search space by making the permutations of a global minimum also global minima, which makes the training easier. This is in contrast to the existing approaches that try to correct the ordering of the output by learning a permutation matrix, or ad-hoc methods that reorder the dataset using domain-specific expert knowledge.

Training based on the Set Cross Entropy is robust against a dynamic dataset which contains vectors whose internal ordering may change from time to time in an arbitrary manner. It can also handle sets of variable sizes by inserting dummy vectors.

We demonstrated the effectiveness of the approach by comparing Set Cross Entropy against the normal cross entropy, as well as the other set-based metrics such as Hausdorff distance or Set Average (Chamfer) distance. Set Cross Entropy empirically outperformed all of these variants, and we also provided a proof that all of these variants are looser upper bounds of the proposed Set Cross Entropy.

Appendix

6.1 Network Model for 8 Puzzle

As mentioned in the earlier sections, the network has a permutation-invariant encoder and a fully-connected decoder.

The input to the network is a $9\times 15$ matrix, where the first dimension represents the objects and the second dimension represents the features of each object. The encoder has two 1D convolution layers of 1000 neurons with filter size 1, modeling the element-wise network $\rho$ . The output of these layers is then aggregated by taking the sum over the first dimension. The result is then fed to two further fully-connected layers of width 1000, which map to the latent layer of 100 neurons. All encoder layers are activated by ReLU. The latent representation is regularized and activated by Gumbel-Softmax (Maddison et al., 2017; Jang et al., 2017) as the input is a categorical model.
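The permutation invariance of this encoder can be illustrated with a forward-pass sketch in NumPy. The weights below are random and untrained, the Gumbel-Softmax activation of the latent layer is omitted, and all names are hypothetical; a 1D convolution with filter size 1 is written as the same dense layer applied to every object.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    # Random, untrained weights, for illustration only.
    return rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim)), np.zeros(out_dim)

def relu(a):
    return np.maximum(a, 0.0)

ELEMENTWISE = [dense(15, 1000), dense(1000, 1000)]    # filter-size-1 convolutions (rho)
AGGREGATED = [dense(1000, 1000), dense(1000, 1000)]   # fully-connected layers after the sum
TO_LATENT = dense(1000, 100)

def encode(objects):
    """objects: (9, 15) matrix; returns 100 latent pre-activations."""
    h = objects
    for W, b in ELEMENTWISE:
        h = relu(h @ W + b)        # applied independently to each object
    h = h.sum(axis=0)              # order-independent aggregation over objects
    for W, b in AGGREGATED:
        h = relu(h @ W + b)
    W, b = TO_LATENT
    return h @ W + b               # Gumbel-Softmax omitted in this sketch

objects = rng.integers(0, 2, size=(9, 15)).astype(float)
shuffled = objects[rng.permutation(9)]
z, z_shuffled = encode(objects), encode(shuffled)
print(np.allclose(z, z_shuffled))
```

Since the only interaction between objects is the sum, shuffling the rows of the input changes the latent output only up to floating-point rounding.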

The decoder consists of three fully-connected layers with dropout and batch normalization as shown below:

[TABLE]

The last layer is then split into $9\times 9$ , $9\times 3$ , $9\times 3$ matrices and separately activated by softmax, reflecting the input dataset (Fig. 1).
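The split-and-activate step can be sketched as below. The flat layout of the last layer (81, then 27, then 27 values) and the function names are assumptions made for this example.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - np.max(a, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def activate_output(last_layer):
    """last_layer: flat vector of 9*9 + 9*3 + 9*3 = 135 pre-activations."""
    # Assumed layout: the three groups are stored one after another.
    a, b, c = np.split(last_layer, [81, 108])
    parts = [
        softmax(a.reshape(9, 9)),   # one 9-way categorical per object
        softmax(b.reshape(9, 3)),   # one 3-way categorical per object
        softmax(c.reshape(9, 3)),   # one 3-way categorical per object
    ]
    return np.concatenate(parts, axis=1)   # (9, 15), matching the input shape

out = activate_output(np.random.default_rng(0).normal(size=135))
print(out.shape)  # (9, 15)
```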

6.2 Network Model for Blocksworld

The same network as for the 8-puzzle was used, except that the input and the output are $5\times 1224$ matrices. The activations of the last layer are different: The first 1024 features are activated by a sigmoid function, while the remaining 200 features are divided into 40, 60, 40, 60 dimensions (for the one-hot bounding box information) and are separately activated by softmax.

6.3 Feature Extraction for Blocksworld

The 32x32 RGB image patches in the Blocksworld states are compressed into the feature vectors that are later used as the input. The image features are learned by a convolutional autoencoder depicted in Fig. 4 and Fig. 5.

6.4 Network Model for Rule Learning

We removed the encoder and the latent layer from the above models, and connected the input directly to the decoder. While the decoder has the same types of layers, the width is reduced to 400. In the $n$ -neighbor scenario, the input is a $2+163(n+1)$ -dimensional vector, which consists of a one-hot label of 2 categories for the predicate of the head, and $n+1$ one-hot labels of 163 categories for the arguments of the head. The 163 categories correspond to the number of countries in the Countries dataset (Bouchard et al., 2015). The output is a $[n,328]$ matrix, where each row represents a binary predicate $(328=2+2\cdot 163)$ .

6.5 Example Application of the Permutation-Invariant Representation & Reconstruction

To address the practical utility of “set reconstruction” or “set autoencoding”, we added a new experiment. We modified the Latplan (Asai & Fukunaga, 2018) neural-symbolic classical planning system, which operates on a discrete symbolic latent space of the real-valued inputs and runs Dijkstra’s/A* search using a state-of-the-art symbolic classical planning solver, so that it takes a set of object feature vectors as input rather than images. Latplan is a high-level task planner (unlike motion planning / actuator control) with implications for robotic systems that perceive a set of inputs already preprocessed by an external system. For example, the image input is first fed into an object recognition system (e.g., YOLO (Redmon et al., 2016)), and the planner receives a set of feature vectors extracted from the image patches segmented from the raw image, rather than receiving the image input directly.

The Latplan system learns the binary latent space of an arbitrary raw input (e.g., images) with a Gumbel-Softmax variational autoencoder, learns a discrete state space from the transition examples, and runs a symbolic, systematic search algorithm such as Dijkstra's or A* search, which guarantees the optimality of the solution. Unlike RL-based planning systems, the search agent does not contain any learning component. The discrete plan in the latent space is mapped back to a raw image visualization of the plan execution, which requires the reconstruction capability of the (V)AE. A similar system replacing the Gumbel-Softmax VAE with Causal InfoGAN was later proposed (Kurutach et al., 2018).

We replaced Latplan’s Gumbel-Softmax VAE with our autoencoder used in the 8-Puzzle and the Blocksworld experiments (Appendix, Sec. 6.1, Sec. 6.2). Our autoencoder also uses Gumbel-Softmax in the latent layer, but it uses the encoder of Zaheer et al. (2017) and is trained with Set Cross Entropy.

Once the network has learned the representation, the planner is guaranteed to find a solution because the search algorithm (e.g., Dijkstra) is a complete, systematic, symbolic search algorithm, which is guaranteed to find a solution whenever one is reachable in the state space. If the network cannot learn the permutation-invariant representation, the system cannot solve the problem and/or return a human-comprehensible visualization. This makes the specific permutation-invariant representation using the encoder of Zaheer et al. (2017) and the proposed Set Cross Entropy necessary when the input is given as a set of feature vectors.

8 Puzzle

First, we performed a training on a dataset in which the object vector ordering is randomized. The autoencoder compresses the $15\times 9=135$ -bit binary representation (object vectors) into a permutation-invariant 100-bit discrete latent binary representation. We provided 5000 states for training the autoencoder, while the search space consists of 362880 ( $=9!$ ) states and 967680 transitions.

Note that each state has $9!$ variations due to the permutations in the way the tiles and the locations are reported. This also increases the number of transitions quadratically ( $(9!)^{2}$ ).

We generated 40 problem instances of the 8-puzzle, each by a random walk from the goal state: 20 instances were generated by a 7-step random walk and another 20 by a 14-step random walk. We solved the 40 instances using the Fast Downward classical planner (Helmert, 2004) with a blind heuristic in order to remove the effect of heuristics.

We compared the number of problems successfully solved by two variations of Latplan where each uses the autoencoder trained with Set Average and Set Cross Entropy, respectively, for encoding the input into binary latent space. Both versions managed to solve all instances because both Set Average and Set Cross Entropy managed to train the AE from 5000 examples with sufficient accuracy. All solutions were correct (checked manually). Since the search algorithm being used is optimal, the quality of the solution was also identical.

Blocksworld

We solved 30 planning instances in a 4-blocks, 3-stacks environment. The search space consists of 5760 states and 34560 transitions, and each state has $4!$ variations due to permutations of 4 blocks. We provided 1000 randomly selected states for training the autoencoder. The instances are generated by taking a random initial state and choosing a goal state by a random walk of 3, 7, or 14 steps (10 instances each). We used the same planner configuration for all instances. The correctness of the produced plans is checked manually.

We compared the number of problems successfully solved by two variations of Latplan using the autoencoder trained with Set Average and Set Cross Entropy, respectively. For the total of 30 instances, both Latplan+Set Avg and Latplan+SCE returned plans; however, the plans returned by Latplan+Set Avg were correct in 11 instances, while those returned by Latplan+SCE were correct in 14 instances (details in Table 5).

As the autoencoder trained with Set Average had a more substantial reconstruction error, it sometimes failed to capture the essential features of the input, causing the system to return an invalid plan. The common errors were changing the surface of blocks or swapping blocks without the proper actions, e.g., moving more than two blocks, or moving a block and polishing another block in a single time step.

Bibliography


1. Adams & Zemel (2011) Ryan Prescott Adams and Richard S Zemel. Ranking via Sinkhorn Propagation. arXiv preprint arXiv:1106.1925, 2011.
2. Asai & Fukunaga (2018) Masataro Asai and Alex Fukunaga. Classical Planning in Deep Latent Space: Bridging the Subsymbolic-Symbolic Boundary. In Proc. of AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16302.
3. Bouchard et al. (2015) Guillaume Bouchard, Sameer Singh, and Theo Trouillon. On Approximate Reasoning Capabilities of Low-Rank Vector Spaces. AAAI Spring Symposium on Knowledge Representation and Reasoning (KRR): Integrating Symbolic and Neural Approaches, 2015.
4. Chang et al. (2015) Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015.
5. Dubuisson & Jain (1994) M-P Dubuisson and Anil K Jain. A Modified Hausdorff Distance for Object Matching. In Proceedings of 12th International Conference on Pattern Recognition, pp. 566–568. IEEE, 1994.
6. Guttenberg et al. (2016) Nicholas Guttenberg, Nathaniel Virgo, Olaf Witkowski, Hidetoshi Aoki, and Ryota Kanai. Permutation-Equivariant Neural Networks Applied to Dynamics Prediction. arXiv preprint arXiv:1612.04530, 2016.
7. Helmert (2004) Malte Helmert. A Planning Heuristic Based on Causal Graph Analysis. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), pp. 161–170, 2004. URL http://www.aaai.org/Library/ICAPS/2004/icaps04-021.php.
8. Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.