Learning by stochastic serializations

Pablo Strasser; Stephane Armand; Stephane Marchand-Maillet; Alexandros Kalousis

arXiv:1905.11245·cs.LG·May 28, 2019

TL;DR

This paper introduces a generic learning framework that maps complex structures to serializations, enabling the use of sequence-based density estimators, with sampling methods that preserve structural properties and improve learning efficiency.

Contribution

It proposes a novel serialization approach for complex structures, allowing generic sequence-based learning applicable across various structures, with effective sampling to capture structural statistics.

Findings

1. Competitive or superior to specialized algorithms

2. Provides protection from overfitting through sampling

3. Effective sampling preserves structural statistics

Tables (11)

Table 1: Test accuracy on the set problems.

Algorithm	ModelNet40	ModelNet10
S-RNN, $T = 500$, $\lambda = 1$	82%	87%
S-RNN, $T = 5000$, $\lambda = 0$	82%	87%
S-RNN Curriculum	81%	87%
Deep-Set, $T = 500$	82%	-
Deep-Set, $T = 5000$	90%	-
RotationNet	97%	98%

Table 2: Predictive performance on QM9 dataset, Pearson correlation coefficient with the target.

Algorithm	mu	alpha	HOMO	LUMO	gap	R2	ZPVE	Cv	U0	u298	h298	g298
S-RNN Non-Canonical SMILES	0.75	0.993	0.93	0.985	0.97	0.970	1.000	0.995	1.000	1.000	1.000	1.000
S-RNN Canonical SMILES	0.64	0.994	0.90	0.980	0.95	0.968	1.000	0.994	1.000	1.000	1.000	1.000
Best Baseline w/o 3D coord	0.76	0.963	0.91	0.977	0.96	0.977	0.988	0.978	0.994	0.994	0.994	0.994
Best Baseline w 3D coord	0.97	0.993	0.96	0.988	0.98	0.998	0.999	0.997	0.998	0.998	0.998	0.998

Table 3: Test set negative log likelihood on the unconditional (left) and conditional (right) gait generation. Bold indicates performances that are significantly better than the non-bold.

# of Angles (k)	S-RNN	RNN
2	-6.8	7.3
4	$\mathbf{-66}$	-14
8	$\mathbf{-32}$	83.3

Table 4: Performance results for the conditional tree generation task.

Algorithm	Node F1	Node precision	Node Recall	Edge F1	Edge Precision	Edge Recall
S-RNN	88.95%	87.82%	90.79%	83.43%	82.22%	85.47%
DRNN	74.51%	59.37%	100%	65.86%	49.10%	100%

Table 5: Average predictive error on the tree regression task.

Algorithm	Error
S-RNN ordered trees	7.18 °C
S-RNN unordered trees	6.15 °C
TreeESN M	2.78 °C
TreeESN R	8.09 °C

Table 6: Predictive performance on the QM9 dataset. Correlation between predicted and real value.

Algorithm	mu	alpha	HOMO	LUMO	gap	R2
S-RNN Non-Canonical SMILES	0.75	0.993	0.93	0.985	0.97	0.970
S-RNN Canonical SMILES	0.64	0.994	0.90	0.980	0.95	0.968
tf regression	0.72	0.659	0.86	0.949	0.93	0.734
tf regression ft	0.73	0.963	0.84	0.939	0.92	0.977
graph conv reg	0.76	0.810	0.91	0.977	0.96	0.824
weave regression	0.71	0.954	0.89	0.966	0.94	0.947
dtnn	0.97	0.993	0.96	0.988	0.98	0.998

Table 7: Predictive performance on the QM9 dataset. Correlation between predicted and real value.

Algorithm	ZPVE	Cv	U0	u298	h298	g298
S-RNN Non-Canonical SMILES	1.000	0.995	1.000	1.000	1.000	1.000
S-RNN Canonical SMILES	1.000	0.994	1.000	1.000	1.000	1.000
tf regression	0.880	0.741	0.671	0.671	0.671	0.671
tf regression ft	0.988	0.978	0.994	0.994	0.994	0.994
graph conv reg	0.927	0.824	0.741	0.741	0.741	0.741
weave regression	0.982	0.963	0.984	0.984	0.985	0.985
dtnn	0.999	0.997	0.998	0.998	0.998	0.998

Table 8: Predictive performance on the Guacamole dataset, Pearson correlation coefficient with the target.

Algorithm	logP	mol_weight	num_atoms	num_H_donors	tpsa
S-RNN Non-Canonical SMILES	0.999	0.999	0.999	0.999	0.999
S-RNN Canonical SMILES	1.000	0.999	0.999	0.999	0.999

Table 9: Test set negative log likelihood, artificial dynamical system, supervised setting. Bold indicates performances that are significantly better than the non-bold.

S-RNN	S-RNN $\lambda = 10^{2}$	S-RNN $\lambda = 10^{4}$	RNN
$\mathbf{-107}$	$\mathbf{-103}$	-57	-41

Table 10: Test set negative log likelihood on the unsupervised Gait problem. Bold indicates performances that are significantly better than non-bold.

# of Angles	S-RNN	RNN
2	-6.8	7.3
4	$\mathbf{-66}$	-14
8	$\mathbf{-32}$	83.3

Table 11: Test negative log likelihood on the supervised Gait problem. Bold indicates the lowest negative log likelihood that is statistically significantly better under a $t$-test.

Angles	S-RNN $\lambda = 0$ (mean)	S-RNN $\lambda = 100$ (mean)	p-value	S-RNN $\lambda = 10000$ (mean)	p-value	RNN (mean)	p-value
1	-47	-44	$\mathbf{2 \cdot 10^{-5}}$	-49	1.0	$\mathbf{-53}$	1
2	$\mathbf{-125}$	-114	$\mathbf{0}$	-100	$\mathbf{0}$	-97	$\mathbf{0}$

Equations

$$P_{\mathbb X,\boldsymbol{\phi}}(x)\triangleq P_{\mathbb A,\boldsymbol{\phi}}(\{a_{j}\in\mathbb{A}\mid X(a_{j})=x\})=P_{\mathbb A,\boldsymbol{\phi}}(X^{-1}(x))=\sum_{a\in X^{-1}(x)}P_{\mathbb A,\boldsymbol{\phi}}(a)$$

$$P_{\mathbb X,\boldsymbol{\phi}}(x)=\frac{P(o,x)}{P(o\mid x)}=\frac{\sum_{a\in X^{-1}(x)\cap O^{-1}(o)}P_{\mathbb A,\boldsymbol{\phi}}(a)}{P(o\mid x)}\quad\forall o\in\mathbb{O}$$

$$P(o\mid x)\triangleq\frac{\mu(X^{-1}(x)\cap O^{-1}(o))}{\mu(X^{-1}(x))}$$

$$P(a_{i}^{t+1}=b\mid a_{i}^{[1:t]})$$

$$s^{t+1}=f(s^{t},a^{t+1})\quad\forall\, 0\leq t<T$$

$$P(a^{[t+1:T]}\mid a^{[1:t]})=P(a^{[t+1:T]}\mid s^{t})$$

$$P(a_{i}^{t+1}=b\mid s_{i}^{t})=P(a_{j}^{t+1}=b\mid s_{j}^{t})\quad\forall b\in\mathbb{B},\ s_{j}^{t}=s_{i}^{t}$$

$$h^{0}=0;\quad h^{t}=\sigma(W_{hh}h^{t-1}+W_{hi}a^{t});\quad P(a^{1},\dots,a^{T})=\prod_{t=1}^{T}P_{\theta}(a^{t}\mid h^{t-1})$$

L =

$$P(X=x_{\rm test})\approx\frac{1}{m}\sum_{j=1}^{m}\frac{\sum_{a\in X^{-1}(x_{\rm test})\cap O^{-1}(o_{j})}P_{\mathbb A,\boldsymbol{\phi}}(a)}{P(O=o_{j}\mid X=x_{\rm test})}$$

$$\frac{d^{2}y_{1}(t)}{dt^{2}}$$

$$\frac{d^{2}y_{2}(t)}{dt^{2}}$$

$$\frac{d^{2}y_{3}(t)}{dt^{2}}$$

Taxonomy

Topics: Machine Learning and Algorithms · Topic Modeling · Domain Adaptation and Few-Shot Learning

Full text

Learning by stochastic serializations

Pablo Strasser*
Department of Business Informatics, University of Applied Sciences Western Switzerland

Stéphane Armand
Willy Taillard Laboratory of Kinesiology, Geneva University Hospitals and Geneva University, Switzerland

Stephane Marchand-Maillet
Department of Computer Science, University of Geneva, Switzerland

Alexandros Kalousis*
Department of Business Informatics, University of Applied Sciences Western Switzerland

Abstract

Complex structures are typical in machine learning. Tailoring learning algorithms for every structure requires an effort that may be saved by defining a generic learning procedure adaptive to any complex structure. In this paper, we propose to map any complex structure onto a generic form, called serialization, over which we can apply any sequence-based density estimator. We then show how to transfer the learned density back onto the space of original structures. To expose the learning procedure to the structural particularities of the original structures, we take care that the serializations reflect accurately the structures’ properties. Enumerating all serializations is infeasible. We propose an effective way to sample representative serializations from the complete set of serializations which preserves the statistics of the complete set. Our method is competitive or better than state of the art learning algorithms that have been specifically designed for given structures. In addition, since the serialization involves sampling from a combinatorial process it provides considerable protection from overfitting, which we clearly demonstrate on a number of experiments.

* Also member of the Department of Computer Science of the University of Geneva

1 Introduction

Many learning problems are defined over complex instance structures: learning instances can be sets, trees, sequences, etc. One typical approach to such problems is so-called propositionalisation [22], in which one maps such complex learning instances to vectorial representations, potentially losing discriminative information along the way. Another approach is to develop learning algorithms tailored to the representation particularities of any given problem, thereby preserving all information, at the cost of significant conceptual and development effort.

Instead, we propose to decouple learning from the structural specificities of the learning instances. To do so, we define an informed, randomized, mapping from any given complex-structured instance onto multiple and equivalent sequences. We then learn over the space of sequences and map back the result of the learning onto the space of the original instance structures. When mapping a complex instance structure onto a sequence, we must retain the specificity of the original structure in order to guarantee a revertible mapping and preserve its properties in order to learn correctly. We do so by carrying over the mapping a set of constraints and properties of the original instance structure to the sequences over which we learn.

Our approach opens a sound and systematic way to perform learning over arbitrary complex instance structures, and allows us to directly use any learning algorithm defined over sequences to do the learning. The fact that we map the instances to multiple and equivalent sequences over which we learn brings significant advantages when it comes to overfitting avoidance. We experiment with generative and discriminative learning settings over complex structures for a variety of problems and complex structures.

2 Related Work

The recent surge in generative modeling has seen the development of generative methods that can learn implicit or explicit distributions over complex instances and sample from them. Generative modeling has been applied to problems where the learning instances are two-dimensional structures, e.g. images [2, 13, 29, 27, 19]; sequences, for speech [35], text [14] and translation [31]; and graphs, e.g. drug modeling [12].

Of more interest to our work are generative models for structures such as sets [36, 38] and trees [1, 39, 9, 23, 40]. Such models incorporate the specificities of the original instance structure in how they factorize the generative distribution into a product of conditionals. The factorization controls the dependencies between the components of a given learning instance, ensuring that the inherent properties of the original instance structure are preserved. Examples of structural properties that can be expressed via the factorization of the conditionals are order independence and invariance to invertible transformations of the conditioned variable [24, 34]. These properties are used to express invariances of specific complex structures such as invariance to re-ordering of conditioning for sets [36, 38], invariance to re-ordering of siblings for trees, and invariance to relabeling of nodes for graphs.

In our work, we take a constructive modeling approach over the mapped sequences (serializations): we transfer the invariance properties of the original complex structures onto constraints in the procedure for constructing these serializations.

We model structure invariance via states representing the relevant information that the system has acquired at a given step of the generative process. The inherent properties of the original structure are thus expressed by which partial serializations are represented by the same state and which are not. For example, the invariance to re-ordering is expressed by the fact that building a given sub-structure by following two different orderings leads to the same state. The specific representation of a state is domain-specific and also allows us to incorporate further information about the inherent properties of the original structures.

Of direct relevance to our work is the Grammar Variational Autoencoder (GVA) [20], which learns generative models over arbitrary complex structures. There, structures are described as sequences of production rules of a context-free grammar, and a standard Variational Autoencoder is then trained over these sequences. The original instances are reconstructed by predicting the sequence of production rules. The GVA can be viewed as a special case of our framework since a context-free grammar can be transformed into a state-transition function. In the latter, a state contains a representation of what has been parsed so far, and transitions correspond to the application of the different production rules. Our framework makes explicit the notion of a state and uses it to enforce constraints on the probabilities of occurrences of training serializations.

Works in the area of graph neural networks [3, 11] study and propose models based on the transfer of information (messages) between nodes and edges. Our method has a similar motivation: it transfers information only to the parts of the model that need it. We have chosen to use a simpler architecture, sequences instead of graphs, at the cost of additional preprocessing; nevertheless our method can be generalized to graphs. Whether the additional complexity would bring performance improvements is an open question that warrants investigation.

3 Method

We are given a space $\mathbb{X}$ of complex, structured, learning instances, equipped with an unknown probability distribution $P_{\mathbb X}$, and a training set $\mathcal{X}=\{x_{1},\ldots,x_{n}|x_{i}\in\mathbb{X}\}$ of $n$ instances sampled i.i.d. from $P_{\mathbb X}$. Our goal is to learn a model $P_{\mathbb X,\boldsymbol{\phi}}$ of $P_{\mathbb X}$, where $\boldsymbol{\phi}$ is the set of model parameters.

3.1 The space of serializations

To learn the probability distribution over the space of the original complex structures $\mathbb{X}$, we first map structured instances $x\in\mathbb{X}$ onto sequences $a\in\mathbb{A}$. $\mathbb{A}$ is the space of sequences of finite but unknown length over some finite lexicon $\mathbb{B}$; the latter is given by the domain of the problem. We denote by $\mathcal{A}\subseteq\mathbb{A}$ the set of sequences generated by taking the maps of the training instances $x\in\mathcal{X}$. We learn a probability distribution, $P_{\mathbb A,\boldsymbol{\phi}}$, over $\mathbb{A}$, by training an RNN on $\mathcal{A}$. We use the learned distribution to construct $P_{\mathbb X,\boldsymbol{\phi}}$ in the original space. We call our proposal S-RNN (for Structural RNN) since it brings additional structural information into an RNN-based learning of sequences.

Since $\mathcal{A}$ is the only access the learner has to the structural particularities of the specific problem and their constraints, it is critical that we transfer to it as much information as possible from the original structures. To carry over the constraints from $\mathbb{X}$ to $\mathbb{A}$ we map every $x\in\mathcal{X}$ onto multiple sequences $a_{j}$, serializations of $x$, exhibiting the invariance properties of the original structure $x$. We denote by $a_{j}=[a_{j}^{1},a_{j}^{2},\dots,a_{j}^{T}]\in\mathbb{A}$ a serialization of $x$. Its elements are $a_{j}^{i}\in\mathbb B$; $T$ is the length of $a_{j}$. We discuss later the serializations and the serialization algorithm we use. For now, we simply note that the serialization algorithm parses the complex structure and produces in a sequential manner a serialization $a_{j}$. For example, a set $x=\{{\sf A},{\sf B},{\sf C}\}$ can be mapped onto any of its multiple serializations, say $a_{1}=[{\sf A,B,C}]$, $a_{2}=[{\sf A,C,B}]$, $a_{3}=[{\sf B,C,A}]$, etc., thus exhibiting order invariance. A partial serialization of $x$ is a subsequence $a_{j}^{[1:d]}=[a_{j}^{1},a_{j}^{2},\dots,a_{j}^{d}],d\leq T$.

Invariance properties on $x$ will be mapped onto invariance to local or global reordering in $a_{j}$ (e.g. "swapping elements in the serialization of a set doesn't matter") and/or conditioning the occurrence of subsequences in $a_{j}$ on its preceding subsequences (e.g. "if 'A' has occurred in the partial serialization, it will not appear anymore since elements occur only once within a set").
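As a hedged illustration of the set example, the following Python sketch enumerates the serializations of $\{A,B,C\}$, de-serializes them, and lists which lexicon elements may still follow a partial serialization. The function names (`serializations`, `deserialize`, `allowed_next`) are hypothetical and chosen for this illustration only; they are not taken from the paper.

```python
from itertools import permutations


def serializations(x):
    """All order-based serializations of a finite set x (n! of them)."""
    return [list(p) for p in permutations(sorted(x))]


def deserialize(a):
    """Map a serialization back to the set it encodes (order is discarded)."""
    return frozenset(a)


def allowed_next(x, partial):
    """Elements that may follow a partial serialization of x.

    Encodes the constraint that an element occurs only once within a set.
    """
    return sorted(set(x) - set(partial))


x = {"A", "B", "C"}
all_a = serializations(x)
print(len(all_a))              # number of serializations of x
print(all_a[0], all_a[0][:2])  # a serialization and its partial serialization a^{[1:2]}
print(allowed_next(x, ["A"]))  # 'A' cannot appear again
```

Every serialization de-serializes to the same set, which is the order invariance the text describes; `allowed_next` is one simple way to express the "occurs only once" conditioning.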

The serialization algorithm defines a mapping from an original instance $x$ onto a stochastic process whose sampled realizations will create the serializations $a_{j}$, resulting in $\mathcal{A}$. This mapping (serialization algorithm) comes from domain knowledge and must be revertible, i.e. a serialization $a_{j}$ generated from instance $x$ reconstructs/de-serializes to $x$ and only $x$.

In the next section, we discuss the mapping from $\mathbb{X}$ to $\mathbb{A}$ and how from a probability distribution learned over $\mathbb{A}$ we can extract a corresponding distribution on $\mathbb{X}$ ; we impose no constraints on the serialisations and the probability distribution on $\mathbb{A}$ . In section 3.3 we show how to impose structure constraints on the constructed sequences that also constrain the feasible probability distributions on $\mathbb{A}$ to finally construct $\mathcal{A}$ . Thus sections 3.2 and 3.3 describe only how to generate sequences which carry over the appropriate invariances of the original structures; they make no statement about the learning model. In section 3.4 we define a regulariser that imposes on the learning model the same structural constraints as the ones we use to generate the sequences, ie the regulariser brings into the model the same inductive bias we used to generate the sequences.

3.2 Serialization with no structural constraints

We assume that $\mathbb{A}$ is a probability space equipped with a distribution $P_{\mathbb A}$, whose estimate $P_{\mathbb A,\boldsymbol{\phi}}$ we learn by training an RNN on the set $\mathcal{A}$. Here, we describe our abstract model for the creation of a relevant training set $\mathcal{A}$ and how to generate a distribution $P_{\mathbb X,\boldsymbol{\phi}}$ from $P_{\mathbb A,\boldsymbol{\phi}}$.

We assume $\mathbb{X}$ to be measurable and define the random variable $X:\mathbb{A}\rightarrow\mathbb{X}$ . We install a probability distribution $P_{\mathbb X,\boldsymbol{\phi}}$ over the original space $\mathbb{X}$ by pushing forward the distribution $P_{\mathbb A,\boldsymbol{\phi}}$ , learned in the serialization space $\mathbb{A}$ , along $X$ . The r.v. $X$ classically allows us to compute probabilities over $\mathbb{X}$ by:

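$$P_{\mathbb X,\boldsymbol{\phi}}(x)\triangleq P_{\mathbb A,\boldsymbol{\phi}}(\{a_{j}\in\mathbb{A}\mid X(a_{j})=x\})=P_{\mathbb A,\boldsymbol{\phi}}(X^{-1}(x))=\sum_{a\in X^{-1}(x)}P_{\mathbb A,\boldsymbol{\phi}}(a)\tag{1}$$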

Following our earlier description, $X$ is the de-serialization procedure mapping a serialization $a_{j}$ to the original data $X(a_{j})=x$ . In turn, $X^{-1}$ , as a serialization process, represents the stochastic process sampling from the set of all possible serializations of $x$ , $X^{-1}(x)=\{a_{j}\in\mathbb{A}|X(a_{j})=x\}$ . Serializing via a particular serialization algorithm $X^{-1}$ , and therefore choosing a subset of all possible serializations, creates a bias in the representation of $x$ within $\mathcal{A}$ that we need to account for to maintain accurate learning. To address this bias, we introduce an abstract structure on the space $\mathbb{A}$ , which allows us to discriminate between equivalent sequences in $\mathcal{A}$ . We call this additional structure properties.

We define a random variable $O:\mathbb{A}\mapsto\mathbb{O}$, where $\mathbb{O}$ is a measurable space of properties. Using $O^{-1}$, we map these properties onto $\mathbb{A}$. An illustrative example of such properties on sequences is "elements of the sequence are in alphabetical order". Outcomes $o\in\mathbb{O}$ are properties that apply to sequences in $\mathbb{A}$ in general and will help characterize serializations as follows. A sequence $a_{j}\in\mathbb{A}$ may be a serialization of $x\in\mathbb X$ (i.e. $a_{j}\in X^{-1}(x)$) and/or bear property $o\in\mathbb O$ (i.e. $a_{j}\in O^{-1}(o)$). Thus, given an original instance $x$ and a property $o\in\mathbb{O}$ on sequences, the set of serializations of $x$ bearing property $o$ is $O^{-1}(o)\cap X^{-1}(x)$.

Let us demonstrate these notions using sets as our structure example. In figure 1 we show all possible serializations of sets with up to three elements. In the rectangle below each set we give all its serialisations. The properties correspond to an ordering which allows us to distinguish equivalent but different serializations of a given set (structure). In the example, all six serializations of $\{A,B,C\}$ are equivalent and are differentiated only by their ordering.

We want to learn the density of a given structure irrespective of the serialization choices (properties). We could marginalize out the properties. However, this would be intractable for structures with large equivalence classes. Instead, we adapt equation 1 and integrate the choice of the serialization (its probability), controlled by the $o$ variable:

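$$P_{\mathbb X,\boldsymbol{\phi}}(x)=\frac{P(o,x)}{P(o\mid x)}=\frac{\sum_{a\in X^{-1}(x)\cap O^{-1}(o)}P_{\mathbb A,\boldsymbol{\phi}}(a)}{P(o\mid x)}\quad\forall o\in\mathbb{O}\tag{2}$$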

$\mathbb{O}$ structures $\mathbb{A}$ and helps us count the distinct (but equivalent) serializations of a given $x$ ; it comes from domain knowledge. For example for a set with $n$ elements we know that we have $n!$ orderings of its elements, thus $n!$ distinct but equivalent serialisations.

We denote by $\mathbb{F}\subseteq 2^{\mathbb{A}}$ the set of events of $\mathbb{A}$. We see the stochastic process installed by $X^{-1}$ as a sampling strategy defined by an arbitrary measure $\mu:\mathbb{F}\mapsto\mathbb{R}_{>0}$. This measure $\mu$ indicates the importance we put on a serialization with respect to other possible serializations.

Hence abstractly, a run of the serialization algorithm computes a set of serializations $X^{-1}(x)$ and picks one according to the weight measure $\mu$ for feeding the training set $\mathcal{A}$ . $\mu$ therefore comes as a support to the calculation of our normalizer $P(o|x)$ as

$$P(o\mid x)\triangleq\frac{\mu(X^{-1}(x)\cap O^{-1}(o))}{\mu(X^{-1}(x))}\tag{3}$$

In the set example above if we treat all orderings as equiprobable we have $\mu(X^{-1}(\{A,B,C\}))=3!$ and $\mu(X^{-1}(\{A,B,C\})\cap O^{-1}([B,A,C]))=1$ . In the next section we will exploit and modify this abstract modeling to enforce constraints on the serializations that reflect the invariance properties of the original structure. In practice, unless we have a good reason to do otherwise we will choose the uniform distribution for $\mu$ .
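The relation between equations 1, 2 and 3 can be checked on the set example with a small sketch. It assumes a hypothetical learned distribution `P_A` that puts mass 1/10 on each ordering of $\{A,B,C\}$, takes $\mu$ to be the counting measure (equiprobable orderings), and identifies a property $o$ with one specific ordering. None of these numbers come from the paper.

```python
from fractions import Fraction
from itertools import permutations

x = frozenset("ABC")
serializations = list(permutations(sorted(x)))  # X^{-1}(x), 3! orderings

# Hypothetical learned distribution on A: mass 1/10 per ordering of x.
P_A = {a: Fraction(1, 10) for a in serializations}


def mu(subset):
    """Counting measure: every serialization has the same weight."""
    return len(subset)


def p_o_given_x(o):
    """Equation 3: mu(X^{-1}(x) ∩ O^{-1}(o)) / mu(X^{-1}(x))."""
    bearing_o = [a for a in serializations if a == o]
    return Fraction(mu(bearing_o), mu(serializations))


# Equation 1: push P_A forward by summing over all serializations of x.
P_X_eq1 = sum(P_A[a] for a in serializations)


def P_X_eq2(o):
    """Equation 2: P(o, x) / P(o | x) for one chosen property o."""
    joint = sum(P_A[a] for a in serializations if a == o)
    return joint / p_o_given_x(o)


print(p_o_given_x(("B", "A", "C")))       # 1/3!
print(P_X_eq1, P_X_eq2(("B", "A", "C")))  # both routes give the same value
```

In this toy case `P_A` agrees with the uniform $\mu$, so equation 2 returns the same value for every $o$; with a learned model the values generally differ across $o$, which is why several serializations are averaged later in the text.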

To compute $P_{\mathbb X,\boldsymbol{\phi}}(x)$ , we need $P_{\mathbb A,\boldsymbol{\phi}}(a)$ (eq. 2). We fit $P_{\mathbb A,\boldsymbol{\phi}}$ by maximum likelihood on $\mathcal{A}$ .

3.3 Serializations with structural constraints

We place serializations into $\mathcal{A}$ on the basis of their global occurrences (ie as complete serializations) and the sampling distribution $\mu$ . However, we do possess additional information on the structure which we will incorporate in our sampling procedure.

In the set example, since a set is order invariant, serializations $a_{3}=[{\sf B,C,A}]$ and $a_{4}=[{\sf C,B,A}]$ are equivalent, and so are their partial serializations $a_{3}^{[1:2]}=[{\sf B,C}]$ and $a_{4}^{[1:2]}=[{\sf C,B}]$ . We can explicitly model this equivalence by constraining the conditional probability of the next element of two serializations, given equivalent partial serializations, to be the same, i.e. $P_{\mathbb A,\phi}(a_{3}^{3}=b|a_{3}^{[1:2]})=P_{\mathbb A,\phi}(a_{4}^{3}=b|a_{4}^{[1:2]})\quad\forall b\in\mathbb B=\{\sf A,B,C\}$ . More generally, given two serializations $a_{i}$ and $a_{j}$ , the $t$ -length partial serializations of which are equivalent, we require that:

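$$P_{\mathbb A,\phi}(a_{i}^{t+1}=b\mid a_{i}^{[1:t]})=P_{\mathbb A,\phi}(a_{j}^{t+1}=b\mid a_{j}^{[1:t]})\quad\forall b\in\mathbb{B}\tag{4}$$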

Equation 4 transfers the structural invariances of the original instances $x\in\mathbb X$ to the serialisations $a\in\mathbb A$ produced from them. The constraints are enforced by identifying equivalent partial serialisations of $a_{i}$ and $a_{j}$ and ensuring that the probability distribution of the next element is the same. To do so we define a state space $\mathbb{S}$ and map sequences in $\mathbb{A}$ onto it so that equivalent partial sequences have the same state. $\mathbb{S}$ is equipped with a transition function $f:\mathbb{S}\times\mathbb{B}\rightarrow\mathbb{S}$ governing the construction of a sequence via its equivalent states. Hence, a serialization $[a^{1},\ldots,a^{t},\ldots,a^{T}]$ is represented by the sequence of states $s^{0},\ldots,s^{t},\ldots,s^{T}$ produced by the recurrence:

$$s^{t+1}=f(s^{t},a^{t+1})\quad\forall\, 0\leq t<T\tag{5}$$

where $s^{0}$ is an initial state representing an empty sequence in $\mathbb{A}$ , and therefore an empty object in $\mathbb{X}$ (eg a graph with no node or an empty set). At any step, a partial sequence $a^{[1:t]}=[a^{1},\ldots,a^{t}]$ is represented by state $s^{t}$ . Hence,

$$P(a^{[t+1:T]}\mid a^{[1:t]})=P(a^{[t+1:T]}\mid s^{t})\tag{6}$$

By combining equations 5 and 6, and imposing at step $t$ that the state $s_{i}^{t}$ of partial sequence $a_{i}$ is the same as the state $s_{j}^{t}$ of partial sequence $a_{j}$ , if the two partial sequences are equivalent, we model the equivalence relationship in equation 4 in the state space:

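$$P(a_{i}^{t+1}=b\mid s_{i}^{t})=P(a_{j}^{t+1}=b\mid s_{j}^{t})\quad\forall b\in\mathbb{B},\ s_{j}^{t}=s_{i}^{t}\tag{7}$$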
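For sets, one natural choice of state (an assumption made here for illustration; the paper leaves the state representation domain-specific) is the collection of elements emitted so far. The sketch below applies the recurrence $s^{t+1}=f(s^{t},a^{t+1})$ with this choice and shows that the partial serializations $[{\sf B,C}]$ and $[{\sf C,B}]$ reach the same state. The names `set_transition` and `states_of` are hypothetical.

```python
def set_transition(state, element):
    """Transition function f for sets: the state is the frozenset seen so far."""
    return state | {element}


def states_of(serialization, transition, initial_state):
    """Return [s^0, s^1, ..., s^T] produced by s^{t+1} = f(s^t, a^{t+1})."""
    states = [initial_state]
    for element in serialization:
        states.append(transition(states[-1], element))
    return states


s3 = states_of(["B", "C", "A"], set_transition, frozenset())  # a_3
s4 = states_of(["C", "B", "A"], set_transition, frozenset())  # a_4

print(s3[1] == s4[1])  # after one element the states differ
print(s3[2] == s4[2])  # after [B, C] and [C, B] the states coincide
```

Because the two serializations share the state at $t=2$, the state-space form of the constraint forces them to share the distribution of the next element.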

Modeling with states enables the incorporation of structural constraints on the serializations. Such constraints create correlations between (sub-)serializations and prevent us from sampling whole serializations at once, as we did in the previous section. Even different instances $x$ may share substructures and thus share equivalent partial serializations. We thus need to enforce equivalence constraints across different instances. To do so we sample at the level of the serialization element. We adapt our sampling measure $\mu$ to reflect these equivalence constraints. In other words, we adapt $\mu$ to express how the next element $a^{t+1}$ is sampled from $\mathbb B$ with respect to the state $s^{t}\in\mathbb S$. We redefine the measure as $\mu:\mathbb{S}\times\mathbb{B}\rightarrow\mathbb{R}_{>0}$ to provide a measure over the joint set of states ($s^{t}$) and the lexicon from which $a^{t+1}$ will be sampled. The $\mu$ measure allows us to prioritize orderings. Because we define $\mu$ on the states and not on the past sequences, we automatically ensure that our sampling follows the constraints.

In practice, defining an appropriate sampling strategy is difficult, but this leads us to an interesting algorithmic solution. We give in Algorithm 1 of the appendix a procedure to efficiently sample a serialization, focusing on the structural constraints and compensating for the bias coming from the specificity of the base serialization algorithm $X^{-1}$. In a nutshell, given a base set of serializations $\mathcal{A}_{\rm all}=X^{-1}(x)$, the procedure samples, element by element, a serialization guaranteed to come from the base set according to the statistics of the base set and $\mu$. At each time step $t$, the procedure stores in the set $\mathcal{L}$ all elements $a_{j}^{t}$ as possible next elements for the currently reconstructed sequence $a_{\rm sample}^{[1:t-1]}$. The next element $a^{\rm next}$ is then sampled from $\mathcal{L}$ using $\mu$ and $s$ and concatenated to $a_{\rm sample}^{[1:t-1]}$ (i.e. $a_{\rm sample}^{t}=a^{\rm next}$). In order to preserve the consistency of the sampled serialization, all serializations in $\mathcal{A}_{\rm all}$ not having element $a^{\rm next}$ as $t$th element are removed from $\mathcal{A}_{\rm all}$. This is repeated until the special $eos$ token is sampled. The final training set $\mathcal{A}$ is obtained by sampling serializations from $\mathcal{A}_{\rm all}$ in this way for every training instance $x$.
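The element-by-element procedure described above can be sketched as follows. This is a minimal reading of the prose description, not the appendix's Algorithm 1 itself: the base set is given as explicit tuples ending in a hypothetical `EOS` token, $\mathcal{L}$ is taken as the set of distinct candidate elements, $\mu$ defaults to uniform, and the state is the frozenset of emitted elements (a choice suited to sets only).

```python
import random
from itertools import permutations

EOS = "<eos>"  # hypothetical end-of-serialization token


def set_transition(state, element):
    """Assumed state update for sets: remember which elements were emitted."""
    return state | {element}


def sample_serialization(base_set, transition, initial_state, mu=None, rng=random):
    """Sample one serialization from base_set, one element at a time."""
    remaining = list(base_set)  # A_all, decimated as elements are chosen
    state = initial_state
    sample = []
    t = 0
    while True:
        # L: distinct elements found at position t of the remaining serializations
        candidates = sorted({a[t] for a in remaining})
        # mu(state, element) weights each candidate; uniform when mu is None
        weights = [1.0 if mu is None else mu(state, b) for b in candidates]
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        sample.append(nxt)
        if nxt == EOS:
            return sample
        # keep only serializations consistent with the sampled element
        remaining = [a for a in remaining if a[t] == nxt]
        state = transition(state, nxt)
        t += 1


base_set = [p + (EOS,) for p in permutations("ABC")]
sample = sample_serialization(base_set, set_transition, frozenset(),
                              rng=random.Random(0))
print(sample)
```

Since candidates are always drawn from the surviving members of the base set, the returned sequence is guaranteed to be one of the base serializations. Passing a `mu` that favours one candidate per state reproduces a fixed ordering.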

Note that, for example, in the simple case of sets of size $n$ with uniform sampling, this decimation (equivalent to saying that every element may appear once only and at any position) will lead to every serialization (with no restriction on their property) having a probability of $\frac{1}{n!}$, which is consistent with reality. On the other hand, choosing a sampling strategy that always samples in alphabetical order will lead to a single serialization (with no restriction on its property) having a probability of $1$, which is also consistent with reality. So, unless we want to bias the learner towards a particular serialization order, there is no reason to use a distribution other than uniform for $\mu$. Note however that the final sampling strategy described in Algorithm 1 is not uniform over serializations; only the distribution of the next element given the current state is. States are then updated following equation 5. This procedure therefore makes the creation of an unbiased training set $\mathcal{A}$, informing the learner about the structural constraints within $\mathbb X$, very efficient.
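The $\frac{1}{n!}$ value for sets follows from a short calculation, given here as a worked check under the assumption that $\mu$ is uniform over the candidates remaining at each step. After $t$ elements have been emitted, $n-t$ distinct elements remain as candidates, each chosen with probability $\frac{1}{n-t}$, and the final $eos$ token is then the only candidate:

```latex
P(a_{\rm sample}) = \prod_{t=0}^{n-1} \frac{1}{n-t}
                  = \frac{1}{n}\cdot\frac{1}{n-1}\cdots\frac{1}{1}
                  = \frac{1}{n!},
\qquad\text{e.g. } n=3:\ \frac{1}{3}\cdot\frac{1}{2}\cdot 1 = \frac{1}{6}.
```

For sets the per-step uniform choice therefore happens to induce a uniform distribution over complete serializations; for other structures, where the number of candidates depends on the path taken, it need not.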

3.4 Regularized learning

Having constructed $\mathcal{A}$ we use an RNN to learn $P_{\mathbb A,\boldsymbol{\phi}}$. We can support the RNN learning by defining a regularizer based on the constraints of our domain. To this end, we will use the structural constraints we gave in equation 7 (to guide the generation of serialisations) to define a corresponding regulariser on the states that the RNN learns over these serialisations. We create a binary sparse matrix $\left(C_{jk}^{t}\right)$ (${j,k=1\dots|\mathcal{A}|},{t=1\dots T}$) storing state equivalences between the serializations $a_{j}\in\mathcal{A}$ (i.e. $C_{jk}^{t}=1$ iff $s_{j}^{t}=s_{k}^{t}$). Let $h^{t}\in\mathbb{R}^{H}$ be the $H$-dimensional hidden state of the RNN. The probabilistic model of an RNN is given by:

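$$h^{0}=0;\quad h^{t}=\sigma(W_{hh}h^{t-1}+W_{hi}a^{t});\quad P(a^{1},\dots,a^{T})=\prod_{t=1}^{T}P_{\theta}(a^{t}\mid h^{t-1})$$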

where $W_{hh}$ is the hidden-to-hidden weight matrix, $W_{hi}$ the input-to-hidden weight matrix and $P_{\theta}$ is a distribution with parameters $\theta$. We will use S-RNN to learn generative (conditional and unconditional) as well as discriminative models. In the former case the goal is to learn to generate the complex structures. The learning objective is:

[TABLE]

where the first term is the loss and the second is the regulariser. When we learn a conditional generative model the loss term will also include a $t=0$ step which will be conditioning the $h_{j}^{1}$ hidden state on the conditioning variables. In the discriminative modelling, where the goal is to predict a target value ( $y_{j}$ ) from a complex structure, we use a discriminative loss defined over the last hidden state and the target variable replacing the log-likelihood term above with $\ln(P_{\theta}(y_{j}|h_{j}^{T}))$ .
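To make the log-likelihood term concrete, here is a minimal pure-Python sketch of the RNN factorization $P(a^{1},\dots,a^{T})=\prod_{t=1}^{T}P_{\theta}(a^{t}\mid h^{t-1})$ with $h^{0}=0$. Several choices are assumptions made for illustration and are not stated in the paper: $\sigma$ is taken to be tanh, $P_{\theta}$ is a softmax over a hypothetical read-out matrix `W_out`, tokens are integer indices fed as one-hot vectors, and the weights are random rather than trained.

```python
import math
import random


def matvec(W, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]


def sequence_log_prob(seq, W_hh, W_hi, W_out):
    """ln P(a^1..a^T) = sum_t ln P_theta(a^t | h^{t-1}), with h^0 = 0."""
    h = [0.0] * len(W_hh)
    log_p = 0.0
    for token in seq:
        probs = softmax(matvec(W_out, h))  # P_theta(. | h^{t-1}), assumed softmax
        log_p += math.log(probs[token])
        # one-hot input: W_hi @ onehot(token) is column `token` of W_hi
        pre = [r + W_hi[i][token] for i, r in enumerate(matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]    # sigma assumed to be tanh
    return log_p


rng = random.Random(0)
V, H = 3, 4  # lexicon size and hidden size (illustrative values)
W_hh = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(H)]
W_hi = [[rng.uniform(-1, 1) for _ in range(V)] for _ in range(H)]
W_out = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(V)]

print(sequence_log_prob([0, 2, 1], W_hh, W_hi, W_out))
```

Summing `sequence_log_prob` over the serializations in $\mathcal{A}$ gives the generative log-likelihood; for the discriminative case the same recurrence would be run to $h^{T}$ and a separate output distribution over $y$ applied there.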

The regulariser enforces exactly the same bias in learning that we used to generate the serialisations. Given the often combinatorial nature of the sequence generation, we will have very large sample sets over which we will train; in such cases the utility of regularisation is limited, if any, a fact that we also confirm in our experiments.
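To make the state-equivalence matrix $C$ of this section concrete, the sketch below builds it in sparse form for a few serializations of a set. It rests on assumptions that are ours: the state of a set serialization at step $t$ is taken to be the set of elements emitted so far (the actual state definition comes from the equations referenced above), time steps are 0-based, and only the pairs $j<k$ with $C_{jk}^{t}=1$ are stored since the matrix is symmetric.

```python
from collections import defaultdict


def prefix_states(serialization):
    """Illustrative state for sets: the elements emitted up to each step."""
    seen, states = set(), []
    for element in serialization:
        seen.add(element)
        states.append(frozenset(seen))
    return states


def state_equivalences(states):
    """states[j][t] is the hashable state of serialization j at step t.
    Returns a list indexed by t of the pairs (j, k), j < k, with equal states."""
    if not states:
        return []
    horizon = min(len(seq) for seq in states)
    equivalences = []
    for t in range(horizon):
        buckets = defaultdict(list)  # state -> serializations in that state
        for j, seq in enumerate(states):
            buckets[seq[t]].append(j)
        pairs = set()
        for members in buckets.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    pairs.add((members[a], members[b]))
        equivalences.append(pairs)
    return equivalences


serializations = [("a", "b", "c"), ("b", "a", "c"), ("a", "c", "b")]
C = state_equivalences([prefix_states(s) for s in serializations])
```

Grouping serializations by state first avoids comparing all pairs when most states are distinct, which is what makes a sparse representation attractive here.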

3.5 Recovering the density on the original structures

Given a trained model we want to compute the probability of an instance $x_{\rm test}$ and obtain $P_{\mathbb X,\boldsymbol{\phi}}(x_{\rm test})$ using equations 2 and 3. Additionally, in practice we know what serialization algorithm we use, and also know its properties (i.e. the properties of the serializations it produces). We use this knowledge to compute $P(o|x_{\rm test})$ without generating all serializations in $X^{-1}(x_{\rm test})$ and also use the same algorithmic enumeration of serializations as proposed in Algorithm 1 to estimate their specific probability of occurrence.

Hence, given one serialization $a_{j}$ of $x_{\rm test}$ with property $o_{j}$ , we access its learned probability $P_{\mathbb A,\boldsymbol{\phi}}(a_{j})$ . We can extract the number of serializations $a$ bearing the same property $o_{j}$ and therefore easily compute $P(o_{j},x_{\rm test})$ . Similarly, by normalizing serializations for their probability of occurrence, we get $P(o_{j}|x_{\rm test})$ and therefore are able to compute an estimate of $P_{\mathbb X,\boldsymbol{\phi}}(x_{\rm test})$ via $a_{j}$ . Generating $m$ equivalent serializations of $x_{\rm test}$ by using $m$ times algorithm 1 to obtain $(a_{j})$ with $j=1,\ldots,m$ (each bearing property $o_{j}$ ), we can improve the accuracy of the estimate by taking the expectation so we finally get:

[TABLE]

4 Experiments

We experiment on a set of learning problems where learning instances have diverse structures: sets, trees, graphs, and multivariate time series. We learn generative (conditional and unconditional) models and discriminative (classification and regression) models and show that S-RNN achieves performance comparable to state-of-the-art baselines that have been specifically tailored to these structures.

Set problems

We use S-RNN in a discriminative manner to solve a classification problem over sets. We serialize a set to a sequence using a random ordering of the elements of the set. The task is to classify a 3D model of an object represented as an infinite set of 3D points (the surface of the 3D model). We use the ModelNet10 and ModelNet40 datasets [30], which contain objects from 10 and 40 different object types, respectively. After preprocessing of the points, we obtain a set of 6-dimensional feature vectors ($\mathbb{B}=\mathbb{R}^{6}$) from which we randomly sample $T$ points. We add a classifier to the last state for the final classification. We experiment with and without our regulariser ($\lambda=1$ and $\lambda=0$, respectively) and the results are identical, showing, as expected in our setting, that the regulariser cannot bring performance improvements.

We summarize the results in table 1. With $T=500$ points our model achieves the same performance as Deep-Set, specifically designed for set problems.

Moving to larger sets ($T=5000$), and thus longer sequences, does not improve the performance, a rather well-known fact with RNNs.

For completeness we provide the best results on these datasets obtained by RotationNet [17]. However, we should note that RotationNet’s model is not based on the concept of a set; rather, it uses CNNs on multiple learned views of the objects. For a more detailed discussion on the experiments see section B.1 in the appendix.

Tree problems

We experiment with two learning tasks where instances are trees, either ordered (the children’s order matters) or unordered (the order does not matter). We serialize the former by traversing their nodes from the root to the leaves; since here the order is important, we cannot use randomization. We serialize unordered trees in the same way, but in addition we randomize the children’s order. We experiment with two learning problems. In the first we use S-RNN to learn a conditional generative model that generates an unordered tree given its textual description. We compare against the DRNN baseline [1]. We evaluate the conditional tree generation as a node and edge retrieval task using precision, recall and F1 score. For lack of space we give a complete description of the setting and the results in appendix section B.2, table 4. S-RNN outperforms the baseline by an important margin for all measures except recall. In the second learning problem we use S-RNN to learn a discriminative model that predicts a scalar given a tree. Here we compare against the results of two Tree Echo State Network variants, TreeESN-R and TreeESN-M [10]. We consider both ordered and unordered trees. S-RNN, trained on ordered and unordered trees, gives better results than TreeESN-R, but performs worse than TreeESN-M. For detailed results and discussion see appendix section B.2.

Graph/Molecule problems

We evaluate S-RNN on regression problems in which instances are graphs that describe molecules. We serialise the molecular graph into SMILES strings [37]. We experiment with canonical (non-randomized) and non-canonical (randomized) SMILES. The latter is the default serialization we use in S-RNN. We experiment on the QM9 dataset from the deepchem benchmark [26] and compare against a number of relevant baselines. In the appendix we also include additional results on a set of problems defined on the Guacamole dataset [6]. For both datasets we use S-RNN to learn a discriminative model for regression. We did not use the regularizer ($\lambda=0$). We evaluate the predictive performance by reporting the squared Pearson coefficient on the test set. Our results, in table 2, show that the use of non-canonical SMILES outperforms the canonical representation and solves nearly perfectly all tasks, with the exception of mu. It also outperforms all baselines with the exception of the one that uses the 3D position information of the atoms. Note that we do not use such information in the two representations with which we experimented. For detailed descriptions of the experiments and results see appendix, section C.

Multivariate dynamical systems datasets

We explore the performance of S-RNN on multivariate dynamical systems problems where the instances are essentially multivariate time series. We serialise these multivariate time series variable by variable, using as features the combination of variable id and value. The order of variable serialisation is random. We indicate time advancement with a dedicated symbol. We report here the results on a real-world dataset, gait, which contains recordings of gait trajectories of people with pathological gait. The goal is to learn to generate the evolution of different joint angles in time; we experiment with different numbers of angles (1, 2, 4, 8). In addition to that dataset we also experiment with an artificial dataset generated using a known dynamical system; for lack of space we report these results in the appendix. On the gait dataset we use S-RNN to learn a conditional and an unconditional generative model. In the former we generate a multivariate time series (gait) given some patient-specific input features; in the latter we generate plausible gaits. As baseline we use a standard RNN that has the same architecture as our model. In the unconditional generation S-RNN significantly outperforms the RNN for all the numbers of angles with which we experimented. In the case of conditional generation each model has one significant win (table 3). We also ran experiments to study the effect of the regulariser (cf. table 9 of the appendix). As we also saw in the set experiments, its use does not bring a significant performance improvement. We believe that this happens because the randomization provides a considerable part of the information that is also exploited in the regularizer. For more details on this set of experiments see section C.1 of the appendix.

Randomization and overfitting

An important part of S-RNN is the randomized sampling procedure we used to generate the serialisations of a given complex structure. Our experimental results, which we will very briefly discuss, show that this offers a significant protection from overfitting. We generated the learning curves, on both the training and testing sets, as a function of the training epochs for the different structures with which we experimented (figure 2). In the experiments with sets we see that the test performance follows a similar evolution to that on the training set and they never diverge. A similar picture also appears in the tree problems. In the graph problem we plot the learning curves on the canonical (non-randomized) and non-canonical (randomized) SMILES. We see that in the randomized SMILES the learning curves on the training and test exhibit very similar relative behaviors. This is not the case for the canonical SMILES where very early the train and test learning curves diverge. Without randomization, the model may overfit on a particular SMILES string. With randomization the additional complexity of all possible equivalent SMILES strings forces the model to generate a representation which is compatible with the randomization procedure. In the case of the dynamical systems problem we compare the learning curves of S-RNN with a multivariate RNN. As we can see there is no divergence between the performance on the training and the testing set for S-RNN. When it comes to RNN the overfitting is very severe and happens rather early. We give in the appendix extensive results on the learning behavior of S-RNN on the different structures.

5 Conclusion

We have presented S-RNN, a generic framework for performing density estimation over arbitrary complex data structures. On a number of different structures, S-RNN achieves a performance that is comparable to or better than that of methods specifically developed for these structures. Our genericity hinges on the existence of a serialization/de-serialization procedure for the data structures in question. The knowledge and control of the serialization operations allow us to transfer the structural properties of the original data onto generic sequences over which an RNN will learn. The combinatorial nature of the serialization provides us with a potentially unlimited set of training samples. We show empirically that the diversity and uniqueness of such samples provide strong protection against overfitting. In fact, the combinatorial nature of our serialization renders the regulariser that we have introduced useless, since the multiple serialisations force the learned model to factorise states in a similar manner to the regulariser.

Appendix A Algorithm

As explained in section 3.3, the main goal of the algorithm is to sample a serialization $a$ for a given instance $x$, with the importance of a serialization with respect to others given by $\mu$. Because $\mu$ depends on the state, we need to construct the serialization element by element; however, we also need to ensure that the deserialization of the constructed serialization is $x$. Our strategy is the following:

1. List all possible serializations.

2. For every time step $t$ do:

(a) Compute the set of possible next elements.

(b) Sample the next element from the set of possible next elements.

(c) Update the state and the list of all possible serializations.

These steps are explained in detail in our pseudo-code 1 and 2. Note that in practice, for common structures, we do not need to compute the set of all serializations. We only need to know the set of next elements that lead to a serialization of the instance $x$ we are interested in. Because a majority of common structures have a serialization algorithm which is recursive in nature, this can be much faster to implement than the general algorithm.
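The general strategy above can be sketched as follows. This is an illustrative version, not the pseudo-code of the paper: it assumes a uniform $\mu$ over the admissible next elements, that the full list of serializations of $x$ is small enough to enumerate, and that no serialization is a proper prefix of another. The helper names `sample_serialization` and `exact_probabilities` are ours.

```python
import random
from fractions import Fraction


def sample_serialization(all_serializations, rng=random):
    """Build one serialization element by element, keeping only the
    candidate serializations compatible with the prefix chosen so far."""
    candidates = [tuple(s) for s in all_serializations]
    prefix, t = [], 0
    while any(len(c) > t for c in candidates):
        # (a) set of possible next elements at step t
        next_elements = sorted({c[t] for c in candidates if len(c) > t})
        # (b) uniform choice among them (assumed form of mu)
        choice = rng.choice(next_elements)
        prefix.append(choice)
        # (c) keep only serializations that agree with the new prefix
        candidates = [c for c in candidates if len(c) > t and c[t] == choice]
        t += 1
    return tuple(prefix)


def exact_probabilities(all_serializations):
    """Probability of each serialization under the procedure above."""
    result = {}

    def recurse(candidates, t, prob):
        unfinished = [c for c in candidates if len(c) > t]
        if not unfinished:
            result[candidates[0]] = prob
            return
        next_elements = sorted({c[t] for c in unfinished})
        for element in next_elements:
            kept = [c for c in unfinished if c[t] == element]
            recurse(kept, t + 1, prob / len(next_elements))

    unique = sorted({tuple(s) for s in all_serializations})
    if unique:
        recurse(unique, 0, Fraction(1))
    return result


probs = exact_probabilities(["abc", "acb", "bac"])
```

In this toy example the next-element distribution is uniform at every step, yet the serialization `bac` receives probability 1/2 while `abc` and `acb` receive 1/4 each, which illustrates why uniform next-element sampling does not imply a uniform distribution over whole serializations.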

Appendix B Detailed descriptions of experiments and results

B.1 Set problems

We use S-RNN to solve a classification task over sets. The task is to classify the 3D model of a shape represented as a set of points. We experiment with two datasets, ModelNet10 and ModelNet40 [30], which respectively contain 10 and 40 object types, such as airplane, xbox, stair, car, …. Each object type has a minimum of 100 different instances. We use the official split into train and test sets. The total number of instances is 4896 and 12312 respectively.

We normalize the data into the unit cube using a homogeneous scale and a translation. We represent the points with a six-dimensional vector given by their $x,y,z$ coordinates and their squares. We added the squares as features because they allow the data to be converted into cylindrical and spherical coordinates by a linear transformation. We believe that some important features are easier to compute in cylindrical/spherical coordinates. In addition, we added a learned feature extractor in the form of an MLP.
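The preprocessing just described can be sketched as below. The exact convention for the unit cube is not specified in the text, so the sketch assumes $[0,1]^3$, with a single scale factor shared by all axes; both function names are illustrative.

```python
def normalize_to_unit_cube(points):
    """Translate and uniformly scale 3D points so they fit in [0, 1]^3.
    The same scale is applied to every axis (homogeneous scaling)."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    # Largest side of the bounding box; fall back to 1.0 for a single point.
    extent = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0
    return [tuple((p[i] - mins[i]) / extent for i in range(3)) for p in points]


def six_dim_features(points):
    """Coordinates followed by their squares: (x, y, z, x^2, y^2, z^2)."""
    return [(x, y, z, x * x, y * y, z * z) for x, y, z in points]


features = six_dim_features(normalize_to_unit_cube([(0, 0, 0), (2, 1, 0)]))
```

With the squares available, quantities such as the squared cylindrical radius $x^2+y^2$ or the squared spherical radius $x^2+y^2+z^2$ become linear functions of the feature vector.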

To sample the points of a given object, we sample points uniformly on the surface of its model. This is done by first representing the 3D model as a set of triangles. We then compute the surface area of the 3D model by summing the contribution of every triangle. We then sample points by first sampling a triangle with probability proportional to its area and then uniformly sampling a point on the surface of the triangle.
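A minimal sketch of this two-stage sampling follows. The text does not say how a point is drawn uniformly inside a triangle; the sketch uses the common reflection trick on two uniform numbers, which is our choice, and all function names are illustrative.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of the 3D triangle (a, b, c): half the norm of (b-a) x (c-a)."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.sqrt(sum(v * v for v in cross))


def sample_point_in_triangle(a, b, c, rng):
    """Uniform point in the triangle: draw (u, v) in the unit square and
    reflect it into the lower-left half when u + v > 1."""
    u, v = rng.random(), rng.random()
    if u + v > 1.0:
        u, v = 1.0 - u, 1.0 - v
    return tuple(a[i] + u * (b[i] - a[i]) + v * (c[i] - a[i]) for i in range(3))


def sample_surface_points(triangles, n, rng=None):
    """Pick triangles with probability proportional to their area, then
    draw one uniform point inside each selected triangle."""
    rng = rng or random.Random()
    areas = [triangle_area(*tri) for tri in triangles]
    chosen = rng.choices(triangles, weights=areas, k=n)
    return [sample_point_in_triangle(*tri, rng) for tri in chosen]


# Unit square in the z = 0 plane, split into two triangles.
square = [((0, 0, 0), (1, 0, 0), (1, 1, 0)), ((0, 0, 0), (1, 1, 0), (0, 1, 0))]
points = sample_surface_points(square, 200, random.Random(0))
```

Each call produces a fresh random subset of the surface, so the list of sampled points is itself one serialization of the underlying set.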

The resulting serialization is the sequence of the sampled points. Note that, neglecting floating point precision, the number of possible serializations is infinite, as there are infinitely many points that can be sampled on a triangle.

After tuning on a validation set, built from 100 instances that we removed from the training set from every class, we settled on the following model architecture. The network hyper-parameters were tuned over the set $[16,32,64,128,256]$.

As feature extractor, we use a 64 unit MLP with a single hidden layer transforming the 6 dimensional feature to a 64 dimensional feature which will then be input to the GRU.

To learn over the serialisations we use a GRU [7] with 64 units, on top of which we add a single layer with 64 hidden units to do the classification. As optimizer we use Adam [18] with a learning rate of $10^{-4}$ together with gradient clipping to the range $[-5,5]$. We also experimented with curriculum learning [4], starting with a length of 10 samples and increasing the number of samples by 4 every epoch. This did not change the result, as can be seen in table 1.

The central result of our paper, generalization over the structure, can be seen in figure 3: even though there is an infinite number of instances in both training and testing, our model is able to express and learn the correct concept. We also see that using the regularizer or not does not change the result.

In this experiment, we can dissociate the problem of understanding what the structure of a set is from the problem of classifying. Our model generalizes almost perfectly on random samples of a given set, as can be seen from its training results on instances that it has never seen. However, it does not capture well what associates a 3D model with its respective label. This association can be described by notions such as translation invariance and local feature extractors like the ones learned by CNNs, which are in fact the methods that achieved the best performance in the classification of 3D point clouds.

B.2 Tree problems

B.2.1 Data structures in tree datasets

We experiment with S-RNN on two different learning tasks involving instances that are represented as trees. We consider ordered and unordered trees. In the former the order of the child nodes is important while in the latter it is not. Ordered trees are mostly used in NLP tasks while unordered ones are used in graph modeling. The first task is a conditional tree generation task in which the goal is to generate a tree given its textual description. Thus here a learning instance is an $(\mathbf{x},\mathbf{Y})$ pair where the predictive component $\mathbf{x}$ is a sequence and the target $\mathbf{Y}$ is a tree. The second task is a regression task where the goal is to predict a scalar given a tree, i.e. learning instances are now of the form $(\mathbf{X},y)$, where $\mathbf{X}$ is a tree and $y\in\mathbb R$. As is customary in (generative) tree modeling we make the assumption that the probability distribution that governs the generation of a given node is a function of its parent nodes and its so-far seen siblings. For the conditional tree generation task we use the synthetic data and the evaluation code of [1]. The goal is to predict the topology of an ordered tree given only its node sequence, as produced by a depth-first traversal of the tree, and no topological information. Node labels are taken from the 26-letter alphabet ${\sf T}_{1}=\{{\sf A},{\sf B},\dots,{\sf X},{\sf Y},{\sf Z}\}$. We use the train/validation/test set separation of the original paper, i.e. 4000 training, 500 validation, and 500 testing instances. The tree sizes vary considerably, with the smallest trees having only a single node and the largest ones 20, with the average number of nodes being 4. For the regression task the goal is to predict the boiling point of alkane molecules ([10]). Here we consider both ordered and unordered trees. 
The node labels are taken from the set ${\sf T}_{2}=\{{\sf C},{\sf CH},{\sf CH2},{\sf CH3},{\sf CH3F},{\sf CH4}\}$ where each label indicates how many hydrogen atoms are linked to the carbon atom. Here the number of nodes per tree varies from one to ten, with an average of five. The dataset has 150 learning instances. We estimate the performance by averaging the performance estimates over three hold-out sets, where the size of the hold-out is 20. From the remaining 130 instances we use 100 for training and the remaining 30 for parameter tuning.

B.2.2 Serialisation for tree problems

We now describe how the serialization algorithm treats the learning instances, i.e. the ($\mathbf{x},\mathbf{Y}$) or ($\mathbf{X},y$) pairs, starting by describing the serialisation of a tree. In serialising tree structures the dictionary $\mathbb B$ contains only categorical elements, and in particular it is the set $\{{\sf(},{\sf)}\}\cup\text{NL}$, where ${\sf(}$ indicates that the next elements of the serialisation will be the children of the current node, ${\sf)}$ indicates that we have completed the list of children of the current node, and NL is the set of all node labels for the given tree problem, i.e. ${\sf T}_{1}$ for the tree prediction problem and ${\sf T}_{2}$ for the boiling point prediction problem. To produce the serialisation we traverse the tree in a depth-first manner and add elements to the serialisation as we move from node to node. For ordered trees the order of traversal of the children of a node is the same as the one given by the tree, i.e. there is no randomness here. For unordered trees the order of traversal is random, i.e. $\mu$ is uniform over the non-selected children. To give an example, for the ordered tree with root ${\sf A}$ and two child nodes ${\sf B}$ and ${\sf C}$ its unique serialisation will be $[{\sf A},{\sf(},{\sf B},{\sf C},{\sf)}]$. If the tree is unordered then it will have two possible serialisations, $[{\sf A},{\sf(},{\sf B},{\sf C},{\sf)}]$ and $[{\sf A},{\sf(},{\sf C},{\sf B},{\sf)}]$. The state $s$ associated with the partial serialisation we have generated just before arriving at some node $k$ of a tree is given by the sequence of the parent nodes of $k$ and its so-far seen siblings, i.e. it does not depend on the children of its seen siblings. This state representation reflects the main assumption in tree modeling, mentioned above, i.e. that the generative distribution of a node is a function of only its parent nodes and its so-far seen siblings. 
We only use this state representation when we want to impose the structural constraints regulariser. The tree serialisation is one component of the learning instance serialisation. In the tree prediction problem we need to serialise $(\mathbf{x},\mathbf{Y})$ pairs, where $\mathbf{x}$ is the node label sequence of the depth-first tree traversal. Since here $\mathbf{x}$ is already a sequence there is no serialisation involved for it. In addition, since the elements come from ${\sf T}_{1}$, we do not even need to extend the dictionary $\mathbb B$ since the node labels will already be in it. However, we prefer to use a different label set, ${\sf T}_{1}^{\prime}$, for the elements of the input sequences in order not to provide to the algorithm the domain knowledge about the correspondence of the building blocks of the sequences and the trees. This makes the problem more difficult since the algorithm will now need to learn these correspondences. Thus the final dictionary is $\mathbb B=\{{\sf(},{\sf)}\}\cup{\sf T}_{1}\cup{\sf T}_{1}^{\prime}$. The sampling measure $\mu$ we use to serialise an $(\mathbf{x},\mathbf{Y})$ learning instance randomly selects to include the $\mathbf{x}$ component first in the serialisation half of the time, while the other half of the time it first serialises the tree $\mathbf{Y}$. Essentially we are feeding the model with samples from both the $P(\mathbf{Y}|\mathbf{x})$ and $P(\mathbf{x}|\mathbf{Y})$ distributions and the learning algorithm learns associations between their individual building blocks, eventually learning the complete joint distribution $P(\mathbf{x},\mathbf{Y})$. For the regression task, where the learning instances come in the form $(\mathbf{X},y)$, the dictionary is now given by $\mathbb B=\{{\sf(},{\sf)}\}\cup{\sf T}_{2}\cup\{{\sf t}\}\cup\mathbb R$. It thus also includes real-valued elements, since these are used for the target variable. 
The label ${\sf t}$ stands for target and it will always be followed by a scalar, describing thus the target value $y$ for the given training instance. As in the tree prediction problem the sampling measure $\mu$ selects randomly in half the serialisations the $\mathbf{X}$ component first and in the other half the $y$ component.
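The tree serialisation described in this section can be sketched as a short recursive function. The tuple-based tree representation `(label, children)` and the function name are our own illustrative choices, and the sketch assumes, consistently with the example above, that leaves are emitted without parentheses.

```python
import random


def serialize_tree(tree, ordered=True, rng=random):
    """Depth-first serialisation of a tree given as (label, children).
    '(' opens the list of children of the current node and ')' closes it.
    For unordered trees the children are visited in a uniformly random order."""
    label, children = tree
    out = [label]
    if children:
        children = list(children)
        if not ordered:
            # A uniform shuffle is equivalent to repeatedly picking the next
            # child uniformly among the children not selected yet.
            rng.shuffle(children)
        out.append("(")
        for child in children:
            out.extend(serialize_tree(child, ordered, rng))
        out.append(")")
    return out


tree = ("A", [("B", []), ("C", [])])
ordered_serialisation = serialize_tree(tree, ordered=True)
random_serialisation = serialize_tree(tree, ordered=False, rng=random.Random(0))
```

For this tree the ordered call always returns `['A', '(', 'B', 'C', ')']`, while the unordered call returns one of the two equivalent serialisations listed in the text.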

B.2.3 Learning architecture for tree problems

On the conditional tree generation task we compare our method against DRNN, introduced in [1], using the authors’ code and their evaluation protocol. DRNN uses two different hidden state vectors, a fraternal and an ancestral one. The fraternal hidden state models the evolution of the state with siblings and the ancestral one models the relation between parent and child. This relation is modeled with two types of recurrence: one between parent and child, and one between siblings. The hidden state is then used to predict the topological information (whether we grow a new branch) and the label information. The evaluation protocol treats the task as a retrieval problem quantifying the quality of the recovery of the nodes and edges of the original tree. We use the same learning architecture as the one described in C.1 with small differences in the number of hidden units and layers. Concretely we use a two-layer LSTM with 512 units followed by a two-hidden-layer network that predicts the categorical component and another two-hidden-layer network that predicts the parameters (means, variances and mixture coefficients) of 6 Gaussian mixtures. In the two latter networks the number of hidden units is tuned on the validation set from $2^{i}:i=5\ldots 10$. The $\lambda$ parameter of the structural constraints regulariser is also tuned from the set $(0,1,10,100)$. We choose the best model on the validation set and report its testing error. As before we use Adam for optimization. We use a mini-batch size of 32 instances for DRNN. For S-RNN a mini-batch contains 64 serialisations which are generated from 32 instances.

B.2.4 Tree results

We give the results for the conditional tree generation task in table 4. S-RNN outperforms DRNN, a method specifically developed to learn with trees, by a large margin on both the F1 and precision measures for nodes and edges. It fares worse for node and edge recall. DRNN has a perfect recall, at the cost of generating trees which have many superfluous elements.

In the regression task we compare S-RNN against two variants of the Tree Echo State Network, TreeESN-R and TreeESN-M [10]. Both TreeESN methods are reservoir computing models which generalize the reservoir computing paradigm to tree-structured data. The difference between the two variants lies in how they aggregate the state vectors to represent the complete tree. The R variant only uses the state of the root whereas the M variant averages over all states of the tree. We experiment with ordered and unordered trees. The evaluation error is the mean absolute error. We give the performance results in table 5 (average predictive error). The two variants of S-RNN, i.e. trained on ordered and unordered trees, give better results than TreeESN-R, while they perform worse than TreeESN-M. The performance of all methods is quite remarkable given that the scalar values to predict range from $-164\,^{\circ}\mathrm{C}$ to $174\,^{\circ}\mathrm{C}$.

In order to check the behavior of S-RNN with respect to overfitting we also plot the evolution of the loss in the train/validation/test sets in figure 4 as a function of the training epoch number. When it comes to the conditional tree generation task and the tree regression task with the unordered trees, there is hardly any divergence between the training, validation, and test losses. In the case of ordered trees we do observe an important divergence starting from around the 20th epoch. In serialising an ordered tree there is no randomness since we have to respect the order, thus an ordered tree has a single serialisation, contrary to the unordered ones which have multiple serialisations. Exposing the learning algorithm to multiple random, but equivalent, serialisations provides clear benefits in terms of protection against overfitting.

Appendix C Graph/Molecule problems

We experiment with S-RNN on a set of regression tasks and datasets where instances are graphs. We use the QM9 dataset from the deepchem [26] benchmark and the Guacamole dataset from [6]. In both cases the goal is to predict a number of properties from the molecule structure. For the QM9 dataset, the regression problems/targets are given in the dataset and are: mu, alpha, HOMO, LUMO, gap, R2, ZPVE, Cv, U0, u298, h298, g298. For the Guacamole dataset, we computed the following targets using RDKit: logP, mol_weight, num_atoms, num_H_donors, tpsa.

In the publicly available versions of the two datasets the molecules are represented by their canonical SMILES strings [37]. Our randomised serialisations are generated by exporting the SMILES strings using RDKit [21] with the randomize option on. Our vocabulary consists of the symbols of the SMILES strings and we use a one-hot encoding. We measure performance with the squared Pearson correlation coefficient. As before we learn over the sequences using a GRU [7] with 128 units, on top of which we use a single layer with 128 hidden units to solve the regression task. We did not use regularisation ($\lambda=0$). As optimizer we use Adam [18] with a learning rate of $10^{-4}$.
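For reference, the squared Pearson correlation coefficient between targets and predictions can be computed as in the sketch below. This is the textbook definition written in plain Python for illustration; the function name is ours and it is not taken from the benchmark code.

```python
def squared_pearson(y_true, y_pred):
    """Squared Pearson correlation r^2 between two equally long sequences."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("r^2 is undefined for a constant sequence")
    return cov * cov / (var_t * var_p)


score = squared_pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Note that this measure is invariant to the scale and offset of the predictions, so a value close to 1 indicates a nearly perfect linear relationship rather than a small absolute error.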

To compute the baseline results on the QM9 dataset we used the benchmark script from deepchem [26] with the default options. We give the complete results in tables 6 and 7. Below is a short description of the baselines we used for the QM9 task:

tf regression

MultitaskDNN is a standard MLP that predicts multiple tasks.

tf regression ft

Fit Transformer MultitaskDNN is a variant of the previous which in addition does a binarization and transformation of the input feature. We used the best domain knowledge guided transformation on this dataset as determined by the benchmark author[26].

graph conv reg

Graph convolution regression is a variant of graph convolution which produces a fingerprint of the molecule that is then used by a classifier.

weave regression

Weave is a variant of graph convolution which does the convolution on the whole molecule.

dtnn

Deep Tensor Neural Network uses the 3D coordinates of the atoms as an additional feature. Using these coordinates, it applies an update mechanism based on the distance matrix and the neighborhood. This is the only model that uses the 3D coordinates.

For the Guacamole dataset, we did not find existing results in a regression setting. All baselines we found were in the generation setting. We decided to still include the Guacamole results to show the scalability of our method. We give the complete results in table 8.

An interesting particularity of the mu task in QM9 is that all models that use only the graph and no 3D position information perform significantly worse than dtnn, which uses the 3D position information. For the Guacamole experiment we see that we almost perfectly predict the targets. This is not too surprising, as the targets were computed with RDKit and are quite simple. From these experiments we see that even with a simple sequence model we can reach state-of-the-art performance on learning tasks as complex as molecular property prediction.

The learning curves on the mu task of QM9 for the canonical and non-canonical SMILES provide an eloquent demonstration of the benefits of randomization (figure 5). There we see that with the randomized SMILES the learning curves on the training and test sets exhibit very similar relative behaviors even after 6k epochs. This is not the case for the canonical SMILES, where the train and test learning curves diverge very early. Without randomization, the model may overfit on a particular SMILES string. With randomization, the additional complexity of all possible equivalent SMILES strings forces the model to generate a representation which is compatible with the randomization procedure.

C.1 Multivariate dynamical systems datasets

C.1.1 Data structure of multivariate dynamical systems

We explore the performance of S-RNN on two datasets arising from multivariate dynamical systems (artificial and real-world). We generate the artificial dataset using a known dynamical system. The real world dataset contains recordings of gait trajectories (joint angle values) of people with pathological gait. Both datasets have a similar structure and differ only by their dimensionalities and cardinalities. In particular every training instance consists of two components: $\mathbf{x}\in\mathbb R^{d}$ , which we call input, and $\mathbf{Y}\in\mathbb R^{k\times l}$ which we call output; with the latter being a probabilistic function of the former. We will denote the $i,j$ element of $\mathbf{Y}$ by $y_{ij}$ , the $j$ column by $y_{.j}$ and the $i$ row by $y_{i.}$ ; functions $l(y_{i.})$ and $l(x_{i})$ return the name of the feature they take as argument. The $\mathbf{Y}$ matrix contains a $k$ -dimensional dynamical system uniformly sampled at $T$ time points. We solve two types of tasks. A conditional generation task in which the goal is to learn the conditional density $P(\mathbf{Y}|\mathbf{x})$ and use that for sampling and prediction and an unconditional generation task in which we seek to learn $P(\mathbf{Y})$ and sample from it. In both cases we measure performance with the negative log-likelihood.

C.1.2 Serialisation of multivariate dynamical systems

We now describe the concrete serialization structure that the serialization algorithm produces for a particular $\mathbf{Y}$ matrix and for an ($\mathbf{x},\mathbf{Y}$) couple. Our dictionary $\mathbb B$ contains two types of elements, categorical and real-valued. The domain of the categorical elements is $\{l(x_{i})|i:=1...d\}\cup\{l(y_{i.})|i:=1...k\}\cup\{t+\}$, i.e. the names of the features of the $\mathbf{x}$ and $\mathbf{Y}$ components and $t+$; the latter denotes a shift from a column of the $\mathbf{Y}$ matrix to the next one, essentially it corresponds to moving to the next element of a multivariate sequence. The real-valued elements are the values of the features. Within a serialisation a categorical element is always coupled with a real value. A feature name is coupled with the respective feature value and $t+$ is always coupled with zero. The categorical elements are encoded with a one-hot vector.

When serialising a matrix $\mathbf{Y}$ and currently at column $j$, the serialization algorithm randomly chooses which of the features that have not yet been added to add next. Thus $\mu$ is uniform over the non-selected features. Once all features of column $j$ have been sampled, the $t+$ operator is selected as the next element of the serialisation, and the serialization algorithm proceeds with the serialisation of the next sequence element. When we serialise an ($\mathbf{x},\mathbf{Y}$) couple the sampling measure $\mu$ is different. In half of the cases we first add all elements of the $\mathbf{x}$ component to the serialisation before moving to the serialisation of the $\mathbf{Y}$ component. In the other half, sampling between the $\mathbf{x}$ and $\mathbf{Y}$ components is uniform, i.e. $\mathbf{x}$ and $\mathbf{Y}$ features can be interleaved; the serialisation order of $\mathbf{Y}$ is nevertheless the same as before. We bias serialisation towards selecting the $\mathbf{x}$ components first because we want to sample from and learn the conditional distribution $P(\mathbf{Y}|\mathbf{x})$, so the conditioning component should appear first in the serialisation. However, we still allow uniform sampling between the $\mathbf{x}$ and $\mathbf{Y}$ components in half of the cases so that the learner has more chance to pick up on correlations between parts of the input and parts of the output. All serialisations are generated on the fly during training.
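The sampling procedure just described can be sketched in a few lines of Python. This is an illustrative reading, not the released implementation: the function names are hypothetical, and the interleaved case is implemented as a uniformly random merge that keeps the $\mathbf{Y}$ order intact, which is one possible interpretation of "uniform" sampling between the two components.

```python
import random


def serialise_Y(Y, y_names, rng):
    # Y is a list of k rows, each with l values; serialise column by column.
    k, l = len(Y), len(Y[0])
    out = []
    for j in range(l):
        order = list(range(k))
        rng.shuffle(order)  # uniform over the not-yet-selected features
        out.extend((y_names[i], Y[i][j]) for i in order)
        out.append(("t+", 0.0))  # shift to the next column
    return out


def serialise_xY(x, x_names, Y, y_names, rng):
    y_part = serialise_Y(Y, y_names, rng)
    x_part = list(zip(x_names, x))
    rng.shuffle(x_part)
    if rng.random() < 0.5:
        return x_part + y_part  # half of the cases: all of x first
    # Other half (assumed scheme): random merge preserving the Y order.
    merged, xi, yi = [], 0, 0
    while xi < len(x_part) or yi < len(y_part):
        left_x, left_y = len(x_part) - xi, len(y_part) - yi
        if rng.random() < left_x / (left_x + left_y):
            merged.append(x_part[xi])
            xi += 1
        else:
            merged.append(y_part[yi])
            yi += 1
    return merged


rng = random.Random(0)
Y = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # k = 3 features, l = 2 time points
y_names = ["y1", "y2", "y3"]
y_only = serialise_Y(Y, y_names, rng)
couple = serialise_xY([0.1, 0.2], ["x1", "x2"], Y, y_names, rng)
```

Calling the functions again with the same instance yields a different serialisation, which is how new training sequences can be produced on the fly.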

C.1.3 Learning architecture for multivariate dynamical systems

We describe the learning architectures we use. Note that these architectures are essentially the same for the baseline learning algorithms (against which we will compare) and our algorithm S-RNN. The architectural differences are only the result of the structure of the training data. In the case of the baseline algorithms these are either standard vectorial data, i.e. here the $\mathbf{x}$ component, or a $k$ -dimensional sequence, i.e. the $\mathbf{Y}$ component. In the case of S-RNN the training data are the serialisations/sequences produced from a given training instance $\mathbf{Y}$ or ( $\mathbf{x},\mathbf{Y}$ ), where each serialisation element is a couple with a categorical component and its respective real value.

We first describe the baseline architecture. We model the probability of the next $k$-dimensional element in a sequence given the current state as an $m$-component mixture of Gaussians, the parameters of which we learn. For both the unsupervised and the supervised case the core architectural element is a multivariate LSTM. For the unsupervised setting we use a two-layer LSTM ([15]), with 128/256 units in each layer for the artificial/gait datasets respectively, followed by a one-hidden-layer neural network with 128/64 units for the artificial/gait datasets respectively. The network is fed sequentially with the $k$-dimensional sequence of the $\mathbf{Y}$ matrix and predicts the means, covariance matrices, and mixture weights of the Gaussian mixture (thus its output is of dimensionality $m\times(k+k\times k)+m$), which provides the conditional distribution of the next sequence element. For the artificial data the mixture has only 1 Gaussian component and for the gait data it has 6. For the supervised setting we use an encoder-decoder architecture built on top of the architecture we just described ([32]). The encoder has the same architecture as the two-layer LSTM we just described and is fed with the $\mathbf{x}$ component, i.e. a single-element sequence. The hidden states and cell states of the two-layer encoder are fed to the respective states of the decoder, which has the same two-layer architecture and, as in the unsupervised case, feeds into a single-layer neural network. All dimensionalities are the same as before.
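To make the output dimensionality $m\times(k+k\times k)+m$ concrete, the sketch below computes it and splits a flat output vector into means, covariance entries and mixture weights. The layout of the flat vector (means first, then covariances, then weights) is an assumption for illustration; the text does not specify it, nor how the raw outputs are constrained to valid covariances and normalised weights.

```python
def mixture_output_size(m, k):
    # m means of size k, m covariance matrices of size k x k, m mixture weights.
    return m * (k + k * k) + m


def split_mixture_output(flat, m, k):
    # Assumed layout: all means, then all covariance entries, then the weights.
    assert len(flat) == mixture_output_size(m, k)
    means = [flat[c * k:(c + 1) * k] for c in range(m)]
    base = m * k
    covariances = [
        [flat[base + c * k * k + r * k: base + c * k * k + (r + 1) * k]
         for r in range(k)]
        for c in range(m)
    ]
    raw_weights = flat[base + m * k * k:]
    return means, covariances, raw_weights


artificial_size = mixture_output_size(m=1, k=3)  # 1 component, 3 dimensions
gait_size = mixture_output_size(m=6, k=8)        # 6 components, 8 joint angles
means, covariances, raw_weights = split_mixture_output(list(range(13)), m=1, k=3)
```

For the artificial data ($m=1$, $k=3$) this gives 13 outputs, and for the full 8-dimensional gait sequence with $m=6$ it gives 438.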

For S-RNN, since each element of the serialisation has a categorical component and a continuous one, we need to adapt the learning architecture to that structure. We use exactly the same architecture for the supervised and unsupervised experiments, since the serialisation structure does not change between the two. To adapt the baseline architecture described in the previous paragraph to the serialisation structure, we add a second one-hidden-layer network, which is fed by the output of the two-layer LSTM and, together with a soft-max layer, models the conditional probability of the categorical part of the next element in the serialisation. The continuous component is predicted using the same architecture as the one described above for predicting the $k$-dimensional element of a sequence, with the only difference that, since the value is a scalar, the network has $m\times(1+1)+m$ outputs, predicting the means, variances and mixture weights of the $m$-component Gaussian mixture.
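A possible per-element loss for this two-headed output is sketched below in plain Python. It assumes that the negative log-likelihood of a serialisation element is the sum of a categorical term (soft-max over feature names) and a scalar Gaussian-mixture term; this factorisation, the soft-max normalisation of the mixture weights and all function names are assumptions of the example and are not stated in the text.

```python
import math


def softmax(logits):
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_pdf(value, mean, variance):
    norm = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-0.5 * (value - mean) ** 2 / variance) / norm


def scalar_head_size(m):
    # m means, m variances and m mixture weights for a scalar value.
    return m * (1 + 1) + m


def element_nll(category_logits, category_index, means, variances,
                weight_logits, value):
    # Categorical part: probability of the observed feature name (or t+).
    p_category = softmax(category_logits)[category_index]
    # Continuous part: density of the coupled value under the scalar mixture.
    weights = softmax(weight_logits)
    p_value = sum(w * gaussian_pdf(value, mu, var)
                  for w, mu, var in zip(weights, means, variances))
    return -math.log(p_category) - math.log(p_value)


nll = element_nll([0.0, 0.0, 0.0, 0.0], 2, means=[0.0], variances=[1.0],
                  weight_logits=[0.0], value=0.0)
```

Summing this quantity over all elements of a serialisation gives the negative log-likelihood of the whole sequence under these assumptions.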

We optimize all architectures using Adam ([18]). We use a mini-batch size of 32 instances for the baseline methods. In the case of S-RNN a mini-batch contains 64 serialisations which are generated from 32 instances.

C.2 Artificial dynamical system

We use a pair of Van der Pol equations linked to a harmonic oscillator to generate the artificial dynamical system. The coupling creates correlations between the variables, which the learner needs to capture. Here the dimensionality $d$ of $\mathbf{x}$ is 9, and the dimensionality of $\mathbf{Y}$ is $3\times 21$. Given an input $\mathbf{x}$, its matrix $\mathbf{Y}$ is generated by:

[TABLE]

The input vector $\mathbf{x}$ contains the initial conditions of the dynamical system and the values of its parameters: $y_{1}(0)$, $y_{2}(0)$, $y_{3}(0)$, $\dot{y_{1}}(0)$, $\dot{y_{2}}(0)$, $\dot{y_{3}}(0)$, $k-3$, $\mu_{y_{2}}-3$ and $\mu_{y_{3}}-3$. These are generated randomly for each ($\mathbf{x},\mathbf{Y}$) pair. We generated 3000 instances of length 21, which we divided equally into training, validation, and testing sets. We train for 12 hours or until the validation error becomes larger than the validation error of the first iteration, which happens very often for the baseline. We then select the model with the lowest validation error and apply it to the test set to compute the conditional negative log-likelihood. With the artificial dynamical system we only experiment in the conditional generation setting; the generated $\mathbf{Y}$ component is the one that maximizes the conditional likelihood $P(\mathbf{Y}|\mathbf{x})$.
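The stopping and model-selection rule can be sketched as follows. The sketch works on a recorded list of validation errors, one per evaluation; the 12-hour time budget is represented only implicitly by the length of that list, and the function name is hypothetical.

```python
def stopping_and_selection(validation_errors):
    # Stop at the first evaluation whose error exceeds that of the first one;
    # if this never happens, run to the end of the recorded history.
    first = validation_errors[0]
    stop = len(validation_errors) - 1
    for i, err in enumerate(validation_errors):
        if i > 0 and err > first:
            stop = i
            break
    # Select the evaluation with the lowest validation error seen so far.
    best = min(range(stop + 1), key=lambda i: validation_errors[i])
    return stop, best


stop_index, best_index = stopping_and_selection([5.0, 4.0, 3.5, 3.8, 5.2, 2.0])
no_stop = stopping_and_selection([5.0, 4.0, 3.0])
```

In the first toy history training stops at the fifth evaluation, so the later value 2.0 is never reached and the model from the third evaluation is selected.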

In the left part of figure 6 we give the evolution of the conditional log-likelihood on the validation set as a function of the number of epochs seen. The most striking observation is that S-RNN never overfits; this is even more clearly demonstrated in the middle graph of the same figure, where we give the evolution of the likelihood on the train and test sets for both S-RNN and RNN. The standard RNN starts overfitting after around 160 epochs. Due to the combinatorial complexity of serialisation generation, S-RNN will practically never see the same serialised instance twice, so it can keep training practically indefinitely without overfitting. As we can see in Table 9, S-RNN with no regularisation achieves the best result, significantly better than the baseline; we assessed statistical significance with a t-test. Mildly regularising S-RNN does not seem to bring any performance gain, while strong regularisation harms performance. That regularisation brings no benefit can be explained by the fact that the algorithm never sees the same serialisation twice, and thus there is no overfitting problem.

In order to inspect the visual quality of the predictions, we give in figure 7, for a given $\mathbf{x}$ component, the three components of the output sequence which has the maximum conditional probability $P(y_{1}|\mathbf{x})$ for RNN and S-RNN. S-RNN clearly produces sequences of better quality, closer to the real sequence. The serialization algorithm gives different predictions for different serialisations of the $\mathbf{x}$ component.

C.3 Gait data

The gait dataset contains data for 806 patients. Every patient has an $\mathbf{x}$ component, a 212-dimensional vector describing clinical properties of the patient related to their body geometry and articulation flexibility. The $\mathbf{Y}$ component is an 8-dimensional sequence with 34 observations. The sequence describes a complete gait cycle of the patient, uniformly sampled at 34 points. Each of the dimensions is an angular measurement on a joint of the patient. For each patient we have on average 6 gait cycles, giving a total of 4680 cycles. We decided to define learning instances at the level of cycles, thus we have a total of 4680 instances, all of which have an $\mathbf{x}$ and a $\mathbf{Y}$. As a result patients can appear multiple times (depending on how many cycles they have); their $\mathbf{x}$ component is always the same. When dividing into training, validation and testing sets, we took care to put all instances of a given patient in only one of the three sets. The training set contains 408 patients and their 3276 cycles, the validation set contains 16 patients and 1404 cycles, and the testing set 382 patients and 1404 cycles. The stopping rule is the same as for the artificial dynamical system.
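A patient-level split of this kind can be sketched as below: patients, not cycles, are assigned to the three sets, and every cycle follows its patient. The function name, the random assignment of patients and the toy patient identifiers are assumptions; the text only states that all instances of a patient end up in a single set.

```python
import random


def split_by_patient(patient_of_cycle, n_train, n_val, seed=0):
    # Shuffle the distinct patients, then assign whole patients to each set.
    patients = sorted(set(patient_of_cycle))
    random.Random(seed).shuffle(patients)
    train = set(patients[:n_train])
    val = set(patients[n_train:n_train + n_val])
    split = {"train": [], "val": [], "test": []}
    for cycle_index, patient in enumerate(patient_of_cycle):
        if patient in train:
            split["train"].append(cycle_index)
        elif patient in val:
            split["val"].append(cycle_index)
        else:
            split["test"].append(cycle_index)
    return split


# Toy data: 5 hypothetical patients with a varying number of cycles each.
patient_of_cycle = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5"]
split = split_by_patient(patient_of_cycle, n_train=3, n_val=1)
```

Because the number of cycles varies between patients, the proportions of cycles in the three sets generally differ from the proportions of patients.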

We first report the results on unconditional generation, in which our goal is to learn a model of $P(\mathbf{Y})$. In table 10 we give the negative log-likelihood on the test set for different dimensionalities of the gait sequence. S-RNN always performs considerably better than the RNN baseline.

In figure 8 we give examples of samples generated from S-RNN, RNN and real gait cycles respectively. Although the graphs are not conclusive it seems that S-RNN preserves more of the real gait cycle structure, while the ones generated from RNN seem to have a more random structure.

In the conditional generation we experimented with one and two angles. This time when it comes to one angle S-RNN is significantly worse compared to RNN. The situation is reversed when we consider two angles. We hypothesize that the low performance in the one-angle setting is because most of the network representation power is consumed in learning and expressing correlations between the input features. With two angles we are able to learn and express correlations between the angles themselves, thus the better performance.

As with the artificial dataset, we also check the behavior of S-RNN on the two-angle dataset with respect to overfitting by visualising the evolution of the negative log-likelihood on the training and testing sets as a function of the number of epochs (right graph in figure 6). As was also the case with the artificial dataset, we never observe a divergence between the performance on the training and the testing set; in fact, here we even trained S-RNN for 1000 epochs, at which point we stopped without having observed any divergence between the two losses. For RNN the overfitting is very severe and again sets in at around 100 epochs.

To visualise the quality of the predictions and how they are affected by the serialisation of $\mathbf{x}$ that we feed to S-RNN in order to generate the $\mathbf{Y}$ component, we give in the left part of figure 9 the different predictions we get for one angle under different permutations of the $\mathbf{x}$ vector. As we can see, the predicted gait curves are globally consistent and rather similar to the true gait curve. Finally, in the right part of figure 9 we give the multiple gait cycles of a single patient, the different predictions produced by S-RNN using different serialisations of $\mathbf{x}$, and the prediction produced by RNN. Again, the predictions generated by S-RNN are much more consistent with the true gait structure than the one generated by RNN, which is considerably off from the true data structure.

Appendix D Hardware infrastructure

The results in this paper were computed on a variety of hardware. As GPUs we used:

• GeForce 980.
• GeForce 1070.
• GeForce 1080.
• Titan Xp.
• P100.
• RTX 2080 Ti.

As CPUs we used:

• i7-5820K.
• i7-7700HQ.
• i9-9900K.
• i9-9820X.
• E5-2630.

The runtime of each experiment presented in the paper is at most 2 days on a single GPU. Some of the preliminary studies ran for longer.

Appendix E Model Complexity

There are 4 components in our model that have a time complexity:

1. The computation and back-propagation of the RNN.

2. The computation and back-propagation of the Constraint regularizer on the state.

3. The serialization of the instance together with the computation of the state.

4. The computation of which states of the mini-batch are the same.

The two back-propagation steps are done during training and cannot be pre-computed; they are on the critical path of the algorithm. The last two steps, in contrast, are pre-processing steps which can be computed asynchronously and in parallel. These steps are also computed on CPUs. With enough CPU cores, these last two steps have no influence on the training time.
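The idea of keeping pre-processing off the critical path can be illustrated with a small producer/consumer sketch. It uses a background thread and a bounded queue from the Python standard library purely as a stand-in; the function names are hypothetical, and real CPU parallelism in Python would typically need several worker processes rather than one thread.

```python
import queue
import threading


def prefetch(make_batch, n_batches, buffer_size=4):
    # A background worker prepares batches while the consumer trains.
    buffer = queue.Queue(maxsize=buffer_size)

    def worker():
        for i in range(n_batches):
            buffer.put(make_batch(i))  # blocks when the buffer is full
        buffer.put(None)  # sentinel: no more batches

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is None:
            return
        yield batch


# Toy stand-in for "serialise instances and compute their states".
batches = list(prefetch(lambda i: i * i, n_batches=5))
```

As long as the worker fills the buffer faster than the training loop empties it, the consumer never waits for pre-processing.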

The complexity of an RNN is linear in the length of the sequence it processes, which in our case is given by the complexity of a single data instance. The complexity of the regularizer is also linear in the length of the serialization.

For all the structures we considered, the complexity of serializing an instance was linear in the length of the serialization.

Finally, the complexity of finding which states are equivalent is linear in the length of the serialization. However, to obtain this linear complexity we need to use a hash table on the set. Defining such a hash function for every structure may not be easy.
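For set-valued states, the hash-table lookup can be sketched as below. Representing a state as a `frozenset` of already-serialised elements is an assumption made for this example; it gives an order-independent hash for sets, whereas other structures would need their own canonical key.

```python
from collections import defaultdict


def group_equal_states(states):
    # Map each state to an order-independent key and bucket the positions.
    buckets = defaultdict(list)
    for position, state in enumerate(states):
        buckets[frozenset(state)].append(position)
    # Keep only the states that occur more than once in the mini-batch.
    return [positions for positions in buckets.values() if len(positions) > 1]


# Toy mini-batch: positions 0 and 1 hold the same set, as do positions 2 and 4.
states = [["a", "b"], ["b", "a"], ["a"], ["c"], ["a"]]
equal_groups = group_equal_states(states)
```

Each state is hashed once, so the expected cost is linear in the total size of the states rather than quadratic in the number of states compared.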

In conclusion, the complexity of our model is not very different from that of an RNN.

Appendix F Software infrastructure

The first version of our code was implemented in Torch [8] and C++ [16]. The current version is based on PyTorch [25] and C++. Additionally, to manipulate the different datasets we used the following libraries:

• PCL [28] to manipulate point cloud data.
• RDKit [21] to read and export SMILES.
• deepchem [26] to run the molecule baseline and export the SMILES.
• Sol2 [33], a Lua wrapper to communicate between C++ and Torch.
• Boost [5], a generic C++ toolkit to read and manipulate data.

Additionally to facilitate deployment on our clusters we used Docker, Kubernetes, Singularity and Shifter.

Appendix G Code release

We made the full code and datasets needed to run the Set and Graph experiments available at the following location: https://gitlab.com/nips6828Submission.

The Set experiment is available at https://gitlab.com/nips6828Submission/pointcloud and the Graph experiment is available at https://gitlab.com/nips6828Submission/molecule.

For ease of use, we also published an official Docker container (www.docker.com/) for both repositories. To use the Docker container you need a modern Linux kernel ( $>3.xx$ ), an NVIDIA GPU with an up-to-date driver ( $>4xx$ ), and nvidia-docker (https://github.com/NVIDIA/nvidia-docker).

Docker-compatible variants such as Singularity, Shifter or Kubernetes can also be used.

Once the prerequisites are installed the image can be downloaded for the pointcloud dataset by:

docker pull registry.gitlab.com/nips6828submission/pointcloud:latest

or for the molecule dataset by:

docker pull registry.gitlab.com/nips6828submission/molecule:latest

Then you can enter the container with:

docker run --runtime=nvidia -it registry.gitlab.com/nips6828submission/pointcloud:latest

or

docker run --runtime=nvidia -it registry.gitlab.com/nips6828submission/molecule:latest

Once inside, the different binaries can be executed. Help about the options can be obtained by using the --help option.

Note that these containers are released mainly for demonstration purposes. For real experiments it is recommended to store the dataset together with the results in a mounted folder.

In case of issues running the code, an email can be sent to [email protected] or an issue can be posted to the repository.

Upon acceptance of the paper the code will be published under our real names.

Bibliography

[1] David Alvarez-Melis and Tommi S. Jaakkola. Tree-structured decoding with doubly-recurrent neural networks. November 2016.
[2] Maria-Florina Balcan and Kilian Q. Weinberger, editors. Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings. JMLR.org, 2016.
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph
[4] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum Learning.
[5] Boost. Boost C++ Libraries. http://www.boost.org/, 2019.
[6] Nathan Brown, Marco Fiscato, Marwin H. S. Segler, and Alain C. Vaucher. GuacaMol: Benchmarking Models for De Novo Molecular Design. arXiv:1811.09621 [physics, q-bio], November 2018. arXiv: 1811.09621.
[7] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs, stat], June 2014. arXiv: 1406.1078.
[8] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.