Analyzing Populations of Neural Networks via Dynamical Model Embedding

Jordan Cotler; Kai Sheng Tai; Felipe Hernández; Blake Elias; David Sussillo

arXiv:2302.14078·cs.LG·March 1, 2023

TL;DR

This paper introduces DYNAMO, a novel algorithm that embeds neural networks into a low-dimensional manifold, enabling analysis, clustering, averaging, and semi-supervised learning of models based on their high-level computational similarities.

Contribution

DYNAMO constructs a model embedding space for neural networks, facilitating interpretation and manipulation of models beyond traditional parameter-based methods.

Findings

1. Model embedding spaces cluster networks by computational process
2. Model averaging produces networks with similar performance
3. Semi-supervised learning benefits from the embedding space

Equations

$$\mathcal{L}_{\text{output}}[\widetilde{F},\widetilde{G},\theta_{n}]$$

$$\mathcal{L}_{\text{hidden}}[\widetilde{F},\theta_{n},V_{n}]$$

$$\mathbb{E}_{\{x_{t}\}\sim\mathcal{D}_{n}}\Big[\mathcal{L}_{\text{hidden}}[\widetilde{F},\theta_{n},V_{n}]+\lambda\,\mathcal{L}_{\text{output}}[\widetilde{F},\widetilde{G},\theta_{n}]\Big]$$

$$\mathcal{L}[\widetilde{F},\widetilde{G},\{\theta_{n}\}_{n=1}^{N},\{V_{n}\}_{n=1}^{N}]:=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\{x_{t}\}\sim\mathcal{D}_{n}}\Big[\mathcal{L}_{n,\text{hidden}}[\widetilde{F},\theta_{n},V_{n}]+\lambda\,\mathcal{L}_{n,\text{output}}[\widetilde{F},\widetilde{G},\theta_{n}]\Big].$$

$$\mathcal{L}^{\text{CNN}}_{\text{output}}[\widetilde{F},\theta]$$

$$\mathcal{L}^{\text{CNN}}_{\text{hidden}}[\widetilde{F},\theta_{n},\{V_{n,t}\}_{t=1}^{B}]$$

$$\mathcal{L}^{\text{CNN}}[\widetilde{F},\{\theta_{n}\}_{n=1}^{N},\{\{V_{n,t}\}_{t=1}^{B}\}_{n=1}^{N}]:=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{x_{0}\sim\mathcal{D}_{n}}\Big[\mathcal{L}^{\text{CNN}}_{n,\text{hidden}}[\widetilde{F},\theta_{n},\{V_{n,t}\}_{t=1}^{B}]+\lambda\,\mathcal{L}^{\text{CNN}}_{n,\text{output}}[\widetilde{F},\theta_{n}]\Big]$$

$$\text{Score}(\theta)=\sum_{x\in W_{\text{positive}}}\widetilde{G}(\widetilde{F}(\theta,x,h^{*}))-\sum_{x\in W_{\text{negative}}}\widetilde{G}(\widetilde{F}(\theta,x,h^{*}))-\sum_{x\in W_{\text{neutral}}}\big|\widetilde{G}(\widetilde{F}(\theta,x,h^{*}))\big|,$$

$$F(x)=(\Phi^{-1}\circ E\circ\Phi)(x).$$

$$F^{n}(x)=(\Phi^{-1}\circ E^{n}\circ\Phi)(x).$$

$$\mathcal{L}_{\text{hidden}}[\widetilde{F},\theta,V]:=\frac{1}{T}\sum_{t=1}^{T}\left\|V(h_{\theta,t})-h_{t}\right\|_{2}^{2},$$

$$F(x,h)=(V\circ\widetilde{F})(\theta,x,V^{-1}h).$$

$$h_{t}=(F_{x_{t}}\circ F_{x_{t-1}}\circ\dots\circ F_{x_{1}})(h_{0})=(V\circ\widetilde{F}_{\theta,x_{t}}\circ\widetilde{F}_{\theta,x_{t-1}}\circ\dots\circ\widetilde{F}_{\theta,x_{1}})(V^{-1}h_{0})=Vh_{\theta,t}$$

$$F(x,Vh)=(V\circ\widetilde{F})(\theta,x,h).$$

$$F_{n}(x,V_{n}h)\approx(V_{n}\circ\widetilde{F})(\overline{\theta},x,h)$$


Taxonomy

Topics: Model Reduction and Neural Networks · Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis

Jordan Cotler Harvard Society of Fellows, Harvard University

Kai Sheng Tai Department of Computer Science, Stanford University

Felipe Hernández Department of Mathematics, Stanford University

Blake Elias

David Sussillo Department of Electrical Engineering & Wu Tsai Neurosciences Institute, Stanford University

Abstract

A core challenge in the interpretation of deep neural networks is identifying commonalities between the underlying algorithms implemented by distinct networks trained for the same task. Motivated by this problem, we introduce Dynamo, an algorithm that constructs low-dimensional manifolds where each point corresponds to a neural network model, and two points are nearby if the corresponding neural networks enact similar high-level computational processes. Dynamo takes as input a collection of pre-trained neural networks and outputs a meta-model that emulates the dynamics of the hidden states as well as the outputs of any model in the collection. The specific model to be emulated is determined by a model embedding vector that the meta-model takes as input; these model embedding vectors constitute a manifold corresponding to the given population of models. We apply Dynamo to both RNNs and CNNs, and find that the resulting model embedding spaces enable novel applications: clustering of neural networks on the basis of their high-level computational processes in a manner that is less sensitive to reparameterization; model averaging of several neural networks trained on the same task to arrive at a new, operable neural network with similar task performance; and semi-supervised learning via optimization on the model embedding space. Using a fixed-point analysis of meta-models trained on populations of RNNs, we gain new insights into how similarities of the topology of RNN dynamics correspond to similarities of their high-level computational processes.

1 Introduction

A crucial feature of neural networks with a fixed network architecture is that they form a manifold by virtue of their continuously tunable weights, which underlies their ability to be trained by gradient descent. However, this conception of the space of neural networks is inadequate for understanding the computational processes the networks perform. For example, two neural networks trained to perform the same task may have vastly different weights, and yet implement the same high-level algorithms and computational processes (Maheswaranathan et al., 2019b).

In this paper, we construct an algorithm which provides alternative parametrizations of the space of RNNs and CNNs with the goal of endowing a geometric structure that is more compatible with the high-level computational processes performed by neural networks. In particular, given a set of neural networks with the same or possibly different architectures (and possibly trained on different tasks), we find a parametrization of a low-dimensional submanifold of neural networks which approximately interpolates between these chosen “base models”, as well as extrapolates beyond them. We can use such model embedding spaces to cluster neural networks and even compute model averages of neural networks. A key feature is that two points in model embedding space are nearby if they correspond to neural networks which implement similar high-level computational processes, in a manner to be described later. In this way, two neural networks may correspond to nearby points in model embedding space even if those neural networks have distinct weights or even architectures.

The model embedding space is parametrized by a low-dimensional parameter $\theta\in\mathbb{R}^{d}$ , and each base model is assigned a value of $\theta$ in the space. This allows us to apply traditional ideas from clustering and interpolation to the space of neural networks. Moreover, each model embedding space has an associated meta-model which, upon being given a $\theta$ , is rendered into an operable neural network. If a base model is mapped to some $\theta$ , then the meta-model, upon being given that $\theta$ , will emulate the corresponding base model. See Figure 1 for a diagrammatic depiction. An interesting application is that given two base models assigned to parameters $\theta_{1}$ and $\theta_{2}$ , we can consider the averaged model corresponding to the value $(\theta_{1}+\theta_{2})/2$ . We find that this averaged model performs similarly on the task for which the two base models were trained. We also use the model embedding space to extrapolate outside the space of base models, and find cases in which the model embedding manifold specifies models that perform better on a task than any trained base model. Later on we will explain how this can be regarded as a form of semi-supervised learning.

The rest of the paper is organized as follows. We first provide the mathematical setup for our algorithmic construction of model embedding spaces. After reviewing related work, we then present results of numerical experiments which implement the algorithm and explore clustering, model averaging, and semi-supervised learning on the model embedding space. We further examine how topological features of the dynamics of RNNs in model embedding spaces are reflective of classes of high-level computational processes. Finally, we conclude with a discussion.

2 Dynamical Model Embedding

2.1 Mathematical Setup

In this Section, we provide the mathematical setup for construction of model embedding spaces. We treat this in the RNN setting, and relegate the CNN setting to Appendix A.

Notation. Let an RNN be denoted by $F(x,h)$, where $x$ is the input and $h$ is the hidden state. We further denote the hidden-to-output map by $G(h)$. We consider a collection of $N$ RNNs $\{(F_{n},G_{n})\}_{n=1}^{N}$, which we call the base models; these may each have distinct dimensions for their hidden states, but all have the same dimension for their inputs as well as the same dimension for their outputs. A sequence of inputs is denoted by $\{x_{t}\}_{t=0}^{T}$, which induces a sequence of hidden states via $h_{t+1}=F(x_{t},h_{t})$, where the initial hidden state $h_{0}$ is given. A collection of sequences of inputs, possibly each with a different maximum length $T$, is denoted by $\mathcal{D}$, which we call an input data set. We suppose that each base model RNN has an associated input data set.

2.2 Meta-Models for RNNs

Given a collection of base model RNNs, we would like to construct a meta-model which emulates the behavior of each of the base models. In this case, the meta-model is itself an RNN with one additional input $\theta\in\mathbb{R}^{d}$ and a corresponding map $\widetilde{F}(\theta,x,h)$ whose output is the next hidden state. Given a sequence of inputs $\{x_{t}\}_{t=0}^{T}$, we have a corresponding sequence of hidden states $h_{\theta,t+1}=\widetilde{F}(\theta,x_{t},h_{\theta,t})$ starting from an initial hidden state $h_{0}$ (which we suppose does not depend on $\theta$). The meta-model also includes a hidden-to-output map $\widetilde{G}(h)$ that is independent of $\theta$.
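The recurrence $h_{\theta,t+1}=\widetilde{F}(\theta,x_{t},h_{\theta,t})$ can be illustrated with a minimal NumPy sketch. It assumes a plain tanh RNN cell rather than the GRU used in the experiments, and the names `meta_rnn_step` and `rollout`, the dimensions, and the random weights are all hypothetical choices made for illustration.

```python
import numpy as np

def meta_rnn_step(params, theta, x, h):
    """One step of a toy meta-model: tanh(W_h h + W_in [theta; x] + b).

    A vanilla tanh cell is an assumption made for brevity; it is not the
    GRU parameterization used in the paper's experiments.
    """
    W_h, W_in, b = params
    return np.tanh(W_h @ h + W_in @ np.concatenate([theta, x]) + b)

def rollout(params, theta, xs, h0):
    """Return the hidden states visited for inputs xs of shape (T, d_x)."""
    h, states = h0, []
    for x in xs:
        h = meta_rnn_step(params, theta, x, h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d_theta, d_x, d_h, T = 2, 3, 4, 5
params = (
    0.5 * rng.standard_normal((d_h, d_h)),
    0.5 * rng.standard_normal((d_h, d_theta + d_x)),
    np.zeros(d_h),
)
xs = rng.standard_normal((T, d_x))
h0 = np.zeros(d_h)  # the initial hidden state does not depend on theta

# The same weights and inputs give different dynamics for different theta.
states_a = rollout(params, np.array([1.0, 0.0]), xs, h0)
states_b = rollout(params, np.array([0.0, 1.0]), xs, h0)
```

Here a single set of weights plays the role of $\widetilde{F}$, and changing only $\theta$ selects which hidden-state trajectory is produced.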

For the meta-model $(\widetilde{F},\widetilde{G})$ to emulate a particular base model $(F_{n},G_{n})$ with respect to its corresponding data set $\mathcal{D}_{n}$ , we consider the following criteria: there is some $\theta_{n}$ for which

1. $\widetilde{G}(h_{\theta_{n},t})\approx G_{n}(h_{t})$ for all $t>0$ and all input sequences in the data set; and

2. $V_{n}(h_{\theta_{n},t})\approx h_{t}$ for all $t>0$ and all input sequences in the data set,

where $V_{n}$ is a transformation of a meta-model’s hidden activity. We emphasize that $\theta_{n}$ and $V_{n}$ depend on the particular base model under consideration. The first criterion means that at some particular $\theta_{n}$, the outputs of the meta-model RNN dynamics are close to the outputs of the base model RNN dynamics. The second criterion means that at the same $\theta_{n}$, there is a time-independent transformation $V_{n}$ (i.e., $V_{n}$ does not depend on $t$) such that the transformed hidden state dynamics of the meta-model are close to the hidden state dynamics of the base model. See Figure 2 for a visualization. As depicted in the Figure, it is convenient to regard the meta-model RNN as having inputs $(\theta_{n},x)$. As such, each element of a sequence of inputs $\{x_{t}\}_{t=1}^{T}$ is paired with $\theta_{n}$, giving $\{(\theta_{n},x_{t})\}_{t=1}^{T}$.

The desired properties of the meta-model are enforced by the loss function. Defining the functions

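$$\mathcal{L}_{\text{output}}[\widetilde{F},\widetilde{G},\theta_{n}]:=\frac{1}{T}\sum_{t=1}^{T}d\big(\widetilde{G}(h_{\theta_{n},t}),\,G_{n}(h_{t})\big)\tag{1}$$

$$\mathcal{L}_{\text{hidden}}[\widetilde{F},\theta_{n},V_{n}]:=\frac{1}{T}\sum_{t=1}^{T}\left\|V_{n}(h_{\theta_{n},t})-h_{t}\right\|_{2}^{2}\tag{2}$$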

where $d$ is some suitable distance or divergence, we can construct the loss function

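$$\mathbb{E}_{\{x_{t}\}\sim\mathcal{D}_{n}}\Big[\mathcal{L}_{\text{hidden}}[\widetilde{F},\theta_{n},V_{n}]+\lambda\,\mathcal{L}_{\text{output}}[\widetilde{F},\widetilde{G},\theta_{n}]\Big]\tag{3}$$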

where we average over the choice of sequence $\{x_{t}\}$ coming from the input data set $\mathcal{D}_{n}$. Above, $\lambda$ is a hyperparameter. Our aim is to minimize equation 3 over a suitable class of $\widetilde{F},\widetilde{G},V_{n}$, as well as $\theta_{n}$; this can be implemented computationally via the Dynamo algorithm (see Algorithm 1). As a side remark, it naïvely appears that a suitable alternative choice to the $\mathcal{L}_{\text{hidden}}$ in equation 2 would be $\frac{1}{T}\sum_{t=1}^{T}\left\|h_{\theta_{n},t}-W_{n}(h_{t})\right\|_{2}^{2}$, where $W_{n}$ is a map from the hidden states of the base model to the hidden states of the meta-model. However, this would be problematic since minimization may pressure $W_{n}$ to be the zero map (or otherwise have outputs which are small in norm) and accordingly pressure the dynamics of the meta-model to be trivial (or have small norm). As such, we opt to formulate $\mathcal{L}_{\text{hidden}}$ as it is written in equation 2.

Suppose we want the meta-model to be able to emulate an entire collection of base models $\{(F_{n},G_{n})\}_{n=1}^{N}$. In particular, the meta-model will attempt to assign to the $n$th base model a $\theta_{n}$ and a $V_{n}$ so that the two criteria listed above are satisfied for that base model. These desiderata can be implemented by minimizing the loss function

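$$\mathcal{L}[\widetilde{F},\widetilde{G},\{\theta_{n}\}_{n=1}^{N},\{V_{n}\}_{n=1}^{N}]:=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\{x_{t}\}\sim\mathcal{D}_{n}}\Big[\mathcal{L}_{n,\text{hidden}}[\widetilde{F},\theta_{n},V_{n}]+\lambda\,\mathcal{L}_{n,\text{output}}[\widetilde{F},\widetilde{G},\theta_{n}]\Big].\tag{4}$$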
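As a rough illustration of how the hidden-state and output terms combine across base models, the following NumPy sketch evaluates the loss on precomputed trajectories. It makes several assumptions that are not taken from the paper: $d$ is the squared Euclidean distance, each $V_{n}$ is a linear map stored as a matrix, and the expectation over $\mathcal{D}_{n}$ is replaced by a single sequence per base model. The function names and dictionary keys are hypothetical.

```python
import numpy as np

def output_loss(meta_outputs, base_outputs):
    """Output term with d = squared Euclidean distance (an assumed choice of d)."""
    return np.mean(np.sum((meta_outputs - base_outputs) ** 2, axis=-1))

def hidden_loss(meta_hidden, base_hidden, V):
    """Hidden term: mean over t of ||V h_meta_t - h_base_t||^2, with V linear."""
    mapped = meta_hidden @ V.T  # shape (T, base_hidden_dim)
    return np.mean(np.sum((mapped - base_hidden) ** 2, axis=-1))

def dynamo_loss(per_model, lam=1.0):
    """Average of hidden + lam * output over base models (one sequence each)."""
    terms = [
        hidden_loss(m["meta_hidden"], m["base_hidden"], m["V"])
        + lam * output_loss(m["meta_outputs"], m["base_outputs"])
        for m in per_model
    ]
    return float(np.mean(terms))

rng = np.random.default_rng(0)
T, d_meta, d_base, d_out = 6, 5, 3, 2
meta_hidden = rng.standard_normal((T, d_meta))
V = rng.standard_normal((d_base, d_meta))
outputs = rng.standard_normal((T, d_out))

# A base model that the meta-model emulates exactly through V.
perfect = {
    "meta_hidden": meta_hidden, "base_hidden": meta_hidden @ V.T, "V": V,
    "meta_outputs": outputs, "base_outputs": outputs,
}
# The same hidden states, but base outputs offset by 1 in each coordinate.
shifted = dict(perfect, base_outputs=outputs + 1.0)

loss_perfect = dynamo_loss([perfect])
loss_shifted = dynamo_loss([shifted], lam=0.5)
```

In the second case only the output term is nonzero, so the total scales linearly with `lam`; this is the sense in which $\lambda$ trades off matching hidden dynamics against matching outputs.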

In some circumstances, we may want to consider base models with distinct dimensions for their output states. For instance, suppose half of the base models perform a task with outputs in $\mathbb{R}^{d_{1}}$ and the other half perform a task with outputs in $\mathbb{R}^{d_{2}}$. To accommodate this, we can have two hidden-to-output maps $\widetilde{G}_{1},\widetilde{G}_{2}$ for the meta-model, where the maps have outputs in $\mathbb{R}^{d_{1}}$ and $\mathbb{R}^{d_{2}}$ respectively. The loss function is slightly modified so that we use $\widetilde{G}_{1}$ when we compare the meta-model to the first kind of base model, and $\widetilde{G}_{2}$ when we compare the meta-model to the second kind of base model. This construction generalizes to the setting where the base models can be divided up into $k$ groups with distinct output dimensions; this would necessitate $k$ hidden-to-output functions $\widetilde{G}_{1},...,\widetilde{G}_{k}$ for the meta-model.

3 Related Work

There is a substantial body of work on interpreting the computational processes implemented by neural networks by studying their intermediate representations. Such analyses have been performed on individual models (Simonyan et al., 2013; Zeiler and Fergus, 2014; Lenc and Vedaldi, 2015) and on collections of models (Li et al., 2016; Raghu et al., 2017; Morcos et al., 2018; Kornblith et al., 2019). Prior work in this latter category has focused on pairwise comparisons between models. For example, SVCCA (Raghu et al., 2017) uses canonical correlation analysis to measure the representational similarity between pairs of models. While these methods can also be used to derive model representations by embedding the pairwise distance matrix, our approach does not require the $\Theta(N^{2})$ computational cost of comparing all pairs of base models. Moreover, Dynamo yields an executable meta-model that can be run with model embedding vectors other than those corresponding to the base models.

There is a related body of work in the field of computational neuroscience. CCA-based techniques and representational geometry are standard for comparing neural networks to the neural activations of animals performing vision tasks (Yamins and DiCarlo, 2016) as well as motor tasks. In an example of the latter, the authors of (Sussillo et al., 2015) used CCA techniques to compare brain recordings to those of neural networks trained to reproduce the reaching behaviors of animals, while the authors of (Maheswaranathan et al., 2019b) used fixed point analyses of RNNs to consider network similarity from a topological point of view.

Dynamo can be viewed as a form of knowledge distillation (Hinton et al., 2015), since the outputs of the base models serve as targets in the optimization of the meta-model. However, unlike typical instances of knowledge distillation involving an ensemble of teacher networks (Hinton et al., 2015; Fukuda et al., 2017) where individual model predictions are averaged to provide more accurate target labels for the student, our approach instead aims to preserve the dynamics and outputs of each individual base model. FitNets (Romero et al., 2015) employ a form of knowledge distillation using maps between hidden representations to guide learning; this is similar to our use of hidden state maps in training the meta-model.

Our treatment of the model embedding vectors $\{\theta_{n}\}$ as learnable parameters is similar to the approach used in Generative Latent Optimization (Bojanowski et al., 2018), which jointly optimizes the image generator network and the latent vectors corresponding to each image in the training set. Bojanowski et al. (2018) find that the principal components of the image representation space found by GLO are semantically meaningful. We likewise find that the principal components of the model embeddings found by Dynamo are discriminative between subsets of models.

Unlike methods such as hypernetworks (Ha et al., 2017) and LEO (Rusu et al., 2019) that use a model to generate parameters for a separate network, our approach does not attempt to reproduce the parameters of the base models in the collection. Instead, the meta-model aims to reproduce only the hidden states and outputs of a base model when conditioned on the corresponding embedding vector.

The core focus of our work also differs from that of the meta-learning literature (Santoro et al., 2016; Ravi and Larochelle, 2017; Finn et al., 2017; Munkhdalai and Yu, 2017), which is primarily concerned with the problem of few-shot adaptation when one is presented with data from a new task. Our empirical study centers on the post-hoc analysis of a given collection of models, which may or may not have been trained on different tasks. However, we remark that our exploration of optimization in low-dimensional model embedding space is related to LEO (Rusu et al., 2019), where a compressed model representation is leveraged for efficient meta-learning.

4 Empirical Results

In this Section, we describe the results of our empirical study of meta-models trained using Dynamo on collections of RNNs trained on NLP tasks, and collections of CNNs trained for image classification. For RNN base models, we parameterize the meta-model as a GRU where the model embedding vector $\theta$ is presented as an additional input at each time step. For CNN base models, we use a ResNet meta-model where $\theta$ is an additional input for each ResNet block. In all our experiments, we hold out half the available training data for use as unlabeled data for training the meta-model; the base models were trained on the remaining training data (or a fraction thereof). For example, our IMDB sentiment base models trained on $100\%$ of the available training data were trained on 12,500 examples, with the remaining 12,500 examples used as unlabeled data for training the meta-model. By default, we set the output loss hyperparameter $\lambda$ to $1$. We defer further details on model architectures and training to Appendices A and B.

4.1 Visualizing Model Similarity in Embedding Space

The base model embeddings $\{\theta_{n}\}_{n=1}^{N}$ can be used for cluster analysis to evaluate the similarity structure of a collection of models. We illustrate this use case of Dynamo via a series of example applications including NLP and vision tasks. Figure 3 shows model embeddings learned by Dynamo on collections of RNNs. In these plots, we observe a clear separation of these networks according to the size of the available training data, the RNN model architecture, and the specific NLP task used for training. By computing the eigenvalues of the covariance matrix corresponding to the $N$ model embeddings, we obtain a measure of the intrinsic dimensionality of the corresponding collections of models. In Figure 3, we find that 2 to 6 components are sufficient to explain $95\%$ of the variance in the model embeddings.
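The intrinsic-dimensionality measure described above can be sketched as follows: compute the eigenvalues of the covariance matrix of the $N$ embeddings and count how many are needed to reach $95\%$ of the total variance. The function name and the synthetic embeddings are hypothetical and stand in for embeddings learned by Dynamo.

```python
import numpy as np

def components_for_variance(embeddings, threshold=0.95):
    """Smallest number of principal components explaining `threshold` of variance.

    embeddings: array of shape (N, d), one row per base model.
    """
    cov = np.cov(embeddings, rowvar=False)       # (d, d) covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, descending
    eigvals = np.clip(eigvals, 0.0, None)        # guard against tiny negatives
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    # First index whose cumulative fraction reaches the threshold, as a count.
    return int(np.searchsorted(explained, threshold) + 1)

# Synthetic stand-in: 20 embeddings in R^16 lying close to a 2D subspace.
rng = np.random.default_rng(0)
latent = rng.standard_normal((20, 2))
basis = rng.standard_normal((2, 16))
embeddings = latent @ basis + 1e-3 * rng.standard_normal((20, 16))

k = components_for_variance(embeddings)
```

For this synthetic data at most two components are needed, since nearly all of the variance lies in the two-dimensional subspace spanned by `basis`.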

By tuning the output loss hyperparameter $\lambda$ in equation 4, we can adjust the degree of emphasis placed on reproducing the hidden dynamics of the base networks versus their outputs. We demonstrate this effect in Figure 4: with $\lambda=1$ , we observe clear separation of ResNet-34 models trained on CIFAR-100 with different data augmentation policies (“weak”, with shifts and horizontal flips vs. “strong”, with RandAugment (Cubuk et al., 2020)), and with different training set sizes. In contrast, with $\lambda=0$ we find a weak clustering effect corresponding to differing data augmentation, and we do not find a detectable separation corresponding to differing training set sizes. We infer that the change in data augmentation policy results in a larger difference in the learned feature representations than the change in training set size. In Appendix B we illustrate the effect of setting $\lambda=0$ for RNN models.

We additionally compare the embeddings obtained using Dynamo to those derived from SVCCA (Raghu et al., 2017), a pairwise comparison technique that aligns the representations produced by a pair of networks using canonical correlation analysis (CCA). In Figure 5, we plot the 2D embeddings obtained using multidimensional scaling (MDS) on the pairwise distance matrix computed using SVCCA (modified to output $L^{2}$ distances instead of correlations). Unlike the principal components of the Dynamo model embeddings plotted in Figure 3, the MDS coordinates are not semantically interpretable. Additionally, the cluster structure of the collection of GRUs trained with varying training set sizes is less apparent in this representation.
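For readers unfamiliar with the baseline, embedding a pairwise distance matrix with MDS can be sketched as below. This uses classical (Torgerson) MDS, which is an assumption: the text does not say which MDS variant was used, and the pairwise distances here come from synthetic points rather than from SVCCA. Note that the input is a full $N\times N$ matrix, so all pairs of models must be compared first.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical MDS: embed an (N, N) distance matrix into R^k."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]        # top-k eigenpairs
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Synthetic check: distances between 2D points are reproduced by the embedding.
rng = np.random.default_rng(0)
points = rng.standard_normal((6, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(dist, k=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The recovered coordinates agree with the original points only up to rotation and reflection, which is one reason the individual MDS axes carry no intrinsic meaning.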

Lastly, we note that Dynamo allows for flexibility in defining the metric used to compare the hidden states and outputs of the base models with those of the meta-model. In Appendix B.3, we demonstrate the benefit of using the $L^{1}$ distance for clustering CNN representations.

4.2 Extrapolation Beyond Base Model Embeddings

We study the model embedding space corresponding to a trained meta-model by conditioning it on model embedding vectors $\theta$ other than those assigned to the set of base models. Figure 6 visualizes the landscape of test accuracies for two meta-models: (i) a meta-model for 10 GRUs trained with $50\%$ of the IMDB training data and 10 GRUs trained with $25\%$ of the data; and (ii) a meta-model for 10 IMDB sentiment GRUs and 10 AG News classification GRUs.

We note two particularly salient properties of these plots. First, the test accuracy varies smoothly when interpolating $\theta$ between pairs of base model embeddings—we would in general not observe this property when interpolating the parameters of the base GRUs directly, since they were trained with different random initializations and orderings of the training examples. Second, we observe that the embedding vector that realizes the highest test accuracy lies outside the convex hull of the base model embeddings. This is perhaps surprising since typical training and inference protocols involve the use of convex combinations of various objects: for instance, averaging of predictions in model ensembles and averaging of model parameters during training (e.g., using exponential moving averages or Stochastic Weight Averaging (Izmailov et al., 2018)). This extrapolatory phenomenon suggests that Dynamo is able to derive a low-dimensional manifold of models that generalizes beyond the behavior of the base models used for training the meta-model.

4.3 Semi-Supervised Learning in Model Embedding Space

The existence of model embeddings that improve on the accuracy of the base models suggests a natural semi-supervised learning (SSL) procedure involving a trained meta-model. In particular, we minimize the loss incurred by the meta-model on a small set of additional labeled examples by optimizing the value of $\theta$. This is done by backpropagating gradients through the meta-model, with the meta-model parameters held fixed. Figure 7 shows the result of this procedure on an IMDB sentiment meta-model (previously depicted in Figure 6) with a set of additional labeled examples (disjoint from the test set) of size equal to $1\%$ of the full training set. This procedure successfully finds a $\theta$ that improves on the test accuracy of the best base model by about 6 percentage points ($86.4\%$ vs. $80.3\%$).
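The structure of this procedure, gradient descent on $\theta$ alone with all meta-model weights frozen, can be shown on a toy problem. In the sketch below the "meta-model" is a fixed logistic model whose logit is linear in the input and in $\theta$; this stand-in, the hand-derived gradient, and all names and values are assumptions chosen so the example runs without an autodiff library. A real meta-model would be a GRU or ResNet with gradients obtained by backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(weights, theta, X):
    """Toy frozen meta-model: P(y=1) = sigmoid(x . w_x + theta . w_theta)."""
    w_x, w_theta = weights
    return sigmoid(X @ w_x + theta @ w_theta)

def cross_entropy(p, y):
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_theta(weights, theta0, X, y, lr=0.5, steps=200):
    """Gradient descent on theta only; the weights are never updated."""
    _, w_theta = weights
    theta = theta0.copy()
    for _ in range(steps):
        p = predict(weights, theta, X)
        grad = np.mean(p - y) * w_theta   # d(cross-entropy)/d(theta)
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
weights = (np.array([1.0, -0.5, 0.25]), np.array([1.0, -1.0]))  # frozen
X = rng.standard_normal((50, 3))          # small labeled set
theta_true = np.array([1.0, -1.0])        # used only to generate toy labels
y = (X @ weights[0] + theta_true @ weights[1] > 0).astype(float)

theta0 = np.zeros(2)                      # e.g. start from a base model embedding
theta_fit = fit_theta(weights, theta0, X, y)
loss_before = cross_entropy(predict(weights, theta0, X), y)
loss_after = cross_entropy(predict(weights, theta_fit, X), y)
```

Because only the low-dimensional $\theta$ is optimized, a small labeled set can suffice, which is what makes the procedure attractive when labels are scarce.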

We observe that this SSL procedure achieves lower accuracy when we train the meta-model using fewer base models. In particular, a meta-model coming from only the 10 GRUs trained with $50\%$ of the training data yields a test accuracy of $85.4\%$ , and a meta-model coming from only a single GRU out of the 10 yields $81.6\%$ . This result suggests that a diversity of base models helps improve the accuracy achievable by the meta-model.

5 Dynamics of Meta-Models for RNNs

In this Section we perform an analysis of the dynamical features generated by the meta-models trained on base models that perform the sentiment classification task. Sentiment classification tasks have a well-understood dynamical structure (Sussillo and Barak, 2013; Maheswaranathan et al., 2019a, b; Aitken et al., 2020) that we can use as a basis for understanding the behavior of a corresponding meta-model. To a first approximation, the sentiment analysis task can be solved by a simple integrator that accumulates sentiment corresponding to each word (positive words such as ‘good’ or ‘fantastic’ adding positive sentiment, and negative words such as ‘bad’ or ‘terrible’ adding negative sentiment). It has been shown that simple sentiment analysis models approximate this integrator by constructing a line attractor in the space of hidden states. For instance, for the zero input $x^{*}=\vec{0}$ , it has been observed that the dynamical system generated by the map $F_{x^{*}}(h)=F(x^{*},h)$ has a tubular region with very little movement, in the sense that $\|F_{x^{*}}(h)-h\|_{2}$ is very small for hidden states $h$ in this region.

To investigate the dynamical behavior of the meta-model space, we trained a meta-model on a set of 20 base models which were themselves trained on the IMDB sentiment analysis task. Of these 20 base models, 10 were trained with $50\%$ of the available training data and the remaining 10 were trained with $100\%$ of the training data. The $\theta$ points corresponding to these base models cluster in the model embedding space according to the amount of training data. In Figure 8 we perform a fixed-point analysis of several models corresponding to points of interest in the model embedding space.

The fixed-point analysis was run according to the procedure described in (Golub and Sussillo, 2018). First we selected a set of candidate hidden states $h_{j}$ by running the model on a typical batch of inputs. For each hidden state $h_{j}$ obtained in this way, we used gradient descent on the loss $\|F(x^{*},h)-h\|_{2}^{2}$ to find the nearest approximate fixed point.
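The optimization step of this procedure can be sketched for a toy tanh RNN with zero input, where the gradient of $\|F(x^{*},h)-h\|_{2}^{2}$ is available in closed form. The cell, the contractive weight matrix, the random starting state, and the function names are assumptions made for illustration; the analysis in the text uses trained models and candidate states taken from real input batches.

```python
import numpy as np

def step(W, b, h):
    """Toy RNN update at the zero input x* = 0: F(x*, h) = tanh(W h + b)."""
    return np.tanh(W @ h + b)

def speed_and_grad(W, b, h):
    """Return q(h) = ||F(h) - h||^2 and its gradient with respect to h."""
    f = step(W, b, h)
    r = f - h
    J = (1.0 - f ** 2)[:, None] * W              # Jacobian dF/dh
    grad = 2.0 * (J - np.eye(len(h))).T @ r
    return float(r @ r), grad

def find_fixed_point(W, b, h_init, lr=0.1, steps=2000):
    """Plain gradient descent on q(h), started from a candidate state."""
    h = h_init.copy()
    for _ in range(steps):
        _, grad = speed_and_grad(W, b, h)
        h -= lr * grad
    return h

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
W = 0.5 * Q                                      # spectral norm 0.5: a contraction
b = 0.1 * rng.standard_normal(4)
h_candidate = rng.standard_normal(4)             # stand-in for a visited hidden state

h_star = find_fixed_point(W, b, h_candidate)
residual, _ = speed_and_grad(W, b, h_star)
```

A contractive toy map has a single fixed point, whereas a trained sentiment RNN typically yields many slow points; repeating the search from many candidate states is what traces out structures such as a line attractor.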

An interesting finding is that the meta-model found line attractor structures that were very geometrically similar for models within a cluster. An interpretation of this result pertaining to topological conjugacy in dynamical systems theory is discussed in Appendix C. Moreover, we find that the meta-model finds a continuous interpolation between line attractors that are relatively short and fat (corresponding to models trained on $50\%$ of the data) and those that are tall and thin (corresponding to models trained on $100\%$ of the data).

6 Discussion

We have introduced the algorithm Dynamo, which maps a set of neural network base models to a low-dimensional feature space. Our results show that the model embeddings provided by Dynamo capture relevant computational features of the base models. Moreover, the model embedding spaces produced by Dynamo are sufficiently smooth that model averaging can be performed, and model extrapolation can be used to reach new models with better performance than any base model. In our experiments where the base models were trained on the sentiment analysis task, the model embedding space describes a space of line attractors which vary smoothly in the parameter $\theta$.

We have demonstrated that Dynamo can be broadly applied to neural networks that have a dynamical structure; for example, we used the demarcation of layers of convolutional neural networks as a proxy for a dynamical time variable. This also suggests possible scientific applications of Dynamo to dynamical systems arising in nature. A present limitation of Dynamo is the need for all base models to have the same input structure. For example, one cannot presently utilize Dynamo to compare language models trained with different encodings (character-based vs. word-based, for example).

Acknowledgments

We would like to thank Will Allen, Semon Rezchikov, and Krishna Shenoy for valuable discussions. JC is supported by a Junior Fellowship from the Harvard Society of Fellows, as well as in part by the Department of Energy under grant DE-SC0007870. FH is supported by the Fannie and John Hertz Foundation.

Appendix A Meta-models for CNNs

Our setup in Section 2.2 above can be readily adapted to CNNs, or feedforward neural networks more broadly. For instance, in the case of ResNets, there is a single input $x_{0}$ followed by a sequence of blocks which operate on different numbers of channels. We let the meta-model likewise be a ResNet with the same block structure. Moreover, we consider the output $b_{t}$ of the $t$th block in place of the $G_{n}(h_{t})$’s in equation 1, and take the hidden states between the $t$th and $(t+1)$th blocks to be the $h_{t}$’s in equation 2. Since each block of a ResNet is distinct, we consider a family of maps $V_{n,t}$ which depend on the base model $n$ and the block layer $t$. Letting $B$ denote the total number of blocks, we can more explicitly write

[TABLE]

Then the total loss function accounting for all $N$ of the ResNet base models is

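$$\mathcal{L}^{\text{CNN}}[\widetilde{F},\{\theta_{n}\}_{n=1}^{N},\{\{V_{n,t}\}_{t=1}^{B}\}_{n=1}^{N}]:=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{x_{0}\sim\mathcal{D}_{n}}\Big[\mathcal{L}^{\text{CNN}}_{n,\text{hidden}}[\widetilde{F},\theta_{n},\{V_{n,t}\}_{t=1}^{B}]+\lambda\,\mathcal{L}^{\text{CNN}}_{n,\text{output}}[\widetilde{F},\theta_{n}]\Big]$$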

where we note that in the expectation value we are only sampling over $x_{0}$ ’s in $\mathcal{D}_{n}$ since $x_{0}$ ’s are the only form of input data.

Appendix B Additional Experimental Details

In this Appendix, we provide further details on our experiments as well as additional empirical results.

B.1 Meta-Model Architectures

RNN architecture. We parameterize the meta-model for RNNs as a GRU that takes the model embedding vector as an additional input at each time step. Specifically, the meta-model GRU takes as input the vector $[\theta\,;\,x_{t}]$ at each time step, where $x_{t}$ is an input token embedding and $[\,\cdot\,\,;\,\cdot\,]$ denotes concatenation. The model embedding vector $\theta$ therefore serves as a time-independent bias for the meta-model.
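The claim that $\theta$ acts as a time-independent bias follows from linearity: a weight matrix applied to $[\theta\,;\,x_{t}]$ splits into a term that depends on $x_{t}$ and a term that depends only on $\theta$. The sketch below checks this numerically for a single input weight matrix; the dimensions other than the embedding dimension of 16 are arbitrary, and the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_theta, d_x, d_h = 16, 8, 32
W_in = rng.standard_normal((d_h, d_theta + d_x))   # input weights of one gate
theta = rng.standard_normal(d_theta)
x_t = rng.standard_normal(d_x)

# Split the input weights into the columns acting on theta and on x_t.
W_theta, W_x = W_in[:, :d_theta], W_in[:, d_theta:]

concatenated = W_in @ np.concatenate([theta, x_t])
bias_form = W_x @ x_t + W_theta @ theta            # W_theta @ theta is fixed in t
```

In a GRU each gate has its own input weight matrix, so $\theta$ contributes a separate constant offset to each gate's pre-activation.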

CNN architecture. We parameterize the meta-model for CNNs with a modified ResNet architecture. In each residual block, we use a linear transformation $W$ to map the model embedding vector $\theta$ to the channel dimension of the corresponding convolutional layer. We then add the vector $W\theta$ to the channels at each spatial location of the feature map. This design emulates the approach used in our parametrization of the RNN meta-model, with the model embedding $\theta$ serving as a bias term in each residual block. We reuse the weight matrix $W$ for all residual blocks with the same channel dimension. See Algorithm 2, which outlines our treatment of a residual block.
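The channel-wise bias can be illustrated with a short NumPy sketch. The function name `add_embedding_bias`, the channels-first layout `(C, H, W)`, and the dictionary keyed by channel count are assumptions made for illustration; where exactly the addition occurs inside the residual block is specified by Algorithm 2 and is not reproduced here.

```python
import numpy as np

def add_embedding_bias(feature_map, W, theta):
    """Add W @ theta to the channels at every spatial location.

    feature_map: array of shape (C, H, W_spatial)
    W:           array of shape (C, d), maps the embedding to channel space
    theta:       array of shape (d,), the model embedding vector
    """
    bias = W @ theta                           # shape (C,)
    return feature_map + bias[:, None, None]   # broadcast over both spatial axes

rng = np.random.default_rng(0)
d = 16
theta = rng.normal(size=d)

# One weight matrix per channel dimension, shared by all blocks of that width.
W_by_channels = {c: rng.normal(scale=0.1, size=(c, d)) for c in (64, 128)}

fmap = rng.normal(size=(64, 8, 8))
out = add_embedding_bias(fmap, W_by_channels[64], theta)
```

Every spatial location receives the same offset, so $\theta$ shifts the features without introducing any spatial structure of its own.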

The standard ResNet architecture consists of a sequence of four layers, with each layer consisting of a sequence of residual blocks. To compute the hidden state loss $\mathcal{L}_{\mathrm{hidden}}$ in Dynamo, we compute distances between the output representations of each of these four layers, averaging over the number of channels and spatial locations in each set of features.
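As a rough sketch of this computation, the function below averages an elementwise distance over channels and spatial locations within each of the four layers, and then averages over layers. Several details are assumptions: the meta-model features are taken to be already mapped into the base model's feature space by the maps $V_{n,t}$, the per-layer values are combined by a plain mean, and the argument `p` (not from the paper) switches between a squared $L^{2}$ distance (`p=2`) and an $L^{1}$ distance (`p=1`).

```python
import numpy as np

def hidden_loss(meta_feats, base_feats, p=2):
    """Average per-layer distance between two lists of (C, H, W) feature maps."""
    per_layer = []
    for m, b in zip(meta_feats, base_feats):
        diff = np.abs(m - b)
        # Mean over channels and spatial locations within this layer.
        per_layer.append(np.mean(diff ** p))
    return float(np.mean(per_layer))

# Hypothetical output shapes for the four layers of a small ResNet.
shapes = [(64, 8, 8), (128, 4, 4), (256, 2, 2), (512, 1, 1)]
base_feats = [np.ones(s) for s in shapes]
meta_feats = [b + 2.0 for b in base_feats]   # every entry is off by exactly 2

loss_l2 = hidden_loss(meta_feats, base_feats, p=2)
loss_l1 = hidden_loss(meta_feats, base_feats, p=1)
```

Averaging within each layer before combining layers keeps the wide early layers, which have far more entries, from dominating the loss.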

B.2 Training Details

Table 1 lists the hyperparameters used for training our RNN base models and meta-models, and Table 2 lists the hyperparameters used for our CNN base models and meta-models. By default, we use a model embedding dimension of $16$ . In the case of visualizing the training trajectories of IMDB sentiment GRUs (Figure 3, second column), we use a model embedding dimension of $32$ due to the relatively larger number of base models.

B.3 Supplementary Empirical Results

Effect of output loss weight. Figure 9 shows the effect of setting the output loss weight $\lambda=0$ for meta-models on RNNs. These plots illustrate that model clustering can be performed on the basis of comparing hidden state dynamics alone. We note that the change in the $\lambda$ hyperparameter results in qualitative changes in the resulting clustering. In particular, the model embeddings for GRUs trained on AG News classification (rightmost column of Figure 9) are much more tightly coupled relative to the GRUs trained on the IMDB dataset when $\lambda=0$ . This indicates that the dynamics implemented by the AG News GRUs are much more similar than those implemented by the IMDB sentiment GRUs.

Clustering with other loss functions. As noted in Section 4.1, the loss functions used to compare hidden states and outputs can be easily changed to better match the characteristics of the base models under consideration. We demonstrate the benefit of this additional flexibility by replacing the $L^{2}$ distance in $\mathcal{L}_{\mathrm{hidden}}$ (equation 6) with the $L^{1}$ distance for purposes of better comparing the intermediate representations of ResNets. This choice is motivated by the observation that the use of the ReLU nonlinearity results in sparse representations, which suggests the use of the $L^{1}$ metric. Figure 10 shows a comparison between these two distance functions in the case of ResNet-34 models trained on CIFAR-100 with “weak” data augmentation (random shifts and horizontal flips) and with “strong” data augmentation (RandAugment). Here, we parameterized the meta-model with a ResNet-50 architecture. The use of the $L^{1}$ distance results in a clearer separation between the two sets of models. This is reflected qualitatively in the distribution of the base model embeddings in model embedding space, and quantitatively in the relative scale of the variance captured by the first principal component.
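The quantitative criterion mentioned above, the share of variance captured by the first principal component, can be computed as in the following sketch. The embeddings are synthetic stand-ins rather than data from the paper, and the function name `first_pc_variance_fraction` is hypothetical.

```python
import numpy as np

def first_pc_variance_fraction(embeddings):
    """Fraction of total variance captured by the first principal component."""
    X = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(X, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[0] / variances.sum())

rng = np.random.default_rng(0)
d = 16                                   # model embedding dimension
offset = np.zeros(d)
offset[0] = 3.0                          # separation between the two groups

# Two groups of 20 synthetic embeddings, e.g. "weak" and "strong" augmentation.
weak = rng.normal(scale=0.3, size=(20, d))
strong = rng.normal(scale=0.3, size=(20, d)) + offset
separated = np.vstack([weak, strong])

# Control: 40 embeddings with no group structure.
mixed = rng.normal(scale=0.3, size=(40, d))

frac_separated = first_pc_variance_fraction(separated)
frac_mixed = first_pc_variance_fraction(mixed)
```

When the two groups are well separated, the direction joining their centroids dominates the variance, so `frac_separated` is much larger than `frac_mixed`.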

B.4 Further investigation of the dynamics of sentiment analysis

Recall from Section 5 that our trained RNNs implemented sentiment analysis via line attractor dynamics, in which input words kick the hidden state in a 'positive' or 'negative' direction along the line attractor according to the valence of the word (i.e., how positive or negative the word is). Figure 11 investigates how the valences assigned to words change as we scan across model embedding space. We first find a fixed point $h^{*}$ with neutral readout $\widetilde{G}(h^{*})\approx 0$. Then, given a $\theta$ (which renders an RNN), we compute $\widetilde{G}(\widetilde{F}(\theta,x,h^{*}))$ for a variety of word inputs $x$. To produce a "score" for the model, we compute

[TABLE]

where the sets of positive words $W_{\text{positive}}$, negative words $W_{\text{negative}}$, and neutral words $W_{\text{neutral}}$ are listed in Table 3. The figure shows that the score noticeably increases as $\theta$ moves toward the model embedding that performs best on the sentiment analysis task. Note that here we are not analyzing context effects, for instance how the string 'not terrible' would be rendered into a net-positive valence.

Appendix C Meta-models and topological conjugacy

In this Appendix we describe the notion of topological conjugacy, explain its relationship to the loss function $\mathcal{L}_{\text{hidden}}$, and speculate on the interpretation of the results from Section 5.

A topological conjugacy (Katok and Hasselblatt, 1997) between two dynamical systems defined by maps $F:X\to X$ and $E:Y\to Y$ is a homeomorphism $\Phi:X\to Y$ satisfying

$$\Phi\circ F=E\circ\Phi\,. \tag{9}$$

Note that the dynamical systems $F$ and $E$ can have distinct domains. The significance of the relationship in equation 9 is that the dynamics obtained from iterated applications of the maps $F$ and $E$ are related to each other by the formula

$$\Phi\circ F^{k}=E^{k}\circ\Phi\quad\text{for all }k\geq 1\,. \tag{10}$$

Thus, if $F$ and $E$ are topologically conjugate, then their iterates are also topologically conjugate and this means that the dynamics are related by a change of variables. Notice that topological conjugacy is an equivalence relation; as such, the transitive property tells us that if $F\sim E$ and $E\sim D$ then $F\sim D$ .
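A standard textbook example, not taken from the paper, makes the definition concrete: the tent map and the logistic map on $[0,1]$ are topologically conjugate via $\Phi(x)=\sin^{2}(\pi x/2)$. The sketch below checks numerically that $\Phi\circ F=E\circ\Phi$ with $F$ the tent map and $E$ the logistic map, and that the same relation holds for a few iterates. All function names are illustrative.

```python
import numpy as np

def tent(x):
    """F: the tent map on [0, 1]."""
    return np.where(x < 0.5, 2.0 * x, 2.0 * (1.0 - x))

def logistic(y):
    """E: the logistic map on [0, 1] with parameter 4."""
    return 4.0 * y * (1.0 - y)

def phi(x):
    """Phi: a homeomorphism of [0, 1] conjugating the tent map to the logistic map."""
    return np.sin(0.5 * np.pi * x) ** 2

def iterate(f, x, k):
    """Apply the map f to x a total of k times."""
    for _ in range(k):
        x = f(x)
    return x

x = np.linspace(0.0, 1.0, 1001)

# One step: Phi(F(x)) should equal E(Phi(x)).
one_step_gap = np.max(np.abs(phi(tent(x)) - logistic(phi(x))))

# Several steps: Phi(F^k(x)) should equal E^k(Phi(x)).
k = 5
k_step_gap = np.max(np.abs(phi(iterate(tent, x, k)) - iterate(logistic, phi(x), k)))
```

Both gaps are at the level of floating-point round-off. The number of iterations is kept small because both maps are chaotic, so round-off errors grow quickly with `k`.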

The notion of topological conjugacy is an important motivation for defining the loss function $\mathcal{L}_{\text{hidden}}$ , which we recall is given by

[TABLE]

where the hidden states $h_{t}$ are dynamics obtained from a base model $F$, the hidden states $h_{\theta,t}$ are dynamics obtained from the meta-model $\widetilde{F}$ with parameter $\theta$, and $V$ is the map from the meta-model hidden states to the base model hidden states. One way of obtaining zero loss is to find a topological conjugacy with map $V$ between the meta-model $\widetilde{F}$ (at fixed $\theta$) and the base model $F$, meaning a relationship of the form

$$V\big(\widetilde{F}(\theta,x,h)\big)=F\big(x,V(h)\big)\quad\text{for all }x\text{ and }h\,.$$

It is convenient to define the notation $F_{x}(h):=F(x,h)$ and $\widetilde{F}_{\theta,x}(h):=\widetilde{F}(\theta,x,h)$ . Then we have

[TABLE]

with $h_{\theta,0}=V^{-1}h_{0}$, so that $\mathcal{L}_{\text{hidden}}[F,\theta,V]=0$ would hold. As depicted in Figure 2, to obtain zero loss it would actually suffice to find a weaker relationship of the form

[TABLE]

The difference here is that $V$ need not be invertible.
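The two situations can be checked numerically with a toy example. Everything below is an illustrative assumption rather than the paper's setup: the base model is a small random tanh RNN, the dependence on $\theta$ is suppressed, and the loss is taken to be the mean squared distance between $V(h_{\theta,t})$ and $h_{t}$ along a trajectory. In the first case the meta-model is built as $V^{-1}\circ F_{x}\circ V$ with an invertible $V$; in the second the meta-model carries extra hidden coordinates and $V$ is a projection, which is not invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T = 4, 3, 20
A = rng.normal(scale=0.5, size=(d_h, d_h))
B = rng.normal(scale=0.5, size=(d_h, d_x))

def F(x, h):
    """Hypothetical base model update."""
    return np.tanh(A @ h + B @ x)

def rollout(step, xs, h0):
    """Hidden states h_1, ..., h_T obtained by iterating `step` from h0."""
    states, h = [], h0
    for x in xs:
        h = step(x, h)
        states.append(h)
    return np.stack(states)

def hidden_loss(V, meta_states, base_states):
    """Mean squared distance between V(meta state) and base state over time."""
    mapped = meta_states @ V.T
    return float(np.mean(np.sum((mapped - base_states) ** 2, axis=1)))

xs = rng.normal(size=(T, d_x))
h0 = rng.normal(size=d_h)
base_states = rollout(F, xs, h0)

# Case 1: invertible V, meta-model conjugate to the base model.
V = np.eye(d_h) + 0.1 * rng.normal(size=(d_h, d_h))
V_inv = np.linalg.inv(V)

def F_meta(x, g):
    return V_inv @ F(x, V @ g)

loss_conjugate = hidden_loss(V, rollout(F_meta, xs, V_inv @ h0), base_states)

# Case 2: non-invertible V, a projection that drops two extra coordinates.
d_extra = 2
P = np.hstack([np.eye(d_h), np.zeros((d_h, d_extra))])

def F_meta_big(x, g):
    # The first d_h coordinates follow F; the extra ones simply decay.
    return np.concatenate([F(x, g[:d_h]), 0.5 * g[d_h:]])

g0 = np.concatenate([h0, rng.normal(size=d_extra)])
loss_projected = hidden_loss(P, rollout(F_meta_big, xs, g0), base_states)
```

Both losses vanish up to round-off. The second case shows that the meta-model may carry degrees of freedom that the base model lacks, as long as $V$ discards them.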

Using the language of topological conjugacy, we can describe a speculative but plausible interpretation of the results of Section 5. In that Section we observed that models from the same cluster had very similar dynamical features and performed similarly to the model average of the cluster. This suggests that for each model $F_{n}$ in the same cluster, we have

[TABLE]

where $\overline{\theta}$ is the centroid of the cluster to which $F_{n}$ belongs. This rests on two assumptions. First, in replacing $\theta_{n}$ with $\overline{\theta}$, we assume both that $\mathcal{L}_{\text{hidden}}$ is small and that $\theta_{n}$ is sufficiently close to $\overline{\theta}$. Second, if we hypothesize that the map $V_{n}$ has an inverse $V_{n}^{-1}$, then $V_{n}$ may provide a topological conjugacy between the base model $F_{n}$ and the meta-model $\widetilde{F}_{\overline{\theta}}$ evaluated at $\overline{\theta}$. If these assumptions hold for all models in the cluster, then the transitivity of topological conjugacy implies that base models belonging to the same cluster are topologically conjugate to one another. This would justify the intuition suggested by Figure 8 that Dynamo clusters models according to commonalities of topological structures of dynamics.
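The transitivity step can be written out explicitly. The short derivation below is a sketch under idealized assumptions: the approximate relations are treated as exact, each $V_{n}$ is taken to be a homeomorphism, and the notation $F_{n,x}(h):=F_{n}(x,h)$ and $\Phi_{nm}$ is introduced here for convenience. Given two base models $F_{n}$ and $F_{m}$ in the same cluster, the composite map $\Phi_{nm}$ conjugates one to the other.

```latex
% Assumed exact conjugacies with the meta-model at the cluster centroid:
%   V_n \circ \widetilde{F}_{\overline{\theta},x} = F_{n,x} \circ V_n ,
%   V_m \circ \widetilde{F}_{\overline{\theta},x} = F_{m,x} \circ V_m .
\begin{align*}
\Phi_{nm} &:= V_m \circ V_n^{-1}, \\
\Phi_{nm} \circ F_{n,x}
  &= V_m \circ V_n^{-1} \circ F_{n,x} \\
  &= V_m \circ \widetilde{F}_{\overline{\theta},x} \circ V_n^{-1}
     && \text{(first conjugacy, rearranged)} \\
  &= F_{m,x} \circ V_m \circ V_n^{-1}
     && \text{(second conjugacy)} \\
  &= F_{m,x} \circ \Phi_{nm}.
\end{align*}
```

Since $\Phi_{nm}$ is a composition of homeomorphisms, it is itself a homeomorphism, so $F_{n}$ and $F_{m}$ are topologically conjugate for every input $x$.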

Bibliography

1. Aitken et al. (2020) Kyle Aitken, Vinay V. Ramasesh, Ankush Garg, Yuan Cao, David Sussillo, and Niru Maheswaranathan. The geometry of integration in text classification RNNs. arXiv preprint arXiv:2010.15114, 2020.
2. Bojanowski et al. (2018) Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks. In International Conference on Machine Learning, 2018.
3. Cubuk et al. (2020) Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2020.
4. Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
5. Fukuda et al. (2017) Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. In Interspeech, 2017.
6. Golub and Sussillo (2018) Matthew D. Golub and David Sussillo. FixedPointFinder: A TensorFlow toolbox for identifying and characterizing fixed points in recurrent neural networks. Journal of Open Source Software, 2018.
7. Ha et al. (2017) David Ha, Andrew Dai, and Quoc V. Le. HyperNetworks. In International Conference on Learning Representations, 2017.
8. Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.