Neuroevolution with Perceptron Turing Machines
David Landaeta

Natural Computation LLC
Patent pending.
Abstract
We introduce the perceptron Turing machine and show how it can be used to create a system of neuroevolution. Advantages of this approach include automatic scaling of solutions to larger problem sizes, the ability to experiment with hand-coded solutions, and an enhanced potential for understanding evolved solutions. Hand-coded solutions may be implemented in the low-level language of Turing machines, which is the genotype used in neuroevolution, but a high-level language called Lopro is introduced to make the job easier.
1 Introduction
The current most popular method for creating an artificial neural network (ANN) relies on human ingenuity to create the network structure, and backpropagation during training to set values for weights and biases. This approach is effective at specific tasks, but it is generally accepted that biological nervous systems must be doing something very different when they learn new tasks.
Neuroevolution (NE) has been proposed as an alternative (or supplemental) mechanism that is more consistent with known biological processes. NE uses some form of evolutionary algorithm to either create the network structure, or set the values of weights and biases, or do both. The motivation for NE goes beyond the attempt to better understand the biological mechanisms for learning; there are some machine learning tasks for which NE seems to be more appropriate than backpropagation, such as playing games where the fitness of an ANN is only known after a sequence of interactions [6].
We introduce the perceptron Turing machine (PTM), which is a variant of the well-known alternating Turing machine (ATM). We show how to create an NE system by viewing the instructions of a PTM program as the genes of a genotype whose phenotype is an ANN. This NE system simultaneously evolves both the network structure and the connection weight and bias values. According to the NE classification given in Floreano et al. [3], the genotype is developmental, since it encodes a specification for building the ANN rather than encoding the ANN directly. The developmental approach has the advantage of potentially describing a large network with a small amount of code, which is useful in data compression.
2 Alternating Turing Machines
We start with a definition of an ATM that emphasizes the relationship between the ATM model and the uniform circuit model of parallel computation [1]. In the uniform circuit model, a program is given the desired numbers of outputs and inputs and produces a description of a Boolean circuit with that many outputs and inputs. Thus, running a program is a two-phase process: the build phase creates a circuit, and the execution phase feeds inputs to the circuit to produce outputs. This point of view is helpful, because the PTM model simply replaces a specification for building a circuit with a specification for building an ANN. More importantly, this approach provides an automatic way for the circuit (or ANN) to scale up to larger problem sizes.
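An ATM $M$ with $n$ instructions and $k$ tapes is a tuple:
$$M = (Q, \Sigma, q_0, g, p)$$
where:
1. $Q$ is the finite set of states;
2. $\Sigma$ is the tape alphabet;
3. $q_0 \in Q$ is the initial state;
4. $g : Q \to \{\text{and}, \text{or}, \text{true}, \text{false}, \text{read}, \text{read-inverted}\}$ is a function that maps the states of $M$ to the gate types of a circuit, and
5. $p \in I^n$ is the program of $M$, where
$$I = Q \times \Sigma^k \times Q \times \Sigma^k \times D^k$$
is the set of all possible instructions for $M$, and $D = \{-1, 0, +1\}$ are tape head movements: $-1$ (left), $0$ (stay), and $+1$ (right).
A configuration of $M$ is all the information needed for an instantaneous description of the machine: the current state, the contents of all tapes, and the positions of the read-write tape heads. An instruction describes how a configuration $c$ may (nondeterministically) lead to another configuration $c'$ in one time step, which is denoted by $c \to c'$. Suppose
$$(q, s_1, \ldots, s_k, q', s'_1, \ldots, s'_k, d_1, \ldots, d_k) \in I$$
is an instruction in $p$. Then whenever the current state is $q$ and the symbols $s_1, \ldots, s_k$ are scanned on the tapes, we may change to state $q'$, write symbols $s'_1, \ldots, s'_k$ on the tapes, and move the tape heads according to $d_1, \ldots, d_k$.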
Any state $q$ with $g(q) \in \{\text{read}, \text{read-inverted}\}$ is considered to be a final state, meaning that it is not possible to transition out of a configuration in state $q$ regardless of any instructions to the contrary in $p$. We refer to such a configuration as an input configuration.
We imagine the input as an array of bits in random access memory, where each bit is directly addressable. Some of the tapes of $M$ may be designated as input index tapes. Suppose there is exactly one such tape; then $M$ reads the input bit at position $i$ by writing $i$ in binary on the input index tape and transitioning to a state $q$ with $g(q) = \text{read}$. At that point, $M$ automatically transitions to a state $q'$ with $g(q') = \text{true}$ if the input bit is 1, or it transitions to a state $q'$ with $g(q') = \text{false}$ if the input bit is 0. Replacing the gate type read with read-inverted in this scenario causes $M$ to read the inverse of the input bit, so that $M$ ends up in a state $q'$ with $g(q') = \text{false}$ if the input bit is 1, or it ends up in a state $q'$ with $g(q') = \text{true}$ if the input bit is 0. Note that these automatic transitions override any relevant instructions in $p$.
If there are multiple input index tapes, then they are viewed as specifying the coordinates of the input bit within a multidimensional array, but apart from that, the steps required to read an input bit (or its inverse) are the same.
There may be no designated input index tapes, in which case the input is viewed as a zero-dimensional array; in other words, the input is a single bit.
Some of the tapes of $M$ may be designated as output index tapes, which are used to create the effect of producing a multidimensional array of output bits. In order to produce the output bit with coordinates $(i_1, \ldots, i_j)$, $M$ is started in an initial configuration having state $q_0$, the values $i_1, \ldots, i_j$ written in binary on the respective output index tapes, all other tapes having empty contents, and all tape heads at their leftmost positions. We refer to such a configuration as an output configuration.
We view $M$ as a specification for building a circuit by identifying the configurations of $M$ with the gates of the circuit. The gate type is given by applying $g$ to the current state in the configuration. For any two configurations $c$ and $c'$, an input of the gate for $c$ is linked to the output of the gate for $c'$ if and only if $c \to c'$. A gate with type true has no inputs and a constant output value of true. Similarly, a gate with type false has no inputs and a constant output value of false. A gate with a type in $\{\text{and}, \text{or}\}$ may have any number of inputs, even zero, in which case the gate acts as if its type is false; otherwise the gate performs the logical function indicated by its type.
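The gate semantics just described can be sketched in a few lines of Python. This is only an illustration: the function name `eval_gate` and the string labels for gate types are assumptions made here, not part of the ATM definition.

```python
def eval_gate(gate_type, inputs):
    """Evaluate one gate from the Boolean outputs of its input gates.

    gate_type is one of "true", "false", "and", "or" (labels assumed
    for this sketch); inputs is a list of Boolean values.
    """
    if gate_type == "true":
        return True
    if gate_type == "false":
        return False
    if not inputs:
        # An and-gate or or-gate with zero inputs acts as if its type is false.
        return False
    if gate_type == "and":
        return all(inputs)
    if gate_type == "or":
        return any(inputs)
    raise ValueError(f"unknown gate type: {gate_type}")
```

The explicit zero-input check matters for the and-gate, because Python's `all([])` would otherwise return `True`.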
Some practical considerations for the build phase need to be addressed. First, we need to put limits on the lengths of all tapes; otherwise, the build might go on forever. These limits on the input and output index tapes specify the sizes of the input and output arrays, and on other tapes they specify the amount of additional work space we expect the computation to require. We don’t require all tapes to have the same size, but we do require the sizes to be fixed. This means that the tape contents can’t really be empty, so instead we use a string of zeros as the default tape contents.
The computation needs to know when the end of a tape has been reached, and we achieve this by designating one tape symbol to be an endmark. We only need one endmark, because we consider the tape to be circular, so that moving right from the endmark automatically positions the tape head at the first symbol on the tape. We allow moving left from the endmark to access the last tape symbol in one step. We consider the tape head positioned at the endmark to be the default (leftmost) position. We ignore any tape symbol overwrites in instructions of $p$ that would change an endmark to a non-endmark or a non-endmark to an endmark.
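A minimal Python sketch of such a circular tape follows. The class name `CircularTape` and its methods are hypothetical, and we assume that moving right from the last symbol returns the head to the endmark, which is what circularity suggests.

```python
class CircularTape:
    """Fixed-size tape of bits with a single endmark cell (illustrative sketch)."""

    def __init__(self, num_bits):
        self.bits = [0] * num_bits  # default contents: a string of zeros
        self.pos = -1               # -1 means the head is on the endmark

    def at_end(self):
        return self.pos == -1

    def move_right(self):
        # From the endmark we reach the first symbol; moving right from
        # the last symbol wraps around to the endmark (assumed).
        self.pos += 1
        if self.pos == len(self.bits):
            self.pos = -1

    def move_left(self):
        # From the endmark we reach the last symbol in one step.
        if self.pos == -1:
            self.pos = len(self.bits) - 1
        else:
            self.pos -= 1

    def write(self, bit):
        # Overwrites at the endmark are ignored, so the endmark is never lost.
        if not self.at_end():
            self.bits[self.pos] = bit
```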
Second, we need to eliminate the possibility of cycles in the circuit. We do this by building the circuit in depth-first order, from output configurations to input configurations, which allows us to detect cycles as we go. If adding a link between two gates would create a cycle, then we simply don’t add the link. Note that this is the only reason why we defined the ATM program to be an ordered list of instructions: the decision on which link to eliminate in order to break a cycle might depend on the order of instructions in the program. Apart from this, an unordered set of instructions would do just as well. However, we will see below in the PTM model that it is even more important for the program to be an ordered list of instructions, since the number of occurrences of a given instruction in the program has an effect on the weight and bias values in the ANN that is produced.
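The depth-first build with cycle avoidance can be sketched as follows. This is not the actual build procedure, only one plausible reading of it: `build_links` and the `successors` callback are hypothetical names, and configurations are assumed to be hashable values. A link is dropped exactly when its target is still on the current depth-first path, since that is the only way a new link can close a cycle.

```python
def build_links(outputs, successors):
    """Depth-first build that skips any link that would close a cycle.

    outputs: iterable of output configurations to start from.
    successors: function returning, in program order, the configurations
        that a given configuration may lead to in one step (hypothetical).
    Returns a dict mapping each visited configuration to its kept links.
    """
    links = {}
    on_path = set()  # configurations on the current depth-first path

    def visit(c):
        if c in links:
            return  # already built
        links[c] = []
        on_path.add(c)
        for nxt in successors(c):
            if nxt in on_path:
                continue  # link c -> nxt would create a cycle, so drop it
            links[c].append(nxt)
            visit(nxt)
        on_path.remove(c)

    for out in outputs:
        visit(out)
    return links
```

Because successors are visited in program order, reordering the instructions can change which link of a cycle is dropped.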
Further optional resource restrictions should be considered, such as limits on the total number of gates in the circuit, the depth of the circuit, or the fanout of the circuit, which is the maximum number of inputs on any gate. If a resource limit is exceeded, then we can either consider it to be a non-fatal error, meaning that the build is ended, but the circuit built to that point is retained and used for execution, or we can consider it to be fatal, in which case an exception is raised.
3 Perceptron Turing Machines
Our PTM model is a variant of the ATM model that replaces its specification for building a Boolean circuit with a specification for building an ANN. This is done in a straightforward way: the identification of a configuration with a logic gate becomes an identification of a configuration with a perceptron, which we hereafter refer to as a node.
A PTM $M$ with $n$ instructions and $k$ tapes is a tuple:
$$M = (Q, \Sigma, q_{\text{out}}, q_{\text{in}}, p)$$
where:
1. $Q$ is the finite set of states;
2. $\Sigma$ is the tape alphabet;
3. $q_{\text{out}} \in Q$ is the output state;
4. $q_{\text{in}} \in Q$ is the input state, and
5. $p \in I^n$ is the program of $M$, where
$$I = Q \times \Sigma^k \times Q \times \Sigma^k \times D^k \times \{-1, 0, 1\}^2$$
is the set of all possible instructions for $M$, with $D$ the set of tape head movements as in the ATM model.
This definition drops the explicit gate type designation of the ATM model, because an ANN only needs to distinguish between three types of nodes: output, input, and hidden. A configuration of $M$ corresponds to an output node if its state is $q_{\text{out}}$; it corresponds to an input node if its state is $q_{\text{in}}$; otherwise, it corresponds to a hidden node. State $q_{\text{out}}$ acts the same as the initial state of an ATM, with the same relationship to output index tapes, and state $q_{\text{in}}$ acts like an ATM state with a gate type of read, with the same relationship to input index tapes.
An instruction is interpreted the same as an ATM instruction, except that it now has an additional pair of integers $(\Delta w, \Delta b)$ taken from the set $\{-1, 0, 1\}$. We say that $\Delta w$ is the differential weight and $\Delta b$ is the differential bias associated with the instruction. For any two configurations $c, c'$, if there are any instructions in $p$ that would create a link $c \to c'$, then we add up all the differential weights associated with those instructions in order to find the actual weight associated with the link. We find the bias value associated with the node corresponding to a configuration $c$ by adding up all differential bias values of any instructions in $p$ that would create a link of the form $c \to c'$ for some configuration $c'$, with the default bias being zero if no such $c'$ exists.
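The summation rule for weights and biases can be sketched in Python. The flattened list of link events used here is a hypothetical intermediate representation, assumed only for illustration: one tuple per instruction occurrence that creates a link during the build.

```python
from collections import defaultdict


def accumulate(link_events):
    """Sum differential weights per link and differential biases per node.

    link_events: iterable of (c, c_next, dw, db) tuples, one for every
        instruction in the program that creates the link c -> c_next
        (a hypothetical view of the build phase).
    Returns (weights, biases) as plain dicts.
    """
    weights = defaultdict(int)  # (c, c_next) -> actual link weight
    biases = defaultdict(int)   # c -> actual node bias
    for c, c_next, dw, db in link_events:
        weights[(c, c_next)] += dw
        biases[c] += db
    return dict(weights), dict(biases)
```

A node that never appears as the source of a link has no entry in `biases`, which corresponds to the default bias of zero. Because repeated occurrences of the same instruction each contribute to the sums, the program must be a list rather than a set.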
As an example, consider how the node for a configuration $c$ might behave like an and-gate with inputs coming from nodes for configurations $c_1$ and $c_2$. Of course, there must exist a sub-list $s$ of instructions in $p$ that allow $c \to c_1$ and $c \to c_2$, and there must not exist any other instructions in $p$ that would allow a link of the form $c \to c'$ for some configuration $c'$, but we need to determine the differential weight and differential bias values associated with the instructions in $s$ in order to create and-like behavior. Assume we define the output of a perceptron in terms of its inputs using the following standard function.
$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0, \\ 0 & \text{otherwise,} \end{cases}$$
where $\mathbf{x}$ is the vector of real-valued inputs, $\mathbf{w}$ is the vector of real-valued weights, $b$ is the real-valued bias, and $\cdot$ denotes the dot product. Then, to create and-like behavior, it suffices to have exactly four instructions in $s$, two of which allow $c \to c_1$, and the other two allowing $c \to c_2$, where all four have a differential weight of $1$, and exactly three of the four have a differential bias of $-1$.
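The and-gate example can be checked numerically. The sketch below rests on several assumptions that are ours rather than confirmed details: the perceptron is a step function that outputs 1 when the weighted sum plus bias is positive, each differential weight is +1, three differential biases are -1, and the fourth is 0. Under these assumptions each link has weight 2 and the node has bias -3.

```python
def perceptron(x, w, b):
    """Step-function perceptron: 1 if w . x + b > 0, else 0 (assumed form)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Two instructions per link, each with differential weight +1 (assumed),
# give a weight of 2 on each link; differential biases -1, -1, -1, 0
# (assumed) sum to a node bias of -3.
AND_WEIGHTS = (2, 2)
AND_BIAS = -3

truth_table = {
    (x1, x2): perceptron((x1, x2), AND_WEIGHTS, AND_BIAS)
    for x1 in (0, 1)
    for x2 in (0, 1)
}
```

With these values the weighted sum is 1 when both inputs are 1 and at most -1 otherwise, so the node fires only for the input (1, 1).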
4 Genetic Operators
The program $p$ of a PTM can be used as the genotype for a system of neuroevolution, where the instructions in $p$ are the genes. Since $p$ is both a program and a genotype, the system is not just an example of an evolutionary algorithm, but it is also an instance of genetic programming (GP). Moreover, the programming language used in this GP system is clearly Turing complete [2], because the PTM model is a generalization of the Turing machine model.
But what would be wrong with using the ATM model for our GP genotype rather than the PTM model? The big problem with this approach is that it becomes almost impossible for evolution, starting from a random population, to converge to anything but a trivial program—one that ignores its input and produces constant output. For, in order for an ATM program to be non-trivial, it must have at least one instruction that applies to an output configuration. Because such instructions are necessary for high fitness, they quickly spread throughout the population, causing the typical member to have several distinct instructions that apply to the output configuration. Thus, the output configuration typically corresponds to a gate with several inputs. But the output configuration must represent either an and-gate or an or-gate. If it is an and-gate with several inputs, then it is extremely likely to always output false. Similarly, an or-gate with several inputs is extremely likely to always output true. As a result, it is almost a certainty that the population will converge to a trivial program regardless of how training is performed.
By contrast, an output configuration in the PTM model corresponds to a perceptron, which is just as likely to output false as it is to output true, assuming the population starts out randomly. Moreover, the PTM model is much more robust under genetics than the ATM model, meaning that the phenotype is not likely to be radically altered when the normal genetic operators are applied to the genotype. This is due to the fact that a PTM instruction only modifies the network weights and biases in a small incremental fashion. Rather than, say, an and-gate changing to an or-gate in a single generation using ATM, we might have a perceptron smoothly changing from and-like behavior to or-like behavior over many generations using PTM.
In order to fully leverage this property of the PTM genotype, we should use at least one genetic operator that reorders genes within a genotype. For example, an inversion operator [4] would satisfy this requirement. When combined with crossover and a fine-grained mutation operator—one that can change only an individual component of an instruction, like the differential weight or differential bias—this creates the effect of smoothly searching the space of all weights and biases between nodes.
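As a rough illustration of these two operators, the following Python sketch applies inversion and a fine-grained mutation to a genotype represented as a list of genes. The representation of a gene as a dict with `dw` and `db` fields, the function names, and the set of allowed values are all assumptions made for this sketch.

```python
import random


def invert(genotype, rng):
    """Inversion: reverse the order of genes in a random contiguous segment."""
    if len(genotype) < 2:
        return list(genotype)
    i, j = sorted(rng.sample(range(len(genotype) + 1), 2))
    return genotype[:i] + genotype[i:j][::-1] + genotype[j:]


def mutate_weight(genotype, rng, choices=(-1, 0, 1)):
    """Fine-grained mutation: change only the differential weight of one gene.

    Each gene is modelled as a dict with "dw" and "db" fields; the field
    names and the allowed values are assumptions for illustration.
    """
    child = [dict(g) for g in genotype]  # copy so the parent is untouched
    if child:
        gene = rng.choice(child)
        gene["dw"] = rng.choice(choices)
    return child
```

Inversion leaves the multiset of genes unchanged and only alters their order, while the mutation touches a single component of a single gene.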
5 Lopro: A High-level Language for PTM Programming
One of the advantages of using the PTM model for neuroevolution is that it is actually an instance of GP. This makes the evolved genotype more than just a black box—it is a human-readable computer program. Granted, the programming language is very low-level, but it is based on the well-established language for programming Turing machines, so there is reasonable hope that an expert will be able to understand how the evolved ANN makes decisions.
But we can also turn this around: the fact that this is GP means that an expert can hand-code a genotype to implement a known solution to a problem. This has multiple advantages, including:
1. It allows the creation of sanity checks on our neuroevolution system: we can test using a problem with a known solution, and seed the initial population of genotypes with that known solution. If our process of evolution destroys the solution, then we know we have a bug somewhere in the system.
2. It allows us to seed the initial population with one or more known partial solutions to a target problem, in the hope of giving evolution a head start on producing a better solution.
3. Even in cases where we have no idea what the solution is, we can use our experience with hand-coded PTM programs to estimate the resource requirements (number of states, number of tapes, etc.) for an evolved solution to the target problem.
These tasks are made easier by the fact that we can create a high-level programming language—which doesn’t require an expert on Turing machines to understand—and implement a translator to change the high-level code into our low-level PTM code. We have done exactly that, and we call our high-level language Lopro, which is an acronym for “logic programming.” It uses the fact that the ATM (and PTM) model is very good at describing solutions in terms of first order logic.
Lopro is currently implemented as a collection of class, function, and macro definitions in C++ following the C++11 standard. It leverages powerful features of C++, including operator overloading and lambda expressions, in order to simplify the notation. Figure 1 shows a simple example program, which determines if there exists any bit in the input that is set to true. A more complex example is given by Figure 2, which computes the transitive closure of a given directed graph, where the input and output graphs are specified by their adjacency matrices.
The heart of Lopro is the Machine class, which embodies the PTM itself. It is a container for Tape and State objects as well as instructions. A Machine starts out in definition mode, during which tapes, states, and instructions are added to it. The various “New…” methods add tapes or states and return objects that are used to refer to what was just added. All such Lopro objects wrap a smart pointer to the originating Machine plus a unique integer identifier for the object, so it is always safe and efficient to copy such objects by value. The Build method ends definition mode, builds the ANN according to the definition, and puts the Machine in execution mode, during which we can feed inputs and receive outputs from the built ANN.
Every Tape object has an immutable “end” property, which is the integer specified in the argument to the NewTape method of Machine. This is how we limit the size of the tape, but rather than directly specifying the number of bits the tape can hold, the end value is one greater than the largest unsigned integer that we expect to be contained on the tape in binary. This is because the tape typically contains an array index, so it is normally most convenient to express the end value as the size of the corresponding array. Thus, the NewTape method computes the appropriate limit on the number of bits the tape can hold based on the specified end value. However, we must be aware that, because we allow any bit on the tape to be overwritten using the tape’s Head object, it may be possible (if the end value is not an exact power of two) to write a value on the tape that is greater than or equal to the end value. Fortunately, this condition is easy to detect and correct, as can be seen in the Exists function in Figure 1.
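The relationship between the end value and the tape size can be illustrated in Python. Whether NewTape computes the limit exactly this way is an assumption; the function names below are hypothetical.

```python
def tape_bits(end):
    """Number of bits needed to hold every unsigned integer below `end`."""
    if end < 1:
        raise ValueError("end must be at least 1")
    return (end - 1).bit_length()


def can_overflow(end):
    """True if some bit pattern on the tape encodes a value >= end."""
    return (1 << tape_bits(end)) > end
```

For example, an end value of 5 needs 3 bits, which can also encode the out-of-range values 5, 6, and 7, whereas an end value of 8 uses the same 3 bits with no out-of-range patterns.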
The input state is added with the NewInputState method of Machine, which takes a variable-length argument list denoting the input index tapes. Similarly, the NewOutputState method adds the output state and specifies the output index tapes. All other states are added with the NewState method, which takes no arguments.
The assignment operator is overloaded in the State class to provide a convenient way of creating instructions. An instruction in Lopro is a high-level analog of a PTM instruction. Like the latter, a Lopro instruction has a precondition on the machine configuration that must be satisfied in order for the instruction to apply, and it has an action that changes the configuration when the instruction is applied. Using the overloaded assignment operator, an instruction has the general form:
precondition_expression = action_expression;
The precondition expression may be just a State object, meaning that the precondition is satisfied whenever the machine is in that state regardless of tape contents or head positions. We would use a when clause—a usage of the WHEN macro—in order to add constraints on the tape contents or head positions, like so:
from_state WHEN (conditional_expression) = action_expression;
The conditional expression intentionally has the appearance of a Boolean expression in terms of Tape and Head objects, but the actual result type of the expression is an internal Conditional class, which can only be used in a when clause.
We can reproduce exactly the low-level PTM preconditions on tape contents and head positions by using the Head object, which is accessed using the immutable “head” property of the Tape object. Operators are overloaded in the Head class so that it behaves like an iterator for an array of bits. For example, the conditional expression:
*tape_head == 1
is satisfied whenever the Head object is scanning the symbol 1. The is_end accessor method of the Head class is used to determine if the head is currently scanning the endmark:
tape_head.is_end()
Operators have been overloaded in the Conditional class so that such objects behave like Boolean values. Thus, the conditional expression:
*tape_head == 1 or tape_head.is_end()
is satisfied whenever the Head object is scanning either the symbol 1 or the endmark.
On the other hand, we can create high-level preconditions by using the fact that operators have been overloaded in the Tape class so that Tape objects behave like unsigned integers. For example, in the Exists function of Figure 1, we see the conditional expression:
test_tape >= test_tape.end()
which is satisfied whenever the contents of test_tape is greater than or equal to the end value of the tape. We use the convention that the left-most bits on the tape (those accessed by moving the head right from the endmark) are the low-order bits of the unsigned integer corresponding to the tape contents.
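The bit-order convention can be made concrete with a short Python sketch, in which a tape is modelled as a list of bits starting with the cell immediately right of the endmark. The helper names are hypothetical and not part of Lopro.

```python
def tape_to_int(bits):
    """Interpret tape cells as an unsigned integer, low-order bit first."""
    return sum(bit << i for i, bit in enumerate(bits))


def int_to_tape(value, num_bits):
    """Write `value` onto a tape of `num_bits` cells, low-order bit first."""
    return [(value >> i) & 1 for i in range(num_bits)]
```

Under this convention the tape contents `[0, 1, 1]` represent the integer 6, not 3.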
If multiple instructions have a precondition expression mentioning the same State object $q$, then, for any given configuration $c$ having state $q$, only the first such instruction whose precondition is satisfied by $c$ will have its action applied to $c$. The last such instruction typically has no when clause, so that it acts as a catch-all. It is obviously a good idea to group all such instructions together, effectively creating a single if-then-else statement.
The action expression of an instruction may be just a State object, meaning the configuration is changed to the specified state, but there is no change to tape contents or head positions. We would use an after clause—a usage of the AFTER macro—in order to specify changes to tape contents or head positions, like so:
precondition_expression = to_state AFTER { action_statements... };
The action statements may be low-level—using Head objects as iterators, or high-level—using Tape objects as unsigned integers, or it may be some combination of the two. Incrementing a Head object has the effect of moving the head right, and decrementing the Head object moves the head left.
Operators are overloaded for action expressions so that they behave like Boolean expressions. This gives us a high-level mechanism for specifying the differential weight and differential bias values of the corresponding PTM instructions. For example, the Lopro instruction:
from_state = to_state_1 and to_state_2;
translates to two PTM instructions $i_1$ and $i_2$, creating network links of the form $c \to c_1$ and $c \to c_2$, respectively, such that configuration $c$ (in state from_state) corresponds to a perceptron behaving like an and-gate with inputs coming from $c_1$ (in state to_state_1) and $c_2$ (in state to_state_2).
A subtlety of Lopro instruction definition is that the conditional expressions and action statements are not actually executed until the Build method is invoked. The WHEN and AFTER macros are hiding lambda expressions (capturing all variables by value) that make this possible. This is notationally very convenient, but it presents a technical problem for the use of Lopro objects as local variables within sub-functions: we can’t rely on the constructors and destructors of these local objects to provide appropriately timed initialization and cleanup, since the associated instructions are invoked outside of the lifetimes of the local variables.
This is the motivation for the Scope class. Every Machine object maintains a stack of Scope objects, which initially contains a default scope representing the main function. A Scope instantiation in a sub-function like the following:
lopro::Scope scope(machine);
pushes scope onto the top of the stack maintained by machine, and the corresponding destructor call for scope pops it off the stack. Whenever any “New…” method is invoked, and whenever the “head” property of a Tape is accessed, the Machine records that event and associates it with the Scope object at the top of the stack. During the build phase, the Machine can then detect when an action is taking place that changes the scope, and in that case it automatically invokes appropriate initialization or cleanup operations on the affected Lopro objects. The net result is that we can always rely on the following when we define instructions:
1. Tape contents initially have all bits set to 0, except for output index tapes, which are initialized to the appropriate coordinate of the output bit array.
2. Heads are initially scanning the endmark.
In addition, Tape objects associated with a sub-function scope are recycled if at all possible, which decreases the number of tapes used by the machine. This is important, because the maximum number of nodes in the ANN increases exponentially with the number of tapes.
There are two very desirable properties of Lopro programming. The first is that we can treat State objects as if they are predicates in a system of first order logic in which the variables stand for Tape objects, which can be treated as unsigned integers. From this point of view, a Lopro program simply defines an output predicate logically in terms of an input predicate. As an added bonus, this interpretation applies to sub-functions as well, resulting in a modular system of logic programming.
Note that the Exists function of Figure 1 defines the existential quantifier for this system of first order logic. As an exercise for the reader, write a function to define the corresponding universal quantifier. Hint: change two lines in the body of the Exists function, and rename the function to All. If you think you only need to change one line in the body, then think again.
The second desirable property of Lopro programming is that it naturally lends itself to efficient parallel computation. A good example is provided by the solution to the transitive closure problem shown in Figure 2. This solution is essentially a proof that the transitive closure problem is in the parallel complexity class $NC^2$, which is the class of all problems solvable by uniform circuits of size $n^{O(1)}$ and depth $O(\log^2 n)$, and is accepted as a conservative definition for the complexity class representing efficient parallel computation [5]. Such solutions arise naturally from the fact that an operation that scans over a tape completely in one direction requires only logarithmic depth, and many problems, like this one, require only a logarithmic number of such operations on any path from an output node to an input node.
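To see where the logarithmic number of operations comes from, consider the standard repeated-squaring method for transitive closure, sketched below in Python. This is not the Lopro program of Figure 2, only a sequential illustration of the same idea: each squaring doubles the path length covered, so about $\log_2 n$ squarings suffice, and each squaring is an or of ands that a circuit can evaluate in logarithmic depth.

```python
def transitive_closure(adj):
    """Transitive closure of a directed graph by repeated Boolean squaring.

    adj: square adjacency matrix as a list of lists of 0/1 or bool values.
    Returns a matrix of bools where entry [i][j] is True if there is a
    path of length at least 1 from node i to node j.
    """
    n = len(adj)
    reach = [[bool(v) for v in row] for row in adj]
    steps = max(1, (n - 1).bit_length())  # about log2(n) squarings
    for _ in range(steps):
        # After t squarings, reach covers all paths of length up to 2**t.
        reach = [
            [reach[i][j] or any(reach[i][k] and reach[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)
        ]
    return reach
```

For a three-node chain 0 → 1 → 2, two squarings are performed and the result records that node 2 is reachable from node 0.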
6 Conclusion
Using perceptron Turing machines for neuroevolution has the following advantages.
1. Both the network structure and the connection weight and bias values are found through evolution.
2. A large network can be described with a small genotype.
3. Solutions automatically scale up to larger problem sizes.
4. The genotype is a human-readable computer program.
5. The programming language used by the genotype is Turing complete.
6. The genotype is robust, which allows the solution space to be explored by evolution in a smooth fashion.
7. We have the ability to experiment with hand-coded solutions in both low-level code and high-level code.
References
[1] Balcázar JL, Diaz J, Gabarró J. Structural Complexity II, Springer-Verlag, Berlin, 1990.
[2] Banzhaf W, Nordin P, Keller RE, Francone FD. Genetic Programming: An Introduction, Morgan Kaufmann, San Francisco, 1998.
[3] Floreano D, Dürr P, Mattiussi C. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47–62, 2008.
[4] Goldberg D. Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[5] Papadimitriou CH. Computational Complexity, Addison-Wesley, New York, 1994.
[6] Risi S, Togelius J. Neuroevolution in games: State of the art and open challenges. IEEE Transactions on Computational Intelligence and AI in Games, 9(1):25–41, 2017.
