A Java implementation: a genetic algorithm to develop and test neural networks with parallel training.

Robert Hildreth
Published in CodeX
Jul 25, 2021


This is too big a topic to cover fully in a single post, so I must assume familiarity with neural networks, very basic genetic algorithms, and Java as an object-oriented language while I focus more closely on the implementation.

The idea behind the project is to develop several randomized neural networks, train each of them in parallel for the shortest amount of time possible, evaluate their fitness, and proceed into the next generation. The project collects the winners from each generation, scored by a ‘trainability’ factor, and presents them at the end of the run. From that point, it is easy to isolate any of the winners and continue training it until desirable convergence. So far, this has proven more beneficial than running and testing networks by hand.

To demonstrate the usage of the construct I’ve dubbed ‘Transient’, I will first introduce it as a small system with a lot of functionality, if not yet a potent contender to my other implementations of neural networks. The system splits network instances across parallel workers and computes I/O and training on any or all of the available networks. That implies a lot of functionality: do you ask one network about one thing? One network about many things? Many networks about one thing? Many about many? So I built in most access patterns, in both parallel and linear fashions, for both querying and training.
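To make two of those access patterns concrete, here is a minimal sketch using Java’s parallel streams. The names `askAll` and `askEach` are mine, not the project’s, and a network is modelled as a plain `Function` for illustration:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: a "network" is any Function from input to output.
// These helpers cover two of the access patterns described above.
public class AccessPatterns {

    // Many networks, one query: every network answers the same input in parallel.
    static <I, O> List<O> askAll(List<Function<I, O>> nets, I input) {
        return nets.parallelStream()
                   .map(net -> net.apply(input))
                   .collect(Collectors.toList());   // collect preserves encounter order
    }

    // One network, many queries: a single network answers each input in parallel.
    static <I, O> List<O> askEach(Function<I, O> net, List<I> inputs) {
        return inputs.parallelStream()
                     .map(net)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Function<Double, Double>> nets = List.of(x -> x + 1, x -> 2 * x, x -> x * x);
        System.out.println(askAll(nets, 3.0));                    // [4.0, 6.0, 9.0]
        System.out.println(askEach(x -> -x, List.of(1.0, 2.0)));  // [-1.0, -2.0]
    }
}
```

Note that `collect` on an ordered parallel stream still returns results in encounter order, even though the work itself completes in an unspecified order.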

Indexing matters! I primarily use streams in this implementation, and I can make no promises about the order of completion of each network per input instance. For this specific reason, I index each Matrix output to match the index of the network that generated the decision. Again, it is very likely that the stream of answers received will not arrive in the order of the networks that generated them. Use your indices, please.
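A minimal sketch of what that indexing can look like. The record and method names here are my own assumptions, not the project’s: each answer carries the index of the network that produced it, and the result is re-sorted by that index after the parallel run.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch: tag each parallel answer with the index of the network that made
// it, then restore network order by sorting on that index.
public class IndexedAnswers {
    record IndexedAnswer(int networkIndex, double output) {}

    static List<IndexedAnswer> ask(List<Function<Double, Double>> nets, double input) {
        return IntStream.range(0, nets.size())
                        .parallel()   // completion order is unspecified...
                        .mapToObj(i -> new IndexedAnswer(i, nets.get(i).apply(input)))
                        .sorted(Comparator.comparingInt(IndexedAnswer::networkIndex))
                        .collect(Collectors.toList());  // ...so we re-sort by index
    }

    public static void main(String[] args) {
        List<Function<Double, Double>> nets = List.of(x -> x + 1, x -> x * 10);
        System.out.println(ask(nets, 2.0));  // answers back in network order
    }
}
```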

The primary question here is how we line training data up for each network. Is it possible to guide parallel networks through a forward pass, then conduct a parallel backward pass? I chose to forgo such a method in favor of a simple fit() function. By providing the query and answer at the same time, each parallel network can test and train consecutively while hosted on a constant thread, instead of requesting that a network be transferred to another worker thread for backpropagation. How, then, does one ensure that the network observes the example before beginning backpropagation with the answer? By leaning on a somewhat overlooked guarantee of the language: objects listed in an initializer are each initialized in the order of index, i.e. from left to right. By packaging the call as an array of Index(ed)Matrix objects, I ensure the order of calculation of each field independently, one before the other, on the same thread. E.g.

return new IM[]{ proc1.start(input), proc2.start(target) };

first evaluates proc1, which produces the network’s decision matrix. This is followed in turn by proc2, which corrects the network and returns an ‘influence’ matrix. Listed expressions like these, separated by commas, hide several commands in a single line. One example is the equivalence of these two loops:

// Traverses the 2D plane in two loops
for (int y = 0; y < 5; y++) {
    for (int x = 0; x < 5; x++) {
        System.out.println("x,y: " + x + "," + y);
    }
}

// Traverses the 2D plane in one loop
for (int x = 0, y = 0; y < 5; x++, y += (x == 5 ? 1 : 0), x %= 5) {
    System.out.println("x,y: " + x + "," + y);
}

In the single-loop version above, the update expressions run at the end of each iteration and are processed left to right: first x is incremented by 1; then y is incremented by 1 if and only if x == 5; finally, x is taken modulo 5. This is how I ensure that the network sees the sample before it trains on the target distribution. Look for ‘lFit’ and ‘pFit’ at lines 234 and 245 below, respectively, as we dive in.
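The guarantee both examples rely on is Java’s left-to-right evaluation of listed expressions, including array initializers. A tiny self-contained check (the names here are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Java evaluates array-initializer expressions left to right, so the
// "forward" call below is guaranteed to complete before the "backward"
// call begins, on the same thread: the ordering the fit() trick depends on.
public class EvalOrder {
    static final List<String> log = new ArrayList<>();

    static int forward()  { log.add("forward");  return 1; }
    static int backward() { log.add("backward"); return 2; }

    public static void main(String[] args) {
        int[] pair = new int[]{ forward(), backward() };  // left to right
        System.out.println(log);  // [forward, backward]
    }
}
```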

To recap: we simply have a neural network designed to run and train on one mini-batch at a time. The output of the construct is a stream of streams: the outer stream has one entry per mini-batch fed in, and each interior stream holds as many answers to that query as there are networks. In this scenario, the mini-batch is represented by a series of input vectors packed into a matrix.

In addition, we provide some functionality for utilizing several networks simultaneously (the build() function generates these given some Traits). This enables what I call ‘shotgun training’, where we develop BAGGING_CONST neural networks and train them all on the same data. I want to emphasize that this method can be applied to any drop-in replacement of a neural network, which is why I will not cover the interior of the network itself in this post.

The construct also has three ways to sort the networks once some training is done. First, they can be ordered by their most recent performance: if fed a full batch job with training data, they measure their performance on it and sort accordingly. Second, they can be ordered by lifetime performance, but using this as a fitness marker can cause problems. How much of a struggle has the network had in its lifetime? I would stray away from using lifetime performance to define the quality of a network for a few reasons. For one, the network may have started in a convenient spot in the loss landscape and done little work in approaching the minimum. Moreover, a high lifetime loss combined with a low loss on the last sample is typically indicative of a very trainable network which, while it was slow to take off (hence the high lifetime error), still made it significantly towards the minimum (low last-target error).
For this reason, I provided a third sorting method: ordering the networks by ‘trainability.’ The trainability score of each network typically falls in [-1, +1], with a lower score indicating greater trainability. Often, successful or potentially successful networks display trainability around or just under -1. The formula I use to calculate this can be found on line 403 in the trainability() method. Next steps for this project might include querying the severity of the deltas in the network to really gauge how each one is fitting, thus putting the more active networks at the front of the list. Until then, I use trainability.
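As a rough illustration of the training-and-ranking step, here is a sketch of shotgun training boiled down: fit every network on the same mini-batch in parallel, then sort by a chosen score. `Net`, `train`, and the deterministic stand-in loss are my own inventions, not the project’s API:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrative sketch of shotgun training: build BAGGING_CONST stand-in
// networks, fit all of them on the same mini-batch in parallel, then rank
// them by most recent performance.
public class Shotgun {
    static final int BAGGING_CONST = 8;

    static class Net {
        final int id;
        double lastLoss = Double.MAX_VALUE;
        Net(int id) { this.id = id; }
        // Stand-in for a real forward + backward pass; deterministic fake loss.
        void fit(double[] x, double[] y) { lastLoss = (id * 37 % 10) / 10.0; }
    }

    static List<Net> train(double[] x, double[] y) {
        List<Net> nets = IntStream.range(0, BAGGING_CONST)
                                  .mapToObj(Net::new)
                                  .collect(Collectors.toList());
        nets.parallelStream().forEach(n -> n.fit(x, y));        // same data for all
        nets.sort(Comparator.comparingDouble(n -> n.lastLoss)); // best recent loss first
        return nets;
    }

    public static void main(String[] args) {
        List<Net> ranked = train(new double[]{1}, new double[]{0});
        System.out.println(ranked.get(0).id);  // the network with the lowest stand-in loss
    }
}
```

Swapping the comparator is all it takes to rank by lifetime performance or trainability instead.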

Departing from the network itself, I will introduce what I call the ‘Farmer.’ The Farmer is the manager of generations of networks. It is what trains each stack of networks for a short period before selecting two winners to pass their genetic code to the next generation through a slight mutation function. While that is ongoing, the Farmer is also mutating and potentially crossing the genomes of the other networks that participated in the run in order to create the next generation. Keep in mind, it is not the primary goal of this method to train the networks to completion. Instead, it is useful in creating several unique networks and testing them against winners from the previous generation, thus automating our ‘shotgun training.’ Also noteworthy is the fact that even the winners of the generations only have their Traits recorded and are reconstructed and re-randomized for the next run. This ensures that any network to survive between several rounds is in fact trainable over the data set, and not just one that has simply seen more training instances than the others. Let’s take a look at the code:
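A rough sketch of a Farmer-style generation step, under assumed names (the real Farmer is in the code below, not here): the two winners pass their genetic code forward unchanged, since only their Traits are recorded and the networks themselves are rebuilt and re-randomized, while the rest of the population is refilled with slightly mutated copies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative generation step: genomes are plain double[] strands here,
// assumed sorted best-first by the previous run's fitness scoring.
public class FarmerSketch {
    static final Random RNG = new Random(42);

    static double[] mutate(double[] dna) {
        double[] child = dna.clone();
        child[RNG.nextInt(child.length)] += RNG.nextGaussian() * 0.1;  // slight mutation
        return child;
    }

    static List<double[]> nextGeneration(List<double[]> scored) {
        List<double[]> next = new ArrayList<>();
        next.add(scored.get(0).clone());  // the two winners' genes survive intact
        next.add(scored.get(1).clone());
        while (next.size() < scored.size()) {
            next.add(mutate(scored.get(RNG.nextInt(scored.size()))));
        }
        return next;  // population size stays constant
    }

    public static void main(String[] args) {
        List<double[]> pop = List.of(new double[]{1, 1}, new double[]{2, 2}, new double[]{3, 3});
        System.out.println(nextGeneration(pop).size());  // 3
    }
}
```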

There is a lot to this project that I have not presented here, so I will touch on some of it briefly now. The DNA class, for instance, simply contains an array of numbers while providing useful methods like mateWith() and copyPasta(). The mateWith() method takes another DNA strand and generates two children from the two parents; this keeps the population steady, unlike more biologically inspired methods. The copyPasta() method copies the strand of DNA while providing slight mutations, and it is also used under the hood by mateWith(). Another important piece is the Trait object, which I build from DNA and use to blueprint the networks. Each Trait is constructed by a TraitFactory based on simple rules for reading the DNA’s data array. These approaches may vary wildly, so I will leave them for you to interpret. My primary focus for this post was to provide a potential methodology for automated ‘shotgun training’ of any drop-in replaceable neural network in Java. Sadly, the current state of things may require you to create your own neural network for now. Happy hunting. :)
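Guessing at the shape of the DNA class from the description above; the real internals are not shown in this post, so treat this strictly as illustration:

```java
import java.util.Random;

// Hypothetical sketch of a DNA strand: an array of numbers plus
// copyPasta() (copy with slight mutation) and mateWith() (two-child
// crossover that keeps the population steady).
public class DnaSketch {
    static final Random RNG = new Random(7);
    final double[] data;

    DnaSketch(double[] data) { this.data = data; }

    // Copy the strand while providing a slight mutation.
    DnaSketch copyPasta() {
        double[] copy = data.clone();
        copy[RNG.nextInt(copy.length)] += RNG.nextGaussian() * 0.05;
        return new DnaSketch(copy);
    }

    // One-point crossover yielding two children; mutation rides along via
    // copyPasta(), mirroring its use "under the hood" in the post.
    DnaSketch[] mateWith(DnaSketch other) {
        int cut = RNG.nextInt(data.length);
        double[] a = data.clone(), b = other.data.clone();
        for (int i = cut; i < data.length; i++) {
            a[i] = other.data[i];
            b[i] = data[i];
        }
        return new DnaSketch[]{ new DnaSketch(a).copyPasta(), new DnaSketch(b).copyPasta() };
    }
}
```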

Thank you for reading.
