convolutional-neural-network, crossover, fitness-function, genetic-algorithm, genetic-programming, hyper-parameter, multi-layer-perceptron, mutation, no-free-lunch, solution-space, thoughts, transformer

Wandering Thoughts on Self-Optimizing AI

Yesterday I came across an “ignorant question” posed by a friend of mine on Twitter, as follows.

And I just couldn’t ignore it, so I replied in a chain of tweets. Then I realized there were more details and thoughts I wanted to put into it, so here we are… I will rephrase and reconstruct my tweet chain in a more structured form here.

Let’s start by comparing Convolutional Neural Networks (CNNs) to Multi-Layer Perceptrons (MLPs). Is a CNN successful at every problem? Of course not. CNNs are computationally different from MLPs: weight sharing and local connectivity give them a computational and learning advantage over plain MLPs on spatially structured data. Without the necessary computational paradigms, it would be very difficult to mimic the functionality of a black box. Also, there is no ultimate learning algorithm; see the “no free lunch” theorem. Even if such an algorithm existed, in the context of our discussion it would have to rewrite itself to gain new functionality. By “gain new functionality” I mean changing its computational paradigm: going from an MLP to a CNN, making a more dramatic jump to “Transformer” models, or moving to simpler ones such as Genetic Algorithms (GAs).
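To make the “computationally different” point concrete, here is a minimal sketch of why a convolutional layer and a dense layer are different paradigms: the conv layer’s shared kernels make its parameter count independent of the input size. The layer shapes below are made-up illustrative numbers, not tied to any specific model.

```python
# Rough parameter-count comparison for a single layer on a 28x28 grayscale
# image (illustrative numbers only).

def mlp_layer_params(inputs: int, outputs: int) -> int:
    """Fully connected layer: every input connects to every output."""
    return inputs * outputs + outputs  # weights + biases

def conv_layer_params(kernel: int, in_ch: int, out_ch: int) -> int:
    """Convolutional layer: small shared kernels, independent of image size."""
    return kernel * kernel * in_ch * out_ch + out_ch  # weights + biases

dense = mlp_layer_params(28 * 28, 64)  # a 784 -> 64 dense layer
conv = conv_layer_params(3, 1, 64)     # 64 filters of 3x3 over 1 channel

print(dense)  # 50240
print(conv)   # 640
```

Same number of output feature maps, nearly two orders of magnitude fewer parameters: that structural prior is exactly the kind of “computational paradigm” a self-optimizing system would have to discover.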

Genetic Programming (GP) is about this kind of dramatic search. Basically, it starts with a random set of programs that may or may not work at all, and evaluates them according to a fitness function. It has no oversight over the evolution process other than the fitness function. At this stage, the fitness function is one of the most critical parts, because if you can’t measure fitness, it does not matter how well your crossover and mutation operators work or how your hyper-parameters are set. This looks possible with GP, but only because the generated programs run on a Turing-machine-equivalent computer. At this point other concerns come in, such as: how expressive must the language be to allow GP to efficiently search the solution space?

Let’s sit back and reconsider where we are now. There is a problem, and some learning algorithms (MLP, CNN, etc.) are trying to figure it out. Somehow, they are implemented in a language that GP uses to generate programs. Therefore, their loss function partially becomes the fitness function of the programs that GP generates at each generation. I say partially because “some learning algorithm” may or may not be an ensemble of learning algorithms. (Yeah, it gets dirty and more problematic as we go deeper… The composition of the ensemble is another problem altogether, but I will discuss it at the end.) The point is that the loss function of one of them feeds the other parts of the optimizers that generate them.

Let’s go back to GP and the language it uses to generate those computational paradigms. The expressivity of the language itself is critical, because if you tried to make a CNN by flipping bits it would take a long time. I mean really long. I say bit flipping because if a GA could pick a data representation, bit flipping would be its favorite because of operability (machine-learning humor…).
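The loop described above (random population, fitness evaluation, selection, crossover, mutation) can be sketched in a few lines. This is a toy: the “program” is just a list of digits and the fitness function counts matches against a target, but the structure is the same one GP uses, and it shows how everything hinges on the fitness function being measurable.

```python
import random

random.seed(0)

TARGET = [3, 1, 4, 1, 5, 9, 2, 6]  # toy stand-in for "a program that works"

def fitness(genome):
    # The fitness function is the only oversight evolution gets:
    # here, how many positions match the target.
    return sum(g == t for g, t in zip(genome, TARGET))

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.1):
    return [random.randrange(10) if random.random() < rate else g
            for g in genome]

pop = [[random.randrange(10) for _ in TARGET] for _ in range(50)]
for gen in range(200):
    pop.sort(key=fitness, reverse=True)
    if fitness(pop[0]) == len(TARGET):
        break
    parents = pop[:10]  # truncation selection: keep only the fittest
    pop = [mutate(crossover(random.choice(parents), random.choice(parents)))
           for _ in range(50)]

best = max(pop, key=fitness)
print(fitness(best))
```

Note there is nothing neural here; swap `fitness` for “train this candidate architecture and return its validation score” and you have the setup discussed in this post, with all of its cost.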
However, bit representations are not expressive, and it may take forever to search for a program… (Or you may set termination conditions, which is another detail from the Evolutionary Computation field; see Monte Carlo vs. Las Vegas algorithms.) For that reason, the language that GP uses cannot be too complicated or too simple. For instance, see AI Programmer, an awesome demonstration of a GA searching the solution space of programs written in the “Brainf.ck” language. That language is simple enough to evolve. However, it may not be advanced enough to develop a convolutional net (it’s open to discussion…). The concept of “solution space” is relative, though; let me elaborate. Consider a terrain with mountains. If you have a car, you have no choice but to go around the mountains, wherever your car can go. This is your solution space when you have a car. If you had a helicopter, however, you could go almost anywhere; that’s the solution space of the helicopter. I said “almost”, and that is critical: with a car you can’t climb a mountain, and with a helicopter you can’t land in a narrow valley, but if you were just a human on foot you could go anywhere in the solution space, though it would take too much time and energy. Being human is like bit flipping, highly precise; a simpler language is like the car; and an evolving AI language like TensorFlow, PyTorch, or the like is like the helicopter. Therefore, the language that GP uses should itself be evolving, preferably in a progressive way (because if not, one would have to evolve a language and its compiler too!… A meta-GP! Sounds exhausting already).
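The car/helicopter/on-foot analogy maps directly onto mutation granularity. A hypothetical sketch, with a made-up token set and encoding: the same “program” can be mutated one bit at a time (precise, tiny steps, like walking) or one high-level token at a time (big jumps, like flying). Neither operator checks that the result still decodes to something valid; that gap is exactly the expressivity problem.

```python
import random

random.seed(1)

# Made-up vocabulary of high-level "layer" tokens for illustration.
TOKENS = ["conv", "pool", "relu", "dense", "norm"]

def mutate_bit(bits: str) -> str:
    """Flip a single bit of the low-level encoding (a tiny, precise step).
    Note: the result may decode to an invalid token index; validity is
    not checked in this sketch."""
    i = random.randrange(len(bits))
    return bits[:i] + str(1 - int(bits[i])) + bits[i + 1:]

def mutate_token(program: list) -> list:
    """Replace a whole token (a big jump in the solution space)."""
    out = list(program)
    out[random.randrange(len(out))] = random.choice(TOKENS)
    return out

program = ["dense", "relu", "dense"]
bits = "".join(f"{TOKENS.index(t):03b}" for t in program)  # 3 bits per token

print(mutate_bit(bits))       # one bit moved: walking
print(mutate_token(program))  # one layer swapped: flying
```

The bit-level operator can in principle reach every encoding (like the human on foot), but almost all of its steps are wasted; the token-level operator moves fast but can only land on points its vocabulary can express (like the helicopter).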

It is easy to say, but just deciding on “elitism” in a GA/GP (I use the terms interchangeably for common concepts) is problematic on its own, let alone deciding which piece of code to keep for the next generation… So the options are: evolve a new language (meta-GP), evolve the existing language’s syntax (elitism and all the other hyper-parameter arrangements), or build libraries (again, elitism and customizing the evolution process). These may look really brilliant, but wait, not so fast… Evolution isn’t kind to its population. If you evolve an existing language, that may mean losing some parts of the language entirely at some point in the evolution, and the GA may need to re-evolve them to accomplish its task in the meantime. This concern also applies to building libraries during the evolution process. Why would evolving a language from scratch be better? It is not. But it is compatible with the evolutionary process: when a language is no longer sufficient for producing the necessary computational paradigms, it will simply be replaced by new generations. However, evolving a language is just another way of moving the complexity of evolving an existing language and building a library into the process of designing a new language. These problems are not removed; they are just swept under the rug. As one can see, there is a chain of intertwined optimization tasks. My take would be to build a library while playing with some form of elitism, or to evolve the syntax (to be honest, evolving syntax seems more elegant).
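For readers unfamiliar with the term: elitism just means copying the top-k individuals into the next generation unchanged, so the best solution found so far can never be lost. A minimal sketch of one generation with elitism (the fitness function and mutation operator are supplied by the caller):

```python
import random

random.seed(2)

def evolve_step(pop, fitness, elite_k=2, mut=lambda x: x):
    """One generation with elitism: the top elite_k survive verbatim;
    the rest of the next generation are mutated copies of fit parents."""
    ranked = sorted(pop, key=fitness, reverse=True)
    elites = ranked[:elite_k]  # kept unchanged, so best fitness never drops
    children = [mut(random.choice(ranked[:10]))
                for _ in range(len(pop) - elite_k)]
    return elites + children

# Toy usage: genomes are lists of ints, fitness is their sum.
pop = [[random.randrange(5) for _ in range(4)] for _ in range(20)]
noisy = lambda g: [x + random.choice([-1, 1]) for x in g]
new_pop = evolve_step(pop, sum, elite_k=2, mut=noisy)
```

The catch the post points at: `elite_k` is itself a hyper-parameter, and “which piece of code counts as an elite” is far murkier when individuals are programs or language fragments rather than flat genomes.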

If we look back:

  1. GP evolves computational paradigms for neural architectures (they do not have to be neural, but I use the term generally); therefore, the fitness function of GP is directly related to the performance of the neural architectures.
  2. GP uses a language to evolve its computational paradigms; therefore, if the neural net succeeds, then both GP and the language it used succeed together. Thus, the evolved language’s (via evolved syntax) fitness function depends on the entire process described above. This means the entire process has to run for every candidate language being evolved. And this is a really big search problem.
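The two points above describe three nested optimization loops. A toy, fully made-up sketch of that nesting: a “language” is reduced to a single mutation step size, a “program” to a number, and “training” to distance from a hidden target. Every name here is a placeholder; the point is only that each outer evaluation requires running the entire loop beneath it.

```python
import random

random.seed(3)

TARGET = 42.0  # stand-in for "the task the final network must solve"

def train_and_eval(program: float) -> float:
    """Inner loop stand-in: the score of one evolved 'architecture'."""
    return -abs(program - TARGET)

def run_gp(step: float, gens=30, size=20) -> list:
    """Middle loop: GP evolving programs using the language's step size."""
    population = [random.uniform(0, 100) for _ in range(size)]
    for _ in range(gens):
        best = max(population, key=train_and_eval)
        population = [best] + [best + random.uniform(-step, step)
                               for _ in range(size - 1)]
    return population

def evolve_languages(steps: list, meta_gens=5) -> float:
    """Outer loop (meta-GP): a language's fitness is the best downstream
    performance the whole pipeline achieves with it -- so every candidate
    language pays for a full GP run plus all its trainings."""
    best_step = steps[0]
    for _ in range(meta_gens):
        scored = sorted(
            ((max(train_and_eval(p) for p in run_gp(s)), s) for s in steps),
            reverse=True)
        best_step = scored[0][1]
        steps = [best_step] + [abs(best_step + random.uniform(-1, 1))
                               for _ in range(len(steps) - 1)]
    return best_step

winner = evolve_languages([0.5, 5.0, 50.0])
```

Even in this toy, evaluating one candidate “language” costs 30 generations × 20 “trainings”; with real architectures each of those trainings is itself a full learning run, which is why the post calls this a really big search problem.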

Discussion

Some will say there are other ways of creating programs: the “Transformer” architecture and its descendants are successful. Yes, they are, exactly. However, these giant models are starving for data. Let’s assume we went that route. Then we no longer have the GP that evolves computational paradigms, nor the meta-GP that creates languages; instead we have a transformer. What did that transformer train on? A huge amount of existing code that we (“we” as humans in general) already know exists and have created. Therefore, it may succeed in forming novel compositions, but it will be providing solutions around the distribution of the sample set given to it. For instance, if your problem isn’t solvable by existing solutions or paradigms, then your transformer will be unaware of that. That’s the bias that we as humans generated. Evolutionary algorithms enforce diversity in the population. I can’t say this is perfect, but it won’t be static like a pre-trained transformer model; it will keep changing continuously, and in my opinion it may have a better chance of discovering novel approaches than a transformer model. If someone with more in-depth knowledge elaborates further, this would be a very nice discussion.

Conclusion

To make a long story short, the only way I see is to turn everything into an optimization loop of various algorithms, but all of them come with handicaps. I see no silver bullet for the problem. Then again, there could be one, since we exist in the middle of the universe on a piece of rock at the bottom of an ocean of gas haha :D.

This text is not a guideline by any academic or industrial metric; I am just writing my thoughts around an idea. There could be technical details I cannot remember at this time, but I will refer back to this post and share those in following posts.
