convolutional-neural-network, crossover, fitness-function, genetic-algorithm, genetic-programming, hyper-parameter, multi-layer-perceptron, mutation, no-free-lunch, solution-space, thoughts, transformer

Wandering Thoughts on Self-Optimizing AI

Yesterday, I came across an "ignorant question" from a friend of mine on Twitter.

I just couldn't ignore it and replied in a chain of tweets. Then I realized there were more details and thoughts I wanted to put into it, so here we are… I will rephrase and reconstruct my tweet chain in a more structured form here.

Let's start by comparing Convolutional Neural Networks (CNNs) to Multi-Layer Perceptrons (MLPs). Is a CNN successful at every problem? Of course not. CNNs are computationally different from MLPs, and that difference (local receptive fields and weight sharing) gives them a computational and learning advantage over plain fully-connected networks. Without the necessary computational paradigm, it would be very difficult to mimic the functionality of a black box. Also, there is no ultimate learning algorithm; see the "no free lunch" theorem. Even if such an algorithm existed, in the context of this discussion it would have to rewrite itself to gain new functionality. By "gaining new functionality" I mean changing its computational paradigm: going from an MLP to a CNN, making a more dramatic jump to "Transformer" models, or moving to simpler ones such as Genetic Algorithms (GAs).
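
To make that "computational advantage" concrete, here is a quick back-of-the-envelope comparison (the layer sizes are my own illustrative picks, not something from the original thread): a fully connected layer learns a separate weight for every input value, while a convolutional layer shares one small kernel across all spatial positions.

```julia
# Weight counts for a single layer over a 224×224×3 input (illustrative sizes):
# a dense (MLP) layer mapping the flattened image to 128 units,
# vs. a 3×3 convolution with 128 filters (biases ignored in both).
dense_weights = 224 * 224 * 3 * 128   # ≈ 19.3 million weights, no sharing
conv_weights  = 3 * 3 * 3 * 128       # 3,456 weights, reused at every position
println((dense = dense_weights, conv = conv_weights))
```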

Genetic Programming (GP) is about exactly this kind of dramatic search. Basically, it starts with a random set of programs that may or may not work at all and evaluates them with a fitness function. It has no oversight over the evolution process other than the fitness function. At this stage the fitness function is one of the most critical parts: if you can't measure fitness, it does not matter how well your crossover and mutation operators work or how well your hyper-parameters are set. This kind of search looks possible with GP, but only because the generated programs run on a Turing-machine equivalent, i.e., a computer. At this point other concerns come into play, such as how expressive the language is and whether it allows GP to search the solution space efficiently.

Let's sit back and reconsider where we are. There is a problem, and some learning algorithms (MLP, CNN, etc.) are trying to figure it out; they are implemented in the language that GP uses to generate programs. Therefore, their loss function partially becomes the fitness function of the programs that GP generates at each generation. I say partially because "some learning algorithm" may or may not be an ensemble of learning algorithms. (Yes, it gets dirtier and more problematic as we go deeper… The composition of the ensemble is another problem altogether, but I will discuss it at the end.) The point is that the loss function of one component feeds the other parts of the optimizers that generate it.

Let's go back to GP and the language it uses to generate those computational paradigms. The expressivity of the language is critical, because if you tried to make a CNN by flipping bits, it would take a long time. I mean really long. I say bit flipping because if a GA could pick its own data representation, bit flipping would be its favorite because of operability (machine learning humor…). However, bit representations are not expressive, and it may take forever to search for a program (or you set termination conditions, which is another detail of the Evolutionary Computation field; see Monte Carlo vs. Las Vegas algorithms). For that reason, the language GP uses cannot be too complicated or too simple. For instance, see AI Programmer, an awesome demonstration of a GA searching the solution space of programs written in the "Brainf.ck" language. That language is simple enough to evolve; however, it may not be advanced enough to develop a convolutional net (it's open to discussion…).

However, the concept of "solution space" is relative, so let me elaborate. Consider a terrain with mountains. If you have a car, you have no choice but to go around the mountains, wherever your car can go; that is your solution space when you have a car. If you had a helicopter, you could go almost anywhere; that is the solution space of the helicopter. I said "almost", and that is critical: with a car you can't climb a mountain, with a helicopter you can't land in a narrow valley, but if you were just a human, you could go anywhere in the solution space, although it would take far too much time and energy. Being human is like bit flipping, highly precise; a simpler language is like the car; and an evolving AI language like TensorFlow, PyTorch, or similar is like the helicopter. Therefore, the language that GP uses should itself be evolving, preferably in a progressive way (because if not, one would have to evolve a language and its compiler too! A meta-GP! Sounds exhausting already).
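
Since these terms keep coming up, here is a minimal GA sketch to pin them down. Everything in it (the bit-string encoding, `TARGET`, `fitness`, `crossover`, `mutate`, `evolve`) is my own toy construction for illustration, not a recipe for evolving real programs:

```julia
# Toy GA: evolve a bit string toward a target pattern using a fitness function,
# single-point crossover, bit-flip mutation, truncation selection, and elitism.

const TARGET = rand(Bool, 32)              # an arbitrary target "program" encoding

fitness(bits) = count(bits .== TARGET)     # higher is better

function crossover(a, b)
    point = rand(1:length(a))              # single-point crossover
    vcat(a[1:point], b[point+1:end])
end

mutate(bits; rate = 0.02) = [rand() < rate ? !b : b for b in bits]

function evolve(; popsize = 50, generations = 200, elite = 2)
    pop = [rand(Bool, length(TARGET)) for _ in 1:popsize]
    for _ in 1:generations
        ranked = sort(pop; by = fitness, rev = true)
        ranked[1] == TARGET && return ranked[1]          # termination condition
        parents = ranked[1:popsize ÷ 2]                  # truncation selection
        children = [mutate(crossover(rand(parents), rand(parents)))
                    for _ in 1:(popsize - elite)]
        pop = vcat(ranked[1:elite], children)            # elitism: keep the best as-is
    end
    return sort(pop; by = fitness, rev = true)[1]
end

best = evolve()
println("best fitness: ", fitness(best), " / ", length(TARGET))
```

The whole search is steered by nothing but `fitness`; swap in a fitness function you cannot measure well and no amount of clever crossover or mutation will save it, which is exactly the point above.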

It is easy to say, but just deciding on "elitism" in a GA/GP (I use the terms interchangeably for the common concepts) is problematic on its own, let alone deciding which piece of code to keep for the next generation… Since that will be problematic, the options are to evolve a new language (a meta-GP), to evolve the syntax of an existing language (elitism and all the other hyper-parameter arrangements), or to build libraries (again, elitism and customizing the evolution process). These may look really brilliant, but wait, not so fast… Evolution isn't kind to its population. If you consider evolving an existing language, that may mean losing some parts of the language entirely at some point in the evolution, and the GA may need to re-evolve them to achieve its task in the meantime. The same concern applies to building libraries during the evolution process. Why would evolving a language from scratch be better? It is not. But it is compatible with the evolutionary process: when a language is no longer sufficient for producing the necessary computational paradigms, it will simply be replaced by new generations. However, evolving a language is just another way of moving the complexity of evolving an existing language and building a library into the process of designing a new language. These problems are not removed; they are just swept under the rug. As one can see, there is a chain of intertwined optimization tasks. My take would be to build a library by playing with some form of elitism, or by evolving the syntax (to be honest, evolving the syntax seems more elegant).

If we look back:

  1. GP evolves computational paradigms for neural architectures (they do not have to be neural, but I use the term generally); therefore, the fitness function of GP is directly related to the performance of the neural architectures.
  2. GP uses a language to evolve its computational paradigms; therefore, if the neural net succeeds, both GP and the language it used succeed together. Thus, the evolved language's (by evolved syntax) fitness function depends on the entire process described above. This means that the entire process has to run for every candidate language being evolved, and that is a really big search problem (see the sketch below).
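
To make the nesting in these two points explicit, here is a toy, heavily simplified sketch. The "language" is reduced to a single made-up expressivity knob, and `train_loss`, `generate_architecture`, and `language_fitness` are stand-ins I invented purely to show how the inner loss feeds the outer fitness functions and why every candidate language forces a full inner run:

```julia
struct Language            # a candidate language, reduced to one expressivity knob
    expressivity::Float64
end

# Innermost level: "training" an architecture; lower loss is better.
train_loss(arch) = abs(arch - 1.0) + 0.1 * randn()^2

# Middle level: GP generating architectures; a more expressive language
# samples closer to the (unknown) optimum, so it tends to yield lower losses.
generate_architecture(lang::Language) = 1.0 + randn() / (1 + lang.expressivity)

# A language's fitness is the best architecture it managed to produce,
# so evaluating one language means running the whole inner process.
language_fitness(lang; n_arch = 50) =
    minimum(train_loss(generate_architecture(lang)) for _ in 1:n_arch)

# Outermost level: a meta-GP over languages (here just mutation + truncation selection).
function evolve_languages(; popsize = 20, generations = 10)
    langs = [Language(rand()) for _ in 1:popsize]
    for _ in 1:generations
        ranked = sort(langs; by = language_fitness)      # full inner run per candidate
        survivors = ranked[1:popsize ÷ 2]
        children = [Language(abs(l.expressivity + 0.3 * randn())) for l in survivors]
        langs = vcat(survivors, children)
    end
    return sort(langs; by = language_fitness)[1]
end

best_lang = evolve_languages()
println("best expressivity found: ", round(best_lang.expressivity; digits = 2))
```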

Discussion

Some will say there are other ways of creating programs; the "Transformer" architecture and its descendants are successful. Yes, they are, exactly. However, these giant models are starving for data. Let's assume we went that route. Then we no longer have the GP that evolves computational paradigms, nor the meta-GP that creates languages; instead, we have a transformer. What was that transformer trained on? A huge amount of existing code that we ("we" as humans in general) already know exists and have created. Therefore, it may succeed in forming novel compositions; however, it will provide solutions around the distribution of the sample set it was given. For instance, if your problem isn't solvable by existing solutions or paradigms, then your transformer will be unaware of that. That is the bias that we as humans have generated. Evolutionary algorithms enforce diversity in the population. I can't say that is perfect, but it won't be static like a pre-trained transformer model; it will keep changing continuously, and in my opinion it may have a better chance of discovering novel approaches than a transformer model. If someone with more in-depth knowledge elaborates further, this would make for a very nice discussion.

Conclusion

To make a long story short, I can say that the only way is to turn everything into an optimization loop of various algorithms, but all of them come with handicaps. I see no silver bullet for the problem. Then again, there could be one, since we somehow exist in the middle of the universe, on a piece of rock at the bottom of an ocean of gas, haha :D.

This text is not meant as a guideline by any academic or industrial metric; I am just writing down my thoughts around an idea. There may be technical details I could not remember at this time, but I will refer back to this post and share them in future posts.

automatic-differentiation, julialang, projects

A Little Update for AutoDiff.jl

Recently, I blogged about Automatic Differentiation (AD) and my motives for developing an AD library to experiment with. Even though I only recently made it public, I have been investigating other Julia packages for AD, especially how they define their differentiation rules. If a Julia package isn't designed for a very specific task, then it probably consists of multiple sub-packages that support its functionality. The Julia language has an ecosystem where packages are designed as multiple sub-packages, each handling part of the requirements. This has two positive effects on the Julia package ecosystem: one may find a package for various needs, and every project contributes its sub-packages back, which is good for the open-source ecosystem (there are nice tips and guides on the Julia Blog).

What is my point? The point is that some of the earliest AD libraries in Julia use their own sets of differentiation rules. They also provide ways to define custom differentiation rules and to bypass the automatic differentiation process. That is good; however, knowing that the behavior of the AD process may change every time one switches libraries isn't very intuitive. On the other hand, it is not easy to make one library work with another, because the focus of the libraries varies; some focus on limiting features and functionality to gain speed.

Since I have learned more about Julia's ecosystem, I have started digging into the collections maintained by organizations on GitHub. JuliaDiff is one of them. It is really nice to be able to find most of the packages developed for a given purpose under the same roof. There are multiple Julia packages for AD, and there are also very well documented packages that do AD tasks. However, they seem to opt out of modularity. I have been looking into DiffRules.jl for a while. It has a very nice set of differentiation rules, and also ways to define custom rules. Most importantly, it returns symbolic derivatives for operations, which makes things extendable… Some of the AD packages I have mentioned so far (without giving names) are also modular (in their own way…), but they do not use the already existing DiffRules.jl package, which only causes the same differentiation rules to be re-defined every time someone develops a new library (for example, I couldn't use a very popular one because it wasn't extendable). I believe DiffRules is generic enough: since it returns the derivative expressions symbolically, any library could build its own rules on top of them or transform them at package initialization (as I did this time). However, most did not…
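
To show what I mean by symbolic rules, here is a small example of querying DiffRules.jl through its documented `diffrule`, `hasdiffrule`, and `diffrules` functions (the exact set of rules and printed expressions may differ between versions):

```julia
using DiffRules

# The rule for sin(x) comes back as a plain Julia expression, not a number:
DiffRules.diffrule(:Base, :sin, :x)          # :(cos(x))

# The variable can itself be any expression, which is what makes the rules reusable:
DiffRules.diffrule(:Base, :sin, :(w[i]))     # :(cos(w[i]))

# Check whether a rule exists, and enumerate everything that is registered:
DiffRules.hasdiffrule(:Base, :sin, 1)        # true
for (M, f, arity) in DiffRules.diffrules()
    println(M, ".", f, " (", arity, "-arg)") # e.g. Base.sin (1-arg), Base.^ (2-arg), …
end
```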

When I published the AutoDiff.jl library, I had only re-defined a small subset of functions (because this was just an experimental project of mine). However, in order to be functional, more functions needed to be defined. As one can guess, I did not define them by hand… because there is DiffRules.jl. Instead, I followed a kind of meta-programming approach: I was able to generate AutoDiff.jl-compatible rules out of every rule defined in DiffRules.jl. This means that AutoDiff.jl is now potentially much more capable (as much as DiffRules.jl allows, at least). Interestingly, only 64 lines of code were enough for it to maintain its existing functionality and expand on it…
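
For the curious, the general shape of such a bridge looks roughly like the sketch below. To be clear, this is not the actual AutoDiff.jl code, and the `Dual` type here is a made-up placeholder rather than AutoDiff.jl's own type; it only illustrates the `@eval`-over-`DiffRules.diffrules()` pattern:

```julia
using DiffRules

# NOTE: illustrative sketch only; `Dual` and the way the derivative is attached
# are placeholders, not the types or rules used in AutoDiff.jl.
struct Dual
    value::Float64
    deriv::Float64
end

for (M, f, arity) in DiffRules.diffrules()
    (M == :Base && arity == 1) || continue          # keep the sketch to unary Base functions
    df = DiffRules.diffrule(M, f, :(d.value))       # symbolic derivative evaluated at d.value
    @eval function $M.$f(d::Dual)                   # extend the original function for Dual
        Dual($M.$f(d.value), $df * d.deriv)         # chain rule: f'(x) * dx
    end
end

# Example: sin now works on Dual without a hand-written rule.
d = Dual(0.5, 1.0)
println(sin(d))                                     # ≈ Dual(0.479, 0.877), i.e. (sin(0.5), cos(0.5))
```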

The latest changes have been merged into the master branch, since I have not observed any loss of functionality.

automatic-differentiation, julialang, projects

Automatic Differentiation

Automatic Differentiation (AD) has become a very popular need with the rise of machine learning applications. Since my interests are in Machine Learning (ML), I will judge AD from that perspective. Many frameworks have their own way of calculating gradients. Some calculate them automatically but restrict users to the control structures and operations they provide, and some allow users to define derivatives themselves. However, both approaches are restrictive. The dream is to have the language handle differentiation like any other arithmetic operation. That wasn't possible, because it wasn't the main concern, until now.

In recent years, Google has considered the Swift language as the base to migrate its framework, TensorFlow, to. One of the main reasons is that Swift is JIT-compiled by LLVM and is a mainstream language for developers (I know Swift is used for iOS development, and I don't know which other fields it is used in…). Without judging the choice of language, I would like to point out that TensorFlow is being integrated into the Swift language, which is good and exciting. However, until it is ready for public use, it will be out of reach for individual users, who would have to fix every bug coming their way and implement any feature that does not exist (it may have progressed; my observation was almost a year ago). From that standpoint, if Swift or TensorFlow does not support a feature I would like to use in my research, I must change the Swift or TensorFlow code base. Both of these projects are massive; wandering around with research ideas while having to understand, develop, fix, compile, and test inside such massive frameworks is an almost intractable challenge. Before I jump into the second part, I must mention that I used TensorFlow in my Master's thesis and tried Swift for TensorFlow, but that was the end of it, because Swift for TensorFlow wasn't mature enough, and even some of the basic examples shown in the keynote weren't working at the time. Disclaimer: I think it is a great and exciting project that will, someday, allow ML to move into new horizons with ease; however, I needed a working language that I could change and fix if necessary.

This is where Julia comes onto the stage. Julia is also a language JIT-compiled by LLVM. However, this time the focus is scientific, and every library is written in Julia, so one can fix a bug in a library and run it again. That was like a dream in my search for a base language in which to implement and experiment with my research ideas. The second good thing is that Julia was already working on integrating AD, because it was JIT-compiled from the start and heavily supports meta-programming, which I think is more extensive than what Python provides. I am not denigrating Python; I used it for a long time and did great projects with it (involving meta-programming) that I could never have done in Java or C (not that easily, at least). Julia had many AD libraries when I started using it. Some of them record operations on a tape and recompile it so that it runs fast (and it does, I assure you). However, this eliminates the possibility of differentiating through the loops and conditionals I needed most. I could use it to regenerate the reverse differentiation each time, but that would be slow too. When you want to back-propagate a lot, it starts to matter…

Then I found another library, already integrated into an ML library (I say library on purpose, because Julia is kind of allergic to frameworks; the motto is to have libraries, not frameworks, at least that is the gist I got…). My initial impression was that it was great; I had never been that happy during my short research career. Freedom, functionality, and speed at the same time. One doubts when hearing such a sentence and thinks it is too good to be true. It is true, but at a cost: when everything is a library, you need to optimize your code yourself, since you took your freedom back from the framework that used to speed it up for you… That wasn't a big problem for me, because I am using it for research, the slowdown isn't much, and I can optimize the parts I care about the way I like and speed them up. I must mention that "slow" Julia is going to be way faster than the average "slow" one can imagine… With this, I moved along and learned a lot of things by implementing actual code rather than using the constructs of a framework. Happily creating my own libraries and integrating them with ease, I thought I would never reach the point where I needed to create my own AD library. Well, I reached that point, and yes, I created my own AD library. It was a lot easier than I thought a year ago.

I decided to build one when I watched one of Prof. Edelman's talks, "Automatic Differentiation in 10 minutes with Julia". It was a very quick, unexpected, and functional implementation of forward differentiation using the concept of dual numbers, an extension of the real numbers that allows the derivative of an operation to be calculated simultaneously with the operation itself. It was a very impressive presentation that showed the power of Julia on stage, even though the professor was modest about it. With that inspiration, I created a type called BD (for reverse differentiation); it is not exactly a "Backward Dual" or anything similar, I think of it more as "Backward Derivation". The idea is to both run operations forward and create a function that calculates the reverse differentiation of the operation being run at that time. Therefore, as operations come one after another, they programmatically create a network of linked functions. All I had to do was implement the necessary interfaces for my new type. They were, very basically, type promotion, indexing, broadcasting, arithmetic operations, and some special functions. I haven't implemented them all, but even at this stage it works impressively well (in my opinion, of course). Another disclaimer: I am not a mathematician, or a person whose primary interest is AD. But trust me, I am an engineer…
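
For readers who have not seen the dual-number trick, here is a minimal forward-mode sketch in the spirit of that talk. It is my own toy version written for this post, not Prof. Edelman's code and not the BD type described above:

```julia
# Minimal forward-mode AD with dual numbers: each D carries a value and a
# derivative, and every operation propagates both through the chain rule.
struct D <: Number
    val::Float64     # f(x)
    der::Float64     # f'(x)
end

Base.:+(a::D, b::D) = D(a.val + b.val, a.der + b.der)
Base.:*(a::D, b::D) = D(a.val * b.val, a.der * b.val + a.val * b.der)  # product rule
Base.sin(a::D)      = D(sin(a.val), cos(a.val) * a.der)                # chain rule

# Let plain numbers mix with duals (constants carry a zero derivative).
Base.convert(::Type{D}, x::Real) = D(float(x), 0.0)
Base.promote_rule(::Type{D}, ::Type{<:Real}) = D

# Differentiate f(x) = x * sin(x) + 3 at x = 2 by seeding the derivative with 1.
f(x) = x * sin(x) + 3
x = D(2.0, 1.0)
println(f(x))        # val = f(2) ≈ 4.819, der = sin(2) + 2cos(2) ≈ 0.077
```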

Link to the GitHub page of my AD package
