Reinforcement learning is so back, baby. OpenAI’s breakthrough ‘O’ models are “trained with large-scale reinforcement learning to reason using chain of thought.” DeepSeek’s even-more-breakthrough R1 was “trained via a 4-stage, RL heavy process” — they first trained a semi-standard large base model V3, and then “the improvement of R1 over V3 is the use of reinforcement learning to improve reasoning performance.”
Even more remarkable is that these improvements replicate on models small enough to run on your phone. It turns out “Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient” … and can be elicited from any model with more than ~1.5B parameters.
Let’s not forget it was RLHF — Reinforcement Learning from Human Feedback — which made modern AI a viable consumer product in the first place, by clamping down on models so they don’t generate alarming / offensive outputs … for better or worse. (This also limited their creativity and intelligence, and RLHF ‘alignment’ is an excellent example of the question “aligned with who, and which values, exactly?” … but while those are valid criticisms, the reality is there’s no way ChatGPT could have gone public without RLHF.)
Serious People now talk about using reinforcement learning for all manner of other kinds of model post-training and optimization. OpenAI’s Deep Research, “trained using end-to-end reinforcement learning,” is but one early example. Elsewhere, RL teaches robots to do perfect backflips. What’s most intriguing of all, to my mind anyway, is that this new RL boom is only the first phase of another, larger, weirder one.
But before we get into that…
What exactly is reinforcement learning anyway?
Reinforcement learning is, at heart, good old trial & error. Take a neural network; give it some options; have it choose one; tell it whether its choice was good or bad; tweak its weights to increase its chance of choosing the good option; repeat a few billion times. If your training algorithm is good enough, if a path to the desired outcome exists, and you have the compute, then your model will find that path … or perhaps another path, one you never expected, desired, or even understand.
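To make that loop concrete, here’s a toy version: a one-layer “policy” choosing among four options, nudged toward whichever choices happen to pay off. (This is a sketch for illustration only; the reward numbers are invented, and real systems use vastly fancier machinery, but the bones are the same.)

```python
import numpy as np

rng = np.random.default_rng(0)

n_options = 4
logits = np.zeros(n_options)                          # the policy's "weights"
reward_prob = np.array([0.1, 0.3, 0.9, 0.2])          # invented: option 2 pays off most often
lr = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(5_000):
    probs = softmax(logits)
    choice = rng.choice(n_options, p=probs)             # have it choose one option
    reward = float(rng.random() < reward_prob[choice])  # was the choice good or bad?
    # Tweak the weights so rewarded choices become more likely next time
    grad = -probs
    grad[choice] += 1.0
    logits += lr * reward * grad

print(softmax(logits))   # most of the probability mass should now sit on the best option
```

That’s the whole trick; everything else in modern RL is machinery for doing this more stably, at scale, with rewards that arrive late and sparsely.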
This is quite different from other forms of machine learning:
Supervised (learning patterns in datasets painstakingly created/labeled by humans)
unsupervised (recognizing patterns and clusters inherent in data)
and self-supervised (a kind of hybrid)
are all, at the end of the day, ways to teach models how to classify input data. Reinforcement learning teaches them how to act.
The RL optimization process is sometimes called hill-climbing; the model’s objective is the top of the mountain, and the crudest version simply keeps climbing the steepest slope it can find. (This may sound like gradient descent, as used in backpropagation; it’s not dissimilar, but the important distinction is that you don’t have to mathematically calculate a gradient, you just need to be able to score the possible actions.) If your RL algorithm is sufficiently subtle and thoughtful, your model will climb as relentlessly as Sisyphus, overcoming countless false summits, discovering hidden crevasses that lead to secret new heights, until it finally reaches the top of its world.
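The crudest version really can be that crude. Here’s a minimal sketch of score-only hill-climbing, with an invented two-dimensional “landscape” (complete with a false summit) standing in for whatever you actually care about; note that no gradient is ever computed, candidate moves are only compared by score.

```python
import numpy as np

rng = np.random.default_rng(1)

def score(params):
    # Invented landscape: the true peak is at (3, -2); a lower false summit sits at (-4, 4).
    true_peak = -np.sum((params - np.array([3.0, -2.0])) ** 2)
    false_summit = -5.0 - np.sum((params - np.array([-4.0, 4.0])) ** 2)
    return max(true_peak, false_summit)

params = rng.normal(size=2)          # start somewhere random on the mountainside
step_size = 0.5

for _ in range(2_000):
    candidate = params + step_size * rng.normal(size=2)   # try a random nearby point
    if score(candidate) >= score(params):                 # did we climb?
        params = candidate                                 # keep the higher ground

print(params, score(params))
```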
Sounds amazing, doesn’t it? But everything is harder than it sounds. Until recently RL (as reinforcement learning is universally known) was kinda 2010s. DeepMind used it for AlphaGo, and abandoned it in favor of the transformer. OpenAI experimented with it early on, too, and likewise largely rejected it … but in a very interesting way.
Let’s talk about unnatural selection
In April 2017, two months before the now-legendary paper “Attention Is All You Need” introduced the transformer, OpenAI published “Evolution Strategies as a Scalable Alternative to Reinforcement Learning,” in which:
We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular RL techniques … we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, [and] tolerant of extremely long horizons.
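Stripped to its essentials, the recipe in that paper is: jitter the parameters with random noise, score each jittered copy, and shift the parameters toward the noise that scored well. Here’s a toy sketch of that loop; the fitness function is invented for illustration, where the real thing would score, say, a policy’s rollouts in an environment.

```python
import numpy as np

rng = np.random.default_rng(2)

def fitness(params):
    # Invented black-box objective: no gradients available, only a score.
    return -np.sum((params - 0.5) ** 2)

dim, pop_size, sigma, lr = 10, 50, 0.1, 0.02
params = np.zeros(dim)

for generation in range(300):
    noise = rng.normal(size=(pop_size, dim))                    # a population of random mutations
    scores = np.array([fitness(params + sigma * n) for n in noise])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)   # normalize the scores
    # Move the parent parameters toward the mutations that scored well
    params += lr / (pop_size * sigma) * noise.T @ scores

print(fitness(params))   # should be close to 0, i.e. params close to 0.5
```

Note there’s no backpropagation anywhere: all the algorithm ever sees is one scalar score per perturbed copy, which is exactly why it’s indifferent to action frequency, delayed rewards, and long horizons.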
The great Lilian Weng expands on Evolution Strategies:
… one type of black-box optimization algorithms, born in the family of Evolutionary Algorithms (EA) … Evolutionary algorithms refer to a division of population-based optimization algorithms inspired by natural selection … Exploration (vs exploitation) is an important topic in RL. [A variant of ES] encourages exploration by updating the parameter in the direction to maximize the novelty score.
She goes on to cite the interesting POET paper:
An important question is whether [machine learning] problems themselves can be generated by the algorithm at the same time as they are being solved. Such a process would in effect build its own diverse and expanding curricula, and the solutions to problems at various stages would become stepping stones towards solving even more challenging problems later in the process. The Paired Open-Ended Trailblazer (POET) algorithm introduced in this paper does just that…
Me, whenever I see the phrase evolutionary algorithm, my ears prick up.
The legendary-if-only-to-me Thompson tone detector
Long ago, when I was a Waterloo undergraduate, a professor took ten minutes to tell us an engineering anecdote which has stuck with me ever since. Specifically, the tale of the Thompson tone detector:
Dr. Thompson dabbled with computer circuits in order to determine whether survival-of-the-fittest principles might provide hints for improved microchip designs … he employed a chip only ten cells wide and ten cells across— a mere 100 logic gates … cooked up a batch of primordial data-soup by generating fifty random blobs of ones and zeros … For the first hundred generations or so, there were few indications that the circuit-spawn were any improvement over their random-blob ancestors. But soon the chip began to show some encouraging twitches … Finally, after just over 4,000 generations, the system settled upon the best program … He pushed the chip even farther by requiring it to react to vocal “stop” and “go” commands, a task it met with a few hundred more generations of evolution.
Vaguely interesting, right? But now we get to the good stuff:
As predicted, natural selection had successfully produced specialized circuits using a fraction of the resources a human would have required. And no one had the foggiest notion how it worked. The plucky chip was utilizing only thirty-seven of its one hundred logic gates, and most of them were arranged in a curious collection of feedback loops. Five individual logic cells were functionally disconnected from the rest— with no pathways that would allow them to influence the output— yet when the researcher disabled any one of them the chip lost its ability.
The five separate logic cells were clearly crucial to the chip’s operation, but they were interacting with the main circuitry through some unorthodox method— most likely via the subtle magnetic fields that are created when electrons flow through circuitry, an effect known as magnetic flux. There was also evidence that the circuit was not relying solely on the transistors’ absolute ON and OFF positions like a typical chip; it was capitalizing upon analogue shades of gray along with the digital black and white.
That Thompson tone detector is the closest thing to outright magic that modern engineering has ever discovered … at least until modern AI, with which it shares its frustrating inconsistency, lack of interpretability, and slightly eldritch vibe. As the product of an evolutionary algorithm ourselves (you know, the one called “evolution”), it behooves us to have considerable respect for the power of the approach … along with concern for its baffling incomprehensibility.
Can we use not just RL, but evolutionary algorithms, to improve LLMs? We can say with cautious but increasing optimism that the answer is yes. To the above papers I add this interesting recent Substack post, “How I came in first on ARC-AGI-Pub using Sonnet 3.5 with Evolutionary Test-time Compute,” by Jeremy Berman:
I was inspired by genetic algorithms and have been referring to this approach as Evolutionary Test-time Compute … LLMs can compensate for their generalization limitations through scaled test-time compute guided by evolutionary principles. By generating waves of candidate solutions and using their performance to guide further compute allocation, we can efficiently explore the solution space even with limited prior knowledge.
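The shape of that loop, as I understand it, looks roughly like the sketch below. To be clear, this is my own illustration rather than Berman’s code, and `ask_llm` / `score_candidate` are placeholder names for whatever model call and grading harness you would actually plug in.

```python
import random

random.seed(0)

def ask_llm(prompt: str) -> str:
    # Placeholder: swap in a real model call here. Echoing part of the prompt
    # keeps the sketch runnable without any API key.
    return "candidate solution for: " + prompt[:60]

def score_candidate(candidate: str, examples) -> float:
    # Placeholder: in practice, the score would be how many of the task's
    # training examples the candidate actually solves.
    return random.random()

def evolve(task_description: str, examples, waves: int = 5, pop: int = 8) -> str:
    # Wave 0: a population of independent first attempts.
    population = [ask_llm("Propose a solution to: " + task_description) for _ in range(pop)]
    for _ in range(waves):
        ranked = sorted(population, key=lambda c: score_candidate(c, examples), reverse=True)
        parents = ranked[: pop // 2]                     # selection: keep the fittest half
        children = [
            ask_llm(
                "Here is a promising but imperfect solution: "
                + random.choice(parents)
                + " Improve it, fixing the cases it gets wrong."
            )
            for _ in range(pop - len(parents))           # "mutation" via re-prompting
        ]
        population = parents + children
    return max(population, key=lambda c: score_candidate(c, examples))
```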
So what exactly are you suggesting?
I don’t pretend to be an AI researcher; just a (not-so-) humble AI engineer. That said, it seems very likely to me that this reinforcement learning renaissance will presage a profusion of new attempts to incorporate evolution strategies—and evolutionary algorithms more generally—into LLMs, including both training and inference.
Speculating even more wildly: is this, perhaps, an aspect of what SSI is doing? Maybe! Ilya Sutskever was a co-author of that Evolution Strategies OpenAI paper. Though I wouldn’t be surprised if they were working on evolutionary algorithms, and recurrent neural networks, and GANs, all at the same time. That fireball called “transformers” has (understandably) consumed almost all the oxygen in the field over the last five years; but as and when they begin to S-curve, there are plenty of other approaches to be added to the arsenal.
Put another way: tasks which currently seem Sisyphean may instead become Icarean.
Evolutionary irreducibility
One quick coda, since I’m already speculating wildly about my pet obsessions. I can’t talk about evolutionary algorithms and not talk about cellular automata. I mean the following is word-for-word from my latest novel:
Conway's Game of Life, devised by mathematician John Conway on graph paper in 1970. In the game, as time ticked forward, the “life” or “death” of a square depended on very simple rules: live squares with two or three live neighbors survived; dead squares with three live neighbors came to life; and all others died. From these deceptively simple constraints, extraordinarily complex systems emerged.
The more general term for such games, or systems, was “cellular automata.” They had fascinated mathematicians, engineers, and philosophers for generations with their ability to turn simple rules and patterns into astonishing complexity and chaos. It had been shown that the Game of Life, like most cellular automata, was itself a kind of programming language that could incorporate into itself anything expressible in software. Some avant-garde scientists such as Stephen Wolfram had even suggested, long before [spoiler elided], that our universe itself was a form of cellular automaton.
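Those rules really do fit in a dozen lines of code, which is rather the point. Here’s a minimal NumPy version (with wrap-around edges), seeded with the famous glider that marches diagonally across the board forever:

```python
import numpy as np

def life_step(grid):
    """One tick of Conway's Game of Life on a wrap-around grid."""
    # Count live neighbors by summing the eight shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    survive = (grid == 1) & ((neighbors == 2) | (neighbors == 3))
    born = (grid == 0) & (neighbors == 3)
    return (survive | born).astype(int)

grid = np.zeros((20, 20), dtype=int)
glider = [(1, 2), (2, 3), (3, 1), (3, 2), (3, 3)]      # the classic diagonal walker
for y, x in glider:
    grid[y, x] = 1

for _ in range(40):
    grid = life_step(grid)
```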
At this point it is customary to start talking in a slightly wild-eyed manner about how the fundamental substrate of the universe might be more like software than hardware, and about how evolutionary algorithms might unlock the secrets of the universe. Sure, maybe. But actually I want to issue a related caveat: the study of cellular automata helps to show that there are hard limits to what we can do with software alone, no matter what magic we throw at it.
That hard limit is called computational irreducibility. As Wolfram puts it in “Can AI Solve Science?”, a piece which very explicitly reinforces Betteridge’s Law:
To this ultimate question we’re going to see that the answer is inevitably and firmly no … the point is that if we want to “solve everything” we’ll inevitably be confronted with computational irreducibility, and there just won’t be any way—with AI or otherwise—to shortcut just simulating the system step by step … discrete computational systems like cellular automata are rife with computational irreducibility.
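If you want to feel that irreducibility in your fingers, here is Wolfram’s favorite specimen, Rule 30: a one-line update rule whose center column is, as far as anyone knows, impossible to predict any faster than by simulating every row before it.

```python
def rule30_step(row):
    # Pad with dead cells so the pattern can grow one cell per side per step.
    padded = [0, 0] + row + [0, 0]
    # Rule 30: new cell = left XOR (center OR right)
    return [padded[i - 1] ^ (padded[i] | padded[i + 1]) for i in range(1, len(padded) - 1)]

row = [1]                                  # start from a single live cell
for _ in range(16):
    print("".join("#" if c else " " for c in row).center(40))
    row = rule30_step(row)
```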
Cellular automata are another form of evolutionary information, but one which shows the limits of what we can do with AI, or any other form of computation; no matter how intelligent any entity is, there are hard limits to what it can do. If you ever feel like Sisyphus in an incomprehensible world, there is perhaps some slim consolation in knowing that any future superintelligences will likely still feel the same.