On Rigor and LLMs
You may think LLMs are chaotic and nondeterministic. Oh, you sweet summer child.
Of Order and Chaos
Software is deterministic. If you write a program, barring some freakish one-in-a-billion hardware glitch, your hardware will run it in exactly the same way every time. This consistency is what allows the massive Tower of Babel that is modern software to work at all. Your webapp’s React code relies on its shadow DOM code, which relies on the JS engine, which relies on its browser APIs, which rely on the seven layers of the network stack, each relying on the one beneath, all of the above relying on countless libraries and helper functions, each in turn relying on the OS and the database, everything ultimately relying on compilers and thence microcode. This web of reliance is possible only because software is, again, deterministic.
LLMs are nondeterministic. Even with temperature 0 they may not respond to the same prompt with the same output — and we deliberately keep temperatures well above 0. Even if they were deterministic, unlike software, there is generally no fixed or limited set of inputs; each prompt is different, anything might go into them, and thence through the black-box neural networks of embeddings & attention heads & skip connections. LLMs are stochastic: their outputs follow unpredictable probability distributions rather than reliable algorithms. They are, again, nondeterministic.
How can you do rigorous engineering with such unreliable, unpredictable systems? The very idea makes a lot of people throw up their hands in dismay and disgust. (In my experience most highly technical LLM-haters have this at the root of their hate.) You don’t know what they do. You can’t know what they do. So how can you build reliable systems, with consistently good outputs, using LLMs as your building blocks? They’re not like software! They’re not deterministic! All is chaos!
...the thing is, though, the same is actually true for software.
Actually Everything Is Chaos
There was a little bait-and-switch above, a rhetorical magician’s trick. Yes, the hardware will run your software in the same way every time ... but the software you first give it is always wrong! When you write code, the outputs are not deterministic in any way that is meaningful to the author. Instead the outputs are truly random… because code has bugs, and you don’t know where those bugs are, or what they do. Software is deterministic in principle, sure. But software in practice is actually far more chaotic than LLMs!
This is also true — whisper it — of hardware itself. Every CPU and GPU is flawed at some level, because some microscopic impurity, even if only a few stray atoms, crept in when it was etched. Do these flaws actually matter, for any given chip? Good question! And the answer is actually ... more or less ... random. That’s right: even our hardware is more chaotic than LLMs!
And yet. We do rigorous engineering with both. How?
Eval-Driven Development
The answer is simplicity itself: we test. A sizable fraction of the real estate of every GPU and CPU, along with some 30% of the cost of making them, is devoted to test circuitry whose only function is to answer the question of whether this chip actually works. Software engineers write unit tests, end-to-end tests, functional tests, etc.; run batteries of such tests on every single build; and also simply play around with whatever they just wrote — call that “vibe testing.”
Once upon a time, not so long ago, vibe testing was all that engineers did, and that was if you were lucky. Automated tests were almost unheard of outside of specialized applications. Testing was done by batteries of QA people who were considered highly fungible, relatively unskilled labor. It took decades — generations — for the industry to promote testing to its rightful primacy. OK, nobody actually does so-called “test-driven development,” but approximately everybody in software now agrees that testing is vital, writes tests, runs tests, and in general treats tests as first-class software-engineering citizens.
...Alas it kinda looks like we’re recapitulating that same history for AI engineering. Does anybody really do “eval-driven development”? No. Which is fine. But do people take evals - that is to say, rigorous, verifiably high-signal, quantifiable evaluations of everything LLMs generate at any point in your system - as seriously as they take software tests?
In my experience, they do not ... or at least, it is not yet instinctive / automatic. It’s only a matter of time until it is; but that interim will be quite painful for all concerned, just as the pre-“tests: very important, actually” period was quite painful in software.
EDD for You and Me
Evals for LLMs are actually much more important than tests for software. When software goes wrong, it’s usually in a very obvious way; systems crash or emit garbage. When LLMs go wrong, it’s usually because they generate useless slop that looks plausibly like valuable outputs ... and then they argue convincingly that it’s the latter.
There are a few approaches to determining, rigorously, whether you should believe what an LLM tells you:
1. Double- / Triple-Checking
This is pretty simple. If you want an LLM to read a long transcript and summarize the important facts, how do you know it reported a) all the important facts, and b) only facts which were actually there, not hallucinated? A simple approach is to ask two different LLMs to do this, highlight divergences between the two, and then resolve their differences. Any one LLM can (albeit quite rarely, for in-context work like this) hallucinate or miss key details, but two different LLMs doing so in exactly the same way is very unlikely, and if you want to make it vanishingly unlikely, use an ensemble of three or more.
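To make that concrete, here is a minimal sketch of the reconciliation step in Python. It assumes each model’s summary has already been reduced to a list of short fact strings (the extraction prompts themselves are provider-specific and omitted), and the `normalize` helper is a deliberately crude stand-in for real fact matching:

```python
from collections import Counter

def normalize(fact: str) -> str:
    # Crude canonicalization so trivially different phrasings collide;
    # a real system would match facts via embeddings or another LLM call.
    return " ".join(fact.lower().split()).rstrip(".")

def reconcile(fact_lists: list[list[str]], quorum: int = 2) -> tuple[list[str], list[str]]:
    """Majority-vote across fact lists produced by independent models.

    Returns (agreed, divergent): facts reported by at least `quorum`
    models, and facts that need a tiebreaker model or a human to resolve.
    """
    votes = Counter()
    for facts in fact_lists:
        for fact in {normalize(f) for f in facts}:  # one vote per model
            votes[fact] += 1
    agreed = [f for f, n in votes.items() if n >= quorum]
    divergent = [f for f, n in votes.items() if n < quorum]
    return agreed, divergent

# Toy run: two of three "models" agree on the key facts; one hallucinates.
summaries = [
    ["The deal closed in March 2021.", "Revenue grew 40%."],
    ["The deal closed in March 2021.", "Revenue grew 40%.", "The CEO resigned."],
    ["Revenue grew 40%.", "The deal closed in March 2021."],
]
agreed, divergent = reconcile(summaries)
print("agreed:", agreed)         # both real facts survive the vote
print("to resolve:", divergent)  # ["the ceo resigned"]
```

The `quorum` parameter is where double-checking becomes triple-checking: with three models and a quorum of two, a single model’s hallucination can never survive the vote on its own.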
2. Real-World Validation
Take, for instance, my former employers FutureSearch, who built (and build) AI research agents. If you want to research a question like, “How many nuclear reactors were built worldwide in the 1990s?” or “What nation has the highest GDP per cow?”, you can ask such an agent, and it will come back with answers that sound pretty plausible. But will they be correct? How can you tell whether the research agent you’ve built is actually any good?
Here your outputs are amenable to real-world evals. Build a large battery of such questions - an evaluation set - research the correct answers yourself, run your agents against them, and score how well they did. This process is replete with its own difficulties (definitions and edge cases, answers changing month to month, availability of research materials, etc.) but the rigor is clear. You construct a large, diverse dataset of problems that you want your LLM-powered system to solve, and you score how well it solves those problems. This gives you a reliable, high-signal measure of how good your agent is.
(How good does it need to be? Really depends on the problem. A lot of the time the huge advantage you get from LLMs is that verifying a solution is vastly easier than finding one, in which case “good enough” can often save you enormous amounts of time and money. If you’re building an autopilot, God forbid, not so much.)
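As a sketch of what such a harness looks like: the `dummy_agent` below is a stand-in for your actual LLM pipeline, and both the gold answers and the normalization are placeholders, not researched facts (real evals need per-question grading logic for numeric tolerances, units, date ranges, and so on):

```python
from typing import Callable

def normalize(answer: str) -> str:
    # Toy normalization; real evals need per-question graders.
    return " ".join(answer.lower().split()).rstrip(".")

def run_evals(agent: Callable[[str], str], eval_set: list[dict]) -> float:
    """Score an agent against a battery of researched questions.

    Each eval item is {"question": ..., "answer": ...}, where "answer"
    is the gold answer you researched yourself.
    """
    correct = 0
    for item in eval_set:
        got = agent(item["question"])
        ok = normalize(got) == normalize(item["answer"])
        correct += ok
        print(f"{'PASS' if ok else 'FAIL'}: {item['question']!r} -> {got!r}")
    return correct / len(eval_set)

# Stand-in for your actual research agent; the answers below are
# placeholders for the demo, not researched facts.
def dummy_agent(question: str) -> str:
    return "52" if "nuclear reactors" in question else "no idea"

eval_set = [
    {"question": "How many nuclear reactors were built worldwide in the 1990s?", "answer": "52"},
    {"question": "What nation has the highest GDP per cow?", "answer": "Japan"},
]
print(f"score: {run_evals(dummy_agent, eval_set):.0%}")
```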
3. Well-Calibrated Judges
Sometimes, though, you don’t have gold-standard real-world data; sometimes you’re asking your LLMs to exercise judgement. Fear not! There is a rigorous solution to this too. First, use an LLM-as-judge, i.e. write a “rubric” - a prompt with which one LLM judges the output of another in a quantified way, usually by scoring various criteria on a Likert scale. So far so good, but how can you trust it?
You can’t ... unless it’s calibrated. Construct another broad eval dataset, this time of inputs that require judgement. (Synthetic data is generally fine and good here.) Then, for each item in that dataset, have human beings - reasonably expert ones, as relevant to the field - judge those inputs on the same Likert scale. Score that human data; it’s gold. And, uh, make sure those humans generally agree with each other!
Then, simply, compare the human results to your LLM-as-judge results — your first attempt at this will likely be a very vivid example of why you are doing this — and iterate on your LLM-as-judge prompt until it is well-calibrated, i.e. until it generally agrees with most of your humans most of the time. (Needless to say there are rigorous ways to measure this.) Then, and only then, you can be confident you have a high-signal evaluation with good judgement. How well and how consistently does the judge have to reflect the humans? Again, it depends on what you’re doing. The more eval work you do, the better your evals will get, likely with diminishing returns ... a classic engineering trade-off.
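One such rigorous measure, sketched below: compare your judge’s Likert scores against the human consensus using a rank correlation plus a tolerance check. All the scores here are toy numbers, numpy and scipy are assumed available, and weighted Cohen’s kappa or Krippendorff’s alpha are the more formal inter-rater statistics if you want them:

```python
import numpy as np
from scipy.stats import spearmanr

# Per-item scores on a 1-5 Likert scale: rows of `human` are raters,
# columns are eval items; `judge` is your LLM-as-judge on the same items.
# (All numbers are toy data for illustration.)
human = np.array([
    [4, 2, 5, 3, 1, 4],
    [5, 2, 4, 3, 2, 4],
    [4, 3, 5, 2, 1, 5],
])
judge = np.array([4, 2, 5, 4, 1, 4])

consensus = human.mean(axis=0)  # and first check the raters agree with each other!

rho, _ = spearmanr(judge, consensus)                    # rank agreement
within_one = np.mean(np.abs(judge - consensus) <= 1.0)  # tolerance agreement

print(f"Spearman rho vs human consensus: {rho:.2f}")
print(f"Items within +/-1 point of consensus: {within_one:.0%}")
# Iterate on the judge prompt until both numbers clear whatever bar your
# application demands.
```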
Worth noting that such evals can morph into reinforcement learning with surprising ease ... but also that reinforcement learning is really, really good at reward-hacking. (In fact RL should arguably be renamed “reward hacking,” to highlight that you have to be certain you are only rewarding the model you’re training for good behavior, or else you will most certainly reward it for subtly bad behavior.)
EDD Is Not Enough
Approaches like the above are necessary for every single LLM output in your system. Sorry. But come on now. Would you look at an entirely untested block of code in your mission-critical pipeline and think “eh, probably good enough”?
Necessary … but not sufficient. Which should surprise no one. When a new major model is released, people flock to its benchmark scores to get a sense of it, but they know they can’t really say anything about it until they (or trusted others) “vibe test” it. You cannot trust the benchmarks until you verify the vibes.
As with LLM models, so with LLM systems — and as with software, where you must “smoke test” what you just wrote even if all automated tests pass. (And as with life; if The Statistics say something, but Your Experiences say something very different, it’s possible you’re experiencing extreme selection bias, but you shouldn’t rule out the possibility that your vibe-testing of reality is showing The Statistics to be skewed or hacked.) The outputs of your evals are extremely useful and important, and can even be a basis for enormously powerful automated iteration ... or reinforcement learning ... but you still gotta vibe-check them on the regular, too.
Sorry. But that’s all part of the rigor. And one thing you’ll learn very quickly, when building LLM systems, is that, like all engineering, in practice you won’t get anywhere meaningful without rigor. That’s what engineering ultimately is.


