LLMs excel at routine coding tasks, but evaluating their outputs is a System 2 problem. Sharing lessons learned from failed “LLM-as-a-judge” attempts - and how to build reliable evaluation instead.

In his book Thinking, Fast and Slow, Kahneman talks about two systems of thinking.

  • System 1 is fast - it handles the everyday stuff automatically, with little effort. It makes assumptions and sometimes sacrifices accuracy in favor of just getting on with your day.
  • System 2 is slow - it kicks in when something requires real focus and logic. It’s thorough and deliberate - it’s what you use when you need to make a turn and the traffic lights don’t work, and you put 100% of your attention on your driving.

Kahneman claims we humans are built to avoid System 2 as much as possible. It’s “smarter”, but it’s also a slow, deliberate, energy-consuming and uncomfortable system. I tend to agree with him, at least from my personal experience, and I also believe the same principles apply when it comes to coding.

If I had to put a number on it, I’d say coding is 80% System 1 and 20% System 2. Most coding tasks are surprisingly routine: setting up boilerplate code, defining APIs, adding tests, or building side menus. You’ve done them a hundred times. Your brain’s on autopilot. And these are exactly the kinds of tasks AI can already do quite well for you.

But that other 20% - that’s where you actually have to think. That’s when you design a new algorithm, fix a deep bug, or make an architecture decision that changes how the system behaves. These are the tasks you hand off to your senior engineers - that’s where expertise shines.

Before LLMs became an everyday tool, writing unit tests and automations was definitely a System 1 task. Depending on the company and its policies, engineers would either allocate some time at the end of each task to write tests (and don’t lecture me on TDD, I’ve never seen one company do it effectively for long!) or hand things off to QA teams to implement them.

It was routine: go over the code, find edge cases, make sure everything’s reasonably covered.

But since LLM-based applications started taking over, this whole approach to testing began to shift. Because LLMs are non-deterministic, traditional unit and integration tests don’t really work anymore. You can still test the deterministic components of a system - but if your app gives different answers to similar inputs, your old test suite can’t give you that sweet 95–98% coverage that helps you fall asleep at night…

The need for evaluation

Unsurprisingly, this is also where a lot of teams get stuck.

Building that agentic AI POC is all fun and games, until you try to actually optimize the results or benchmark the quality of different versions. Then someone has to sit there manually reading through dozens of outputs, comparing them to expected answers, and deciding if they’re "close enough."

That’s a pure System 2 task - slow, painful, and deeply unpleasant. It’s subjective, hard to reproduce, and impossible to scale. The result is often poor-quality evaluation that drifts over time, until stakeholders conclude it’s all too unstable to release to production. Your project quietly gets buried until “something better” comes along. There’s no fun in that.

Enter LLM as a judge

But then comes the tempting shortcut:

"What if we just use an LLM to judge the answers?"

Let it tell you whether an output is correct, factual, or similar to what the product team expects. It does sound great - effortless, scalable, objective. A classic System 1 solution: do it once, get a number, move on. You can even get the nice dashboard with an average "correctness" score to keep everyone happy.

Unfortunately, both research and experience show the same thing - LLMs are terrible judges. They have a tendency to agree, to overestimate quality, and to reward answers that sound right over those that are right.

According to a recent paper, "LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks", which tested state-of-the-art models on a wide range of benchmarks,

"there is a substantial variance across models and datasets in how well their evaluations align with human experts."

Another comprehensive survey reviewing LLM-based evaluation methods found that their reliability is limited by

"the depth of domain-specific knowledge, limitations in reasoning abilities, and the diversity of evaluation criteria."

My own experience backs this up. When we tried to evaluate the performance of ScoutAI, the agent we developed for Max Security, a global intelligence provider, identifying a "correct" answer wasn’t a simple task - it needed to be concise, but still cover all key points from a given article.

We tried multiple strategies:

  • comparing the generated answer directly to the expected one,
  • extracting and matching key points from both,
  • and even using LLMs to grade based on coverage and relevance (sketched below).
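
For illustration, the grading strategy looked roughly like this. It’s a sketch, not the production ScoutAI code - the prompt wording, model name, and 1–5 scale are all assumptions:

```python
# Hypothetical LLM grader scoring coverage and relevance against a reference answer.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADER_PROMPT = """You are grading a generated answer against a reference answer.
Score two criteria from 1 to 5:
- coverage: does the answer include every key point from the reference?
- relevance: does it avoid unrelated or invented content?
Reply as JSON: {{"coverage": <int>, "relevance": <int>}}

Reference answer:
{reference}

Generated answer:
{candidate}
"""

def grade(reference: str, candidate: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            reference=reference, candidate=candidate)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```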

None of these worked reliably, and our automatic evaluation reached only 70–75% agreement with human judgment - meaning the model often rated an answer as “wrong” when the customer thought it was fine, or “right” when it clearly wasn’t.

Meanwhile, the system kept growing, users kept generating more outputs, and we still had no dependable way to measure performance without humans in the loop.

So what can you do?

A lot of the things I’m going to mention are classic solutions. There’s nothing novel about them - but they add stability to an otherwise non-deterministic problem.

Evaluate retrieval

Before there was semantic search and hybrid search, there was just search. And back then, search evaluation metrics were something most engineers could safely ignore. As long as the right stuff showed up on the results page, of course, everyone was happy.

But those metrics - like MAP (Mean Average Precision) and NDCG (Normalized Discounted Cumulative Gain) - are still incredibly useful today, in the context of RAG applications and tools. You can use them to ensure that your retrieval layer is getting the expected documents for a given query.

This step removes almost all of the AI-driven uncertainty from your scores and isolates the quality of your document processing and embeddings.

For example, let’s say you’re searching your company’s HR system for documents that answer the question:

"How do I submit an expense report?"

Evaluating retrieval allows you to verify that the relevant documents are returned - and that they appear in the correct order of relevance. And if your retrieval is solid, the rest of the flow (reranking, summarization, generation) has a much higher chance of producing a good answer.

In other words: Get your retrieval right first. It’s the most deterministic, measurable, and stable part of the system - and it’s the foundation of everything else.
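
As a concrete starting point, here is a minimal sketch of that check. It assumes a search_documents() function that returns ranked document IDs and a small hand-labeled eval set - both are stand-ins for your own retriever and data:

```python
# Binary-relevance NDCG@k over a hand-labeled eval set of query -> relevant doc IDs.
import math

def ndcg_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """1.0 means every relevant document is ranked at the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank 0 contributes 1/log2(2) = 1.0
        for rank, doc_id in enumerate(retrieved_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Hypothetical eval set: query -> IDs of documents a human marked as relevant.
EVAL_SET = {
    "How do I submit an expense report?": {"hr-expenses-001", "hr-policies-017"},
}

def evaluate_retrieval(search_documents) -> float:
    """Average NDCG@10 across the eval set for a given retrieval function."""
    scores = [
        ndcg_at_k(search_documents(query), relevant_ids)
        for query, relevant_ids in EVAL_SET.items()
    ]
    return sum(scores) / len(scores)
```

Because nothing here involves a model call, the score is fully reproducible - if it drops after a change to chunking or embeddings, you know exactly which layer to blame.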

Use LLMs in moderation

"It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail."

I guess it has to be said, because I’ve seen so many people get this wrong: use LLMs in moderation - and only when you actually need them.

Not every task needs a large model, and not every piece of logic needs to be "AI-powered."

  • Use regular keyword queries instead of semantic search where it makes sense.
  • Use rule-based parsers instead of asking an LLM to extract fields from structured data.
  • Use regex instead of an LLM to find files, logs, or patterns.
  • Use predefined query templates instead of generating SQL dynamically with an LLM.
  • Use math formulas to calculate numeric values like average revenue instead of an LLM.

LLMs are great at reasoning and language, but they are not a replacement for good engineering. If the task has clear rules, let code handle it. Save the model for when things get fuzzy, like routing, planning, or answer generation.
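
To make the regex point concrete, here is a trivial sketch - the field names and formats are made up for the example, but the idea is that deterministic extraction needs no model call at all:

```python
# Pull totals and dates out of receipt or log lines with plain pattern matching.
import re
from datetime import date

AMOUNT_RE = re.compile(r"total[:\s]*\$?(\d+(?:\.\d{2})?)", re.IGNORECASE)
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def parse_receipt_line(line: str) -> dict:
    """Deterministic field extraction - no tokens spent, no surprises."""
    amount = AMOUNT_RE.search(line)
    day = DATE_RE.search(line)
    return {
        "total": float(amount.group(1)) if amount else None,
        "date": date(*map(int, day.groups())) if day else None,
    }

assert parse_receipt_line("2024-03-05  Lunch  Total: $42.50") == {
    "total": 42.5,
    "date": date(2024, 3, 5),
}
```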

Unit-test everything else

LLM systems still have plenty of deterministic parts - utilities, connectors, APIs, and data pipelines.
Make sure that anything that can be covered by unit tests, is.

For example, unit-test your prompt construction so you know the bugs aren’t coming from poorly formatted prompts or broken context assembly. When something goes wrong, you’ll want to be sure the problem is with the model, not your glue code.
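
For example, something as small as this already catches a whole class of bugs. It uses pytest conventions, and build_prompt() is a stand-in for your own prompt assembly code:

```python
# Sketch: unit tests for prompt construction (build_prompt is hypothetical).
def build_prompt(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

def test_prompt_contains_question_and_all_context():
    prompt = build_prompt("How do I submit an expense report?", ["chunk A", "chunk B"])
    assert "How do I submit an expense report?" in prompt
    assert "chunk A" in prompt and "chunk B" in prompt

def test_prompt_handles_empty_context():
    prompt = build_prompt("How do I submit an expense report?", [])
    assert "Question: How do I submit an expense report?" in prompt  # no crash, no stray None
```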

Customer feedback

Always give users a simple way to rate or flag responses in any GenAI application. Thumbs up/down, a quick star rating, or a short comment box - it doesn’t matter. What matters is closing the loop.

Real feedback helps you spot regressions, measure perceived quality, and debug those mysterious "it used to work yesterday" cases. It’s the cheapest and most reliable evaluation signal you’ll ever get.
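
A bare-bones version of closing that loop is just storing every rating next to the exact response and model version it refers to. The field names here are illustrative:

```python
# Append each piece of user feedback to a JSONL file, keyed to the stored response.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class Feedback:
    response_id: str    # ties the rating back to the stored model output
    model_version: str  # lets you compare versions later
    rating: int         # e.g. +1 / -1 for thumbs up / down
    comment: str = ""

def record_feedback(feedback: Feedback, path: str = "feedback.jsonl") -> None:
    entry = {"timestamp": datetime.now(timezone.utc).isoformat(), **asdict(feedback)}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record_feedback(Feedback(response_id="resp-123", model_version="v0.4.2", rating=-1,
                         comment="Missed the reimbursement deadline section"))
```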

Review manually, but wisely

Even if you can’t fully trust LLMs to be judges, you can still make human review efficient and meaningful.

Start by defining a reasonably sized but diverse evaluation set - a few dozen examples that represent the real range of your system, including different topics and edge cases. Then automate everything around it with good old CI/CD: generate results automatically for every pull request, store the outputs, build a simple review UI or dashboard, and keep historic versions so you can track changes over time. You can even run an LLM as a judge - just don’t use it as the single source of truth.
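
Here is roughly what that CI step can look like. It assumes an eval_set.json file and a generate_answer() function of your own - the layout is a sketch, not a specific tool:

```python
# Run the eval set against the current build and store outputs per commit,
# so a reviewer (or dashboard) can diff runs between versions.
import json, pathlib, subprocess

EVAL_SET = json.loads(pathlib.Path("eval_set.json").read_text())  # [{"id", "question", "expected"}, ...]

def run_eval(generate_answer) -> None:
    commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
    out_dir = pathlib.Path("eval_runs") / commit
    out_dir.mkdir(parents=True, exist_ok=True)
    for case in EVAL_SET:
        result = {
            "id": case["id"],
            "question": case["question"],
            "expected": case["expected"],
            "answer": generate_answer(case["question"]),
        }
        (out_dir / f"{case['id']}.json").write_text(json.dumps(result, indent=2))
    # The review UI reads eval_runs/<commit>/ and highlights what changed
    # compared to the previous run, so humans only review actual diffs.
```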

This way you can focus your energy on what matters - spotting real regressions and understanding why something improved or broke - instead of running blind tests or chasing random feedback.

Summary

So, has LLM-as-a-Judge actually worked for you?

I'm genuinely curious. For me, it's been a mix of good old discipline combined with tooling and automation: testing everything deterministic, building good eval sets, and making review easy and consistent.

The irony? We integrated LLMs into applications largely to avoid System 2 thinking, but evaluating them properly still requires it. As always in life, there's no magic solution, just deliberate engineering.