Cross-Temperature Hallucination Testing for Sanity Checks

Cross-check AI outputs by comparing responses across temperatures and against smaller models to quickly flag hallucinations and verify with real sources.

Hallucination has become much less of a problem as AI models have grown more generally intelligent, learned more about the world, and gained tools like web search and database lookups to keep their answers grounded. The problem hasn't gone away completely, but it's nowhere near the issue it used to be.

A few years ago, when hallucinations could cause real problems, there were some genuinely fun tricks you could use to reduce them—or at least mitigate them. One of my favorites, especially for catching “hallucinated facts,” was basically to make the model check itself by comparing answers across settings and even across model sizes.

Here’s the idea.

Take a model like GPT-3 DaVinci and ask it a question. For example:

Can you describe a magic trick by Harry Houdini involving a trunk?

Then generate two outputs:

  • one at a low temperature (more conservative, more likely to stick to what it “knows”)
  • one at a high temperature (more creative, more willing to fill gaps)

Often, when the model has no idea what the real answer is, the high-temperature output is more likely to make something up. That's the whole point of the test: you're not just looking at the content, you're looking at whether the model stays consistent with itself when you change the creativity dial.
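The temperature check above can be sketched in a few lines of Python. Everything here is a sketch under assumptions: `ask_model` is a placeholder for whatever completion API you're using (it takes a prompt and a temperature and returns a string), and the 0.5 divergence threshold is an arbitrary starting point, not a tuned value.

```python
import difflib


def divergence_score(answer_a: str, answer_b: str) -> float:
    """Return 1.0 minus a rough string-similarity ratio (0.0 = identical)."""
    return 1.0 - difflib.SequenceMatcher(
        None, answer_a.lower(), answer_b.lower()
    ).ratio()


def looks_hallucinated(prompt: str, ask_model, threshold: float = 0.5) -> bool:
    """Flag a prompt when its low- and high-temperature answers diverge.

    `ask_model(prompt, temperature=...)` is a placeholder for your actual
    completion call; `threshold` is an arbitrary starting point.
    """
    conservative = ask_model(prompt, temperature=0.1)
    creative = ask_model(prompt, temperature=1.0)
    return divergence_score(conservative, creative) > threshold
```

With a stubbed `ask_model` that returns canned strings, two near-identical answers score low divergence and unrelated answers score high. In practice, raw string matching is crude; a semantic comparison (e.g. embedding similarity) would be a better measure of whether two answers actually agree.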

There are questions where this won’t tell you much. If you ask:

What year did we land on the moon?

you’re generally going to get “1969” even at a higher temperature, because that’s a well-known fact and deeply represented in the model’s training data.

But if you ask something like:

What year did we land on Mars? Did humans land on Mars?

you might get an answer anyway—even though it’s a trap. And in that situation, the low-temperature answer and the high-temperature answer are often going to diverge. That disagreement is a useful signal.

When temperature doesn’t catch it: use a smaller model

Sometimes the model will stay consistent across temperatures, even when it’s wrong. In those cases, the next move I liked was to bring in a smaller model.

Back in the GPT-3 days, you could take DaVinci and compare it with Curie, a smaller version of GPT-3, and ask the exact same question. This was great for detecting when a model was confidently producing an answer that didn’t really exist.

The reason this works is that different-sized models tend to hallucinate differently. The smaller model might:

  • produce a much less detailed answer (because it can’t elaborate as well), or
  • oddly enough, be just as verbose, but invent different details than the larger model

Either way, if you’re looking for a quick sanity check, disagreement between a bigger model and a smaller model is another strong clue that you’re in hallucination territory.
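The cross-model comparison can use a coarser signal than string similarity: if the two models are describing the same real thing, their answers should at least share the important content words. A minimal sketch, where the 0.3 overlap cutoff is an arbitrary starting point and the stopword list is deliberately tiny:

```python
import re

# A deliberately small stopword list; expand it for real use.
_STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "was", "is"}


def content_words(text: str) -> set:
    """Lowercased alphanumeric tokens, minus a few common stopwords."""
    return {
        w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in _STOPWORDS
    }


def models_agree(answer_big: str, answer_small: str, min_overlap: float = 0.3) -> bool:
    """Jaccard overlap of content words between two models' answers.

    Low overlap hints that at least one model is inventing details.
    The 0.3 cutoff is an arbitrary starting point, not a tuned value.
    """
    big, small = content_words(answer_big), content_words(answer_small)
    if not big or not small:
        return False
    return len(big & small) / len(big | small) >= min_overlap
```

Two answers grounded in the same fact tend to share names, dates, and key nouns; two independently hallucinated answers usually don't, which is exactly the disagreement you're looking for.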

Why this method still matters

Even today, this technique can be useful for certain kinds of hallucinations: check whether the model disagrees with itself at different temperature settings, or whether it disagrees with a smaller (or older) version of itself.
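Put together, the whole sanity check is a short loop over configurations. This is a sketch under assumptions: `configs` is a list of hypothetical `(label, ask_fn)` pairs you define for your own setup (different temperatures, model sizes, or versions), each `ask_fn(prompt)` returns a string, and the string-similarity threshold is a rough starting point.

```python
import difflib


def cross_check(prompt: str, configs, threshold: float = 0.5):
    """Ask the same prompt under every configuration and report which
    pairs disagree. `configs` is a list of (label, ask_fn) pairs, where
    ask_fn(prompt) returns a string answer. Returns a list of
    (label_a, label_b) pairs whose answers diverged."""
    answers = [(label, ask(prompt)) for label, ask in configs]
    suspects = []
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            (label_a, a), (label_b, b) = answers[i], answers[j]
            similarity = difflib.SequenceMatcher(
                None, a.lower(), b.lower()
            ).ratio()
            if similarity < threshold:
                suspects.append((label_a, label_b))
    return suspects
```

An empty result doesn't prove the answer is real (all configurations can hallucinate the same thing), and a non-empty one doesn't prove it's fake; either way, the output tells you where to spend your verification effort.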

It’s not a perfect solution, and it doesn’t replace verification with real sources. But as a practical, lightweight test—especially when you’re trying to figure out whether you’re dealing with an actual fact or an invented one—it’s surprisingly effective.