Rethinking best_of in GPT-3: Why It Misleads
Relying on best_of to improve LLM accuracy is misguided. The practical fix is to define clear task boundaries with a better prompt and ground the model's interpretation with outlier examples, which can let you use a smaller model and a single-shot prompt at a fraction of the cost.
I was once asked to help a very large tech company, which will remain anonymous, use GPT-3 inside a workflow where they were trying to classify certain categories of text.
A team had been working on it for a while and got stuck, so I got pulled in to help unstick them. “Pulled in” in this case just meant hopping on a video call. There were four engineers, all working remotely, and they’d configured the system to use GPT-3 with best_of=5.
If you’re not familiar with best_of: the API generates five completions server-side and returns the one with the highest average log probability per token. In other words, the completion whose tokens the model found most likely overall “wins.”
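The selection step is just a comparison of likelihoods. Here's a minimal sketch of that scoring logic; the completions and their per-token log probabilities below are invented for illustration (the real API does all of this server-side):

```python
def mean_logprob(token_logprobs):
    """Average log probability per token -- the score best_of compares."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical samples: (completion text, per-token log-probs from the model)
samples = [
    ("positive", [-0.2, -0.9]),
    ("negative", [-0.1, -0.3]),
    ("neutral",  [-1.5, -0.7]),
]

# best_of keeps the completion whose tokens the model found most likely.
best = max(samples, key=lambda s: mean_logprob(s[1]))
print(best[0])  # prints "negative": the most *likely*, not the most *correct*
```

Note that nothing in this selection rule knows anything about your task. It can only tell you which sample the model itself found least surprising.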
And I could tell immediately they had a fundamental problem, because in my opinion best_of is just a terrible way to try to get a better answer out of an LLM.
All best_of really does is give you the most statistically likely completion among several samples. People see the phrase “best of” and assume it means “best answer,” but it doesn’t. It means “highest likelihood under the model,” which is not the same thing as “the outcome you want.”
They were running this over and over again—using the biggest GPT-3 model at the time (DaVinci), plus best_of—and still couldn’t get results they considered good enough.
Here’s why that approach tends to fail: if the most deterministic, highest-probability answer isn’t correct for your task, then sampling multiple times and selecting the most likely sample usually doesn’t rescue you. In most cases, if you want more consistency, you lower temperature and make the model more deterministic. If that deterministic behavior is still wrong, it’s almost always a prompt/specification problem, not something best_of is going to fix.
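The temperature point can be made concrete. Sampling temperature rescales the model's logits before the softmax; as it approaches zero, probability mass collapses onto the single highest-scoring token and the model becomes effectively deterministic. A toy sketch with invented logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores

for t in (1.0, 0.1):
    print(t, [round(p, 4) for p in softmax_with_temperature(logits, t)])

# Lower temperature concentrates nearly all the mass on the top token.
# If that token is wrong for your task, determinism makes it reliably wrong.
```

Which is exactly the trap: if greedy decoding gives the wrong answer, neither more sampling nor colder sampling fixes the specification.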
So I looked at what they were doing. The prompt was basically a thin instruction to a classifier—something like, “determine the sentiment” (hypothetically), with very little guidance. The issue is that tasks like that are often subjective. Is the response sarcastic? Is it serious? Is it playful? Is it hostile? Without an operational definition, you’re not really asking for classification—you’re asking for an interpretation.
The fix was not “more sampling.” The fix was: give the model better boundaries.
In practice, the most reliable way to do that was to provide examples. And the way I like to use examples isn’t to give it a bunch of lookalike cases that it’ll overfit to. Instead, I prefer to give outlier examples—an extreme on one end and an extreme on the other end—so the model has a clear sense of the spectrum, and then it can place real inputs somewhere between those poles.
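As a sketch of what that looks like in practice (the task and wording here are hypothetical, not the client's actual prompt), the few-shot examples anchor the two extremes of the spectrum rather than clustering around typical cases:

```python
# Hypothetical prompt builder: two extreme anchor examples, then the real input.
PROMPT_TEMPLATE = """Classify the tone of the message as HOSTILE, NEUTRAL, or PLAYFUL.

Message: "I hope your product burns down along with your whole company."
Tone: HOSTILE

Message: "Haha, your chatbot just told me a knock-knock joke. I love it!"
Tone: PLAYFUL

Message: "{text}"
Tone:"""

def build_prompt(text):
    return PROMPT_TEMPLATE.format(text=text)

print(build_prompt("The delivery arrived a day late."))
```

Real inputs then land somewhere between the poles, and the model has a frame for placing them, instead of being asked to invent its own definition of each label.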
Once we rewrote the prompt that way, everything changed. They didn’t need DaVinci. They didn’t need best_of=5. We could use a much smaller GPT-3 model—in this case Babbage—run it single-shot, and get a better outcome than what they’d been getting before.
And from a cost perspective, that mattered: their original setup was dramatically more expensive than it needed to be. They were essentially paying for brute force—multiple generations with the largest model—when the real lever was prompt design.
But the bigger takeaway for me had nothing to do with “use a smaller model” or “don’t use best_of.” It was about mindset.
I was working with engineers who were used to deterministic systems. In their world, you call a function, assign a variable, generate a random number—whatever it is—and it behaves in a predictable way. Language models aren’t like that. They’re fuzzy because language is fuzzy. And on top of that, English was a second language for them, which made it even more frustrating to try to coax useful behavior out of a raw GPT-3 base model.
That was one of the real pain points of the GPT-3 era: you had to bring a lot of linguistic intuition to the table. You couldn’t just write a clean spec and expect the model to snap to it. You often had to “live in prompt space” for a while—play with wording, try different framings, add examples—until the model finally latched onto what you meant.
And I think one of the biggest breakthroughs since then isn’t just that the models got smarter, or easier for average users. It’s that they became easier for people with an engineering mindset—the kind of person who expects a spec sheet to tell them exactly what something will do—because spending hours on creative language iteration to find the right prompt just isn’t playing to their strengths.