Small Model Advantages: When Smaller LLMs Outperform Bigger Ones
For large documents, extracting key points, phrases, and entities with a small model is cheaper, faster, and often more reliable than generating a full summary.
I was looking through my notes recently and found an example from early 2020 that still feels surprisingly relevant today.
We were helping a client who wanted document summarization. The catch was cost. Back then, a “large” document (say ~10,000 words) could easily run to ~20,000 tokens once you included prompt and output. And this was the era when GPT‑3 pricing was $0.06 per 1,000 tokens for the best model (Davinci).
So the math got painful fast.
A single run to summarize a big document could cost about:
- Davinci: ~$1.14
- Curie: ~$0.11
- Babbage: ~$0.023
- Ada: ~$0.016
That’s not a rounding error difference—that’s a completely different product decision.
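The arithmetic behind those estimates is just tokens × rate per 1,000 tokens. Here is a quick sketch, where the per‑1K rates are my recollection of the 2020 GPT‑3 list prices, so treat them as assumptions; they land in the same ballpark as the figures above, not exactly on them:

```python
# Per-1K-token rates: my recollection of OpenAI's 2020 GPT-3 list
# prices -- illustrative assumptions, not authoritative figures.
RATES_PER_1K = {
    "davinci": 0.06,
    "curie": 0.006,
    "babbage": 0.0012,
    "ada": 0.0008,
}

def estimate_cost(tokens: int, engine: str) -> float:
    """Dollar cost of one run: tokens * (price per 1K tokens) / 1000."""
    return tokens * RATES_PER_1K[engine] / 1000

# A ~10,000-word document at roughly 20,000 tokens:
for engine in RATES_PER_1K:
    print(f"{engine}: ${estimate_cost(19_844, engine):.4f}")
```

The spread between the top and bottom of that table is what turns "can we afford this feature?" into "which model tier is this feature built on?".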
The realization: you often don’t need a summary
At some point I realized something that sounds obvious in hindsight:
For many “summarization” use cases, what you actually want isn’t a rewritten version of the text at all.
What you want is:
- key phrases
- key points
- maybe a couple of representative sentences
In other words, you want extraction, not generation.
There have been classic NLP methods for this forever—TF‑IDF, keyword extraction, sentence scoring, etc. But they weren’t always robust, and they could be brittle across domains.
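To make that brittleness concrete, here is a toy frequency-based extractor in the spirit of those classic methods (a stand‑in for real TF‑IDF; the stopword list is ad hoc and that is exactly where these approaches tend to break across domains):

```python
import re
from collections import Counter

# Ad hoc stopword list -- in practice this is domain-dependent,
# which is a big part of why classic extraction can be brittle.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is",
             "was", "it", "by", "for", "on", "as", "that", "with"}

def keywords(text: str, k: int = 4) -> list[str]:
    """Naive keyword extraction: count non-stopword terms and
    return the k most frequent."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(k)]
```

It works surprisingly often, and fails silently whenever frequency and importance diverge.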
What surprised me in 2020 was that even a small model like Ada, roughly 75× cheaper than Davinci at the list prices of the day, was perfectly capable of reading a big body of text and pulling out the essence reliably.
I remember testing it with a passage from Wikipedia about Dracula, and it would confidently return things like:
Dracula, vampires, horror, fiction
That sounds trivial, but it was actually extremely useful. And in many cases, it was more useful than a summary, because it got straight to the “heart of it” in a way you could immediately reuse for:
- categorization
- search indexing
- clustering documents
- routing (which team should see this?)
- analytics dashboards (“what are people talking about?”)
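The "few examples" approach mentioned in the email was just few-shot prompting in the 2020 completion style. A minimal sketch of building such a prompt; the example pair is hypothetical, and you would feed the result to whatever completion client your provider offers:

```python
# Hypothetical few-shot examples: (input text, desired keywords).
FEW_SHOT = [
    ("Text about the Apollo program and the Moon landing.",
     "Apollo program, Moon landing, NASA, spaceflight"),
]

def build_prompt(document: str) -> str:
    """Assemble a completion-style few-shot prompt that ends at
    'Keywords:' so the model continues with the extracted terms."""
    parts = [f"Text: {text}\nKeywords: {keys}" for text, keys in FEW_SHOT]
    parts.append(f"Text: {document}\nKeywords:")
    return "\n\n".join(parts)

prompt = build_prompt("An article about Dracula, Bram Stoker's vampire novel.")
```

The trailing `Keywords:` is the whole trick: the model's cheapest continuation is a short comma-separated list, which is exactly the structured output you can feed into search, clustering, or routing.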
An email from back then (lightly cleaned up)
I found an early email I sent at the time. It captures the idea pretty well:
An alternative for summarization is to use a smaller model like Curie to do key points extraction.
Instead of trying to decode text and then form it back into a summary, you can use Curie with a few examples to pull out the critical parts.
If we wanted to run a query and pull data straight from Wikipedia pages, here are the estimated costs:
Cost estimate
Subject: Dracula
Keywords: Dracula, vampires, horror, fiction
Total words: 11,249
Tokens: 19,844
Cost per engine:
Davinci: $1.14
Curie: $0.11
Babbage: $0.023
Ada: $0.016
Why extraction can be better than summarization
Summaries are seductive because they feel like the “full solution.” But they introduce problems:
- You’re generating new text, which increases the chance of hallucination or subtle distortion.
- You have to decide: short summary? long summary? tone? format?
- You may lose structure that you actually wanted (entities, topics, labels).
Extraction is a lot more grounded. You’re not asking the model to invent prose—you’re asking it to point at what’s already there.
Even if modern models are far better at summarization than they were in 2020, extraction still has a nice property:
- it’s simpler
- it’s faster
- it’s cheaper
- and it’s often easier to validate
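"Easier to validate" has a very concrete meaning here: extracted phrases can be checked verbatim against the source text, which a rewritten summary cannot be. A minimal sketch of that check:

```python
def ungrounded_phrases(source: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do NOT appear verbatim in the source.
    An empty result means every extracted phrase is grounded."""
    haystack = source.lower()
    return [p for p in phrases if p.lower() not in haystack]

doc = "Dracula is an 1897 Gothic horror novel about a vampire."
assert ungrounded_phrases(doc, ["Dracula", "vampire", "horror"]) == []
assert ungrounded_phrases(doc, ["werewolf"]) == ["werewolf"]
```

A real validator would normalize whitespace and handle light paraphrase, but even this crude substring check catches outright hallucinations, something no comparably cheap check exists for with generated prose.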
The funny part: today this is almost free
What’s pretty amusing now is that the task that felt expensive with Davinci in 2020 would be extremely cheap today with a state‑of‑the‑art model, not only because models got more capable, but because they got dramatically more efficient.
I don’t think most of us expected that. I think we assumed capabilities would improve and costs would sort of stay in the same ballpark.
Instead, what happened was closer to:
- capabilities improved dramatically
- and costs fell exponentially
So today, with something like a very small modern model (e.g., a “nano” tier that’s still competent at reading lots of text), you can process large documents and extract key points for a fraction of what you would have paid years ago.
And importantly: you don’t need to have it write a whole new body of text. You can just have it extract the handful of things you need, reliably, without introducing extra creative surface area.
Practical takeaway
If you’re building something and you think you need summarization, it’s worth asking:
- Do I actually need a rewritten summary?
- Or do I need tags, topics, entities, key phrases, bullet points, or key sentences?
Because if it’s the second one, you’ll usually get:
- lower cost
- lower latency
- less risk of hallucination
- and output that’s easier to plug into downstream systems
Sometimes the best “summary” is just the right set of extracted points.