Grounding Prompts with Wikidata and SPARQL

Ground model outputs in Wikidata by constructing SPARQL queries with correct property and entity IDs, optionally aided by a lightweight query generator or retrieval workflow, to fetch real data and reduce hallucinations.

Much of a prompt engineer’s life is spent trying to get a model to do one specific thing, only to have the next update do it effortlessly. It’s frustrating, but also kind of fun. The goal is progress.

And honestly, I don’t think I’ve ever really lamented the time I spent solving something that a later model could do quickly. If anything, that’s the best outcome: it means more possibilities open up for more people, and it becomes much, much easier to use these tools—which is the whole point. That’s always what I wanted more than anything else: for people to be able to use these tools not just as fun things to explore, but to get real work done.

Early on, I spent a considerable amount of time trying to solve hallucination—especially because smaller models didn’t have as much knowledge of the world. They’d invent things, or find patterns between patterns, and end up making things up. There are a lot of reasons why that happens (I’ve talked about them elsewhere), but one direction I explored was: what if we could get models to ground their answers in a structured fact database?

That’s where Wikidata came in.

For anyone who hasn’t used it, Wikidata is a companion to Wikipedia. It’s a collection of hundreds of millions of discrete facts—birth dates, national languages, borders, and so on. Just about anything you can find in Wikipedia that can be represented as an atomic fact is probably in Wikidata somewhere.

The catch is that pulling anything useful out of Wikidata usually means writing SPARQL queries.

SPARQL is basically a database query language for linked data. It lets you say, “Find entries that match these conditions,” in a way that’s part logic, part database querying. If you go to the Wikidata Query Service site, you can see examples like “Find me all the pictures of cats on Wikipedia,” and it’ll do it (though that’s obviously a huge result set). You can narrow it down to something like “Find images of cats named Larry,” or “Find everyone born between 1971 and 1975 in Indiana,” and it will try to return the matching entities.

To make this more concrete, here are a few simplified examples of what SPARQL queries can look like when working with Wikidata.

A basic “give me a few humans and their birth dates” query:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?person ?personLabel ?birthDate WHERE {
  ?person wdt:P31 wd:Q5;        # instance of human
          wdt:P569 ?birthDate.  # date of birth
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10

A query that matches the shape of “born between 1971 and 1975” (and returns names + birth dates):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?person ?personLabel ?birthDate WHERE {
  ?person wdt:P31 wd:Q5;
          wdt:P569 ?birthDate.
  FILTER(?birthDate >= "1971-01-01T00:00:00Z"^^xsd:dateTime &&
         ?birthDate <  "1976-01-01T00:00:00Z"^^xsd:dateTime)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50

And here’s the general pattern for “things with images” (i.e., how you might approach “pictures of cats” conceptually):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item ?itemLabel ?image WHERE {
  ?item wdt:P31 wd:Q146;   # instance of cat (conceptually)
        wdt:P18 ?image.    # image
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
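Queries like these can also be run programmatically against the public Wikidata Query Service endpoint. Here’s a sketch in Python using only the standard library; the endpoint URL and JSON results format are real, but treat the rest as a starting point rather than production code (the service rate-limits clients and asks for a descriptive User-Agent).

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_sparql_request(query: str) -> urllib.request.Request:
    """Build an HTTP request for the Wikidata Query Service.

    WDQS accepts the query as a URL parameter and returns JSON
    when asked for it via the Accept header.
    """
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url,
        headers={
            # WDQS asks clients to identify themselves; this UA is a placeholder.
            "User-Agent": "sparql-example/0.1 (illustrative)",
            "Accept": "application/sparql-results+json",
        },
    )


def run_query(query: str) -> list:
    """Execute the query and return the list of result bindings."""
    with urllib.request.urlopen(build_sparql_request(query)) as resp:
        data = json.load(resp)
    return data["results"]["bindings"]


# Example (requires network):
#   rows = run_query("SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 } LIMIT 3")
#   print(rows[0]["cat"]["value"])
```

Note that the `wd:`/`wdt:` prefixes can be omitted in queries sent to this endpoint, because WDQS predefines them.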

You can see the basic idea: you’re joining facts using these property IDs and entity IDs (like P31, P569, P18, Q5, etc.). And that leads to the real trick with Wikidata: it’s not the syntax.

Even GPT-3 could understand the shape of a query. The hard part is knowing which IDs to use. You have to know whether you’re looking for a property (a P-prefixed ID like P569) versus an entity (a Q-prefixed ID like Q5), and you have to know the right code for each one. Those IDs can feel arbitrary, and if you get them wrong, the query breaks, or returns something adjacent to what you meant.
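One practical way to look those IDs up is Wikidata’s own search API: the `wbsearchentities` action of the MediaWiki API resolves a plain-English label like “date of birth” to candidate IDs. The action name and its parameters here are real; the helper functions are just a sketch with error handling omitted:

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def search_entities_url(label: str, entity_type: str = "item") -> str:
    """Build a wbsearchentities URL for resolving a label to an ID.

    entity_type is "item" for Q-IDs or "property" for P-IDs.
    """
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": "en",
        "type": entity_type,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def lookup_id(label: str, entity_type: str = "item"):
    """Return the top-ranked ID for a label, or None if nothing matches."""
    with urllib.request.urlopen(search_entities_url(label, entity_type)) as resp:
        results = json.load(resp).get("search", [])
    return results[0]["id"] if results else None


# Examples (require network):
#   lookup_id("date of birth", "property")  # likely "P569"
#   lookup_id("house cat")                  # likely "Q146"
```

The top-ranked match isn’t always the one you meant, which is exactly why the lookup step benefits from human review or extra context.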

I tried a bunch of tactics to deal with that. I experimented with training smaller models to understand those codes implicitly, and I found I could make a pretty simple Wikidata query generator. That was a fun application: a small model that could generate queries reliably enough to pull back real data. I also experimented with using embeddings to help it “search” for the right categories and IDs. It was a great area to explore.
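The embedding idea can be sketched without any ML at all: keep a catalog of ID descriptions, score each against the user’s phrase, and take the best match. In this toy version, token overlap stands in for cosine similarity over real embedding vectors, and the catalog is a tiny hand-picked sample rather than the full Wikidata vocabulary:

```python
def score(query: str, description: str) -> float:
    """Toy similarity: fraction of query tokens found in the description.

    A real system would compare embedding vectors instead.
    """
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / len(q) if q else 0.0


# Tiny hand-picked catalog; a real one would cover thousands of IDs.
CATALOG = {
    "P31": "instance of",
    "P569": "date of birth",
    "P18": "image",
    "Q5": "human",
    "Q146": "house cat",
}


def best_id(query: str) -> str:
    """Return the catalog ID whose description best matches the query."""
    return max(CATALOG, key=lambda wid: score(query, CATALOG[wid]))
```

So `best_id("birth date")` lands on P569 even though the word order differs. The structure is the same with real embeddings; only the scoring function changes.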

It’s less necessary now. In many cases, it’s almost trivial: you can go into Codex and say, “Generate a Wikidata query for this,” and it’ll do a decent job. And if it doesn’t get the entities exactly right, it’s pretty easy to add context—like providing the relevant list of IDs—or to build a retrieval system that fetches the right entities/properties and feeds them in.
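Providing that context can be as simple as prepending a table of resolved IDs to the prompt. This is a hedged sketch, not a fixed recipe: `build_prompt` is a hypothetical helper, and the instruction wording is just one way to phrase it:

```python
def build_prompt(request: str, id_table: dict) -> str:
    """Assemble a prompt that grounds the model in known-good IDs.

    id_table maps Wikidata IDs to their English labels, e.g. as
    resolved by a search or retrieval step beforehand.
    """
    lines = [f"{wid}: {label}" for wid, label in sorted(id_table.items())]
    return (
        "Write a Wikidata SPARQL query for the request below.\n"
        "Use only these entity and property IDs:\n"
        + "\n".join(lines)
        + f"\n\nRequest: {request}\n"
    )


# Example:
#   prompt = build_prompt(
#       "everyone born between 1971 and 1975",
#       {"Q5": "human", "P31": "instance of", "P569": "date of birth"},
#   )
# The prompt goes to the model, and the model's SPARQL output
# goes to the query service.
```

Because the model only has to assemble a query from IDs you’ve already verified, the usual failure mode of inventing a plausible-looking P-number largely goes away.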

Still, I think it’s underexplored. A number of companies have tried to merge LLMs with structured databases, and it keeps getting easier as models become more capable. Sometimes models are so capable that people don’t even bother hooking them up to a structured database like Wikidata.

But you can—and there are real benefits. If you can take a lot of your data and structure it in that format, it becomes another way to make sure you’re getting the actual data you want, instead of a model inventing a plausible-sounding answer.