2% makes all the difference on the LOD cloud

In a few talks I have asserted that on the LOD cloud rules are greatly outnumbered by facts. In fact, I used a picture like this to visualise the ratio (yes, the small dot is really there):

Total size of all rules on LOD vs. size of all facts on LOD.

Today, Alan Bundy asked me if I could back this up with a quotable number, and of course when one’s PhD supervisor asks, one does not refuse. So with the help of Joe Raad, we dug out the numbers from the LOD-a-lot crawl at http://lod-a-lot.lod.labs.vu.nl/. Here’s what we got:

First some background considerations:
Because of the way Semantic Web representations work, we have to think about what it means to be a “rule”. After all, the entire Semantic Web / Linked Open Data cloud consists of atomic triples <subject predicate object>.

All triples in RDF Schema and OWL with the following predicates would count as “rules”, since they are ontological statements that allow the derivation of other statements: owl:sameAs, owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf, rdfs:subPropertyOf, owl:disjointWith, owl:differentFrom, rdfs:domain, rdfs:range.
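To make "allow the derivation of other statements" concrete, here is a minimal sketch in Python that applies three of these predicates (rdfs:subClassOf, rdfs:domain, rdfs:range) to a toy set of triples. All names in the ex: namespace are made up for illustration and are not taken from the LOD cloud, and the loop only mimics a small fragment of RDFS inference rather than a full reasoner.

```python
# Toy data: every "ex:" name is hypothetical, invented for this sketch.
facts = {
    ("ex:Edinburgh", "rdf:type", "ex:City"),
    ("ex:Frank", "ex:gotPhDIn", "ex:Edinburgh"),
}
rules = {
    ("ex:City", "rdfs:subClassOf", "ex:Place"),
    ("ex:gotPhDIn", "rdfs:domain", "ex:Person"),
    ("ex:gotPhDIn", "rdfs:range", "ex:Place"),
}


def derive(facts, rules):
    """Return the triples that follow from the facts via the rules.

    Only rdfs:subClassOf, rdfs:domain and rdfs:range are handled,
    and they are applied repeatedly until nothing new appears.
    """
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            for rule_s, rule_p, rule_o in rules:
                if rule_p == "rdfs:subClassOf" and p == "rdf:type" and o == rule_s:
                    new.add((s, "rdf:type", rule_o))   # instance of subclass is instance of superclass
                elif rule_p == "rdfs:domain" and p == rule_s:
                    new.add((s, "rdf:type", rule_o))   # subject gets the domain class
                elif rule_p == "rdfs:range" and p == rule_s:
                    new.add((o, "rdf:type", rule_o))   # object gets the range class
        if new <= known:
            return known - set(facts)
        known |= new


derived = derive(facts, rules)
for triple in sorted(derived):
    print(triple)
```

Two facts and three "rule" triples yield two statements that were never asserted: that ex:Edinburgh is an ex:Place and that ex:Frank is an ex:Person.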

The numbers:
We retrieved the following numbers from the LOD-a-lot, the largest publicly available queryable crawl of the Linked Open Data cloud:

  • owl:sameAs: 558.9 M
  • rdfs:subClassOf: 4.4 M
  • owl:equivalentClass: 1 M
  • owl:disjointWith: 450 K
  • rdfs:domain: 206 K
  • rdfs:range: 197.5 K
  • rdfs:subPropertyOf: 80 K
  • owl:equivalentProperty: 8.4 K
  • owl:differentFrom: 3.6 K

So in total that is just over 565M statements that would classify as “rules”. The total size of the LOD-a-lot crawl is 28.3B unique statements (the crawl is fully deduplicated). So that would make it just under 2% of the entire LOD cloud. (Notice also the very skewed frequency distribution of these statements: without owl:sameAs it would only be 0.02% of the entire LOD cloud.)
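The arithmetic behind these percentages can be checked with a few lines of Python. The counts are the rounded figures listed above, so the results are approximate; the variable names are just illustrative.

```python
# Rounded counts per predicate, as listed in this post.
counts = {
    "owl:sameAs": 558_900_000,
    "rdfs:subClassOf": 4_400_000,
    "owl:equivalentClass": 1_000_000,
    "owl:disjointWith": 450_000,
    "rdfs:domain": 206_000,
    "rdfs:range": 197_500,
    "rdfs:subPropertyOf": 80_000,
    "owl:equivalentProperty": 8_400,
    "owl:differentFrom": 3_600,
}
TOTAL_STATEMENTS = 28_300_000_000  # unique statements in the LOD-a-lot crawl

rule_total = sum(counts.values())
without_same_as = rule_total - counts["owl:sameAs"]

# Shares of the whole crawl, in percent.
share = 100 * rule_total / TOTAL_STATEMENTS
share_without = 100 * without_same_as / TOTAL_STATEMENTS

print(f"rule statements: {rule_total:,} ({share:.3f}% of the crawl)")
print(f"without owl:sameAs: {without_same_as:,} ({share_without:.4f}% of the crawl)")
```

With these rounded inputs the total comes to about 565.2M statements, roughly 1.997% of the crawl, and about 6.3M statements (roughly 0.022%) once owl:sameAs is left out.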

Philosophical musings:
So unlike in traditional symbolic AI / KR / KBS / theorem proving thinking, the power of the LOD knowledge base does not come from a huge number of rules, but from the huge number of “boring” things it knows about the world: “Edinburgh is in Scotland”, “Frank got his PhD in Edinburgh”, “Alan Bundy was his PhD supervisor” (all of these are actually in the LOD cloud). The rules play the role of “yeast”: they are a very small fraction of the total size, but they allow the thing to grow and to derive interesting new things (e.g. Alan Bundy lives in Scotland, which is not explicitly asserted anywhere on the LOD cloud).
I actually think this also says something about human intelligence. Our ability to operate successfully in the physical and social world comes in large part from knowing an endless store of boring atomic facts, plus a little bit of reasoning with comparatively small rulesets (small in comparison to the number of atomic facts). But of course this last paragraph is just idle speculation.

The LOD-a-lot crawl is publicly available at http://lod-a-lot.lod.labs.vu.nl/

It was first published at ISWC 2017 and is online at https://epub.wu.ac.at/6484/
    and https://doi.org/10.1007/978-3-319-68204-4_7

One thought on “2% makes all the difference on the LOD cloud”

  1. That was fascinating, thanks. We should do the same with some enterprise graphs; it would be interesting to see if the ratios hold. I suspect they would, except maybe for sameAs: I don’t think we’re seeing as many, as most enterprises have managed to get to one or a few identifiers for most entities.


