Billion Triples Challenge 2009 Dataset

Description

Most of the dataset was crawled during February and March 2009, based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson, using the MultiCrawler/SWSE framework. To ensure wide coverage, we also included a (bounded) breadth-first crawl of depth 50 starting from http://www.w3.org/People/Berners-Lee/card.

The downloaded content was parsed using the Redland toolkit with the rdfxml, rss-tag-soup, and rdfa parsers. We rewrote blank node identifiers to include the data source, so that blank nodes are unique per data source, and appended the data source to each statement in the output file. The data is encoded in N-Quads format and split into chunks of 10 million statements each.
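
As an illustration of the format described above, the following is a minimal sketch of splitting one statement line into subject, predicate, object, and context (the data source). The regular expression is a simplification and an assumption, not the grammar used to produce the dataset: it expects exactly four terms per line and does not validate IRIs or escape sequences. The function name parse_quad is hypothetical.

```python
import re

# Simplified term patterns (assumption: IRIs contain no '>' and
# literals are double-quoted with backslash escapes).
IRI = r'<[^>]*>'
BNODE = r'_:\S+'
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'

QUAD = re.compile(
    rf'^\s*({IRI}|{BNODE})\s+({IRI})\s+({IRI}|{BNODE}|{LITERAL})\s+({IRI})\s*\.\s*$'
)

def parse_quad(line):
    """Return (subject, predicate, object, context), or None if the line does not match."""
    m = QUAD.match(line)
    return m.groups() if m else None
```

The fourth element returned is the context, which in this dataset identifies the data source a statement came from.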

Citation

Please cite as:
@misc{btc-2009,
  author = {Andreas Harth},
  title = {{Billion Triples Challenge} data set}, 
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2009/},
  year = 2009
}

Download

To fetch the content of the entire directory, download the 000-CONTENTS file and run:

$ wget -i 000-CONTENTS

Please note that the size of the combined dataset (gzipped) is around 17GB. A smaller crawl useful for testing is available at btc-2009-small.nq.gz.
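
Given the size of the download, it can be convenient to read a chunk directly from its gzipped form instead of unpacking it first. The sketch below counts the statements in one file, assuming one statement per line and UTF-8 text; the function name count_statements is hypothetical.

```python
import gzip

def count_statements(path):
    """Count non-empty lines (one statement per line) in a gzipped N-Quads file."""
    n = 0
    # "rt" decompresses on the fly and yields text lines, so the file
    # is never fully held in memory or unpacked on disk.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.strip():
                n += 1
    return n
```

For a quick check, this could be run against the smaller test crawl, for example count_statements("btc-2009-small.nq.gz").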

For questions about data format, server issues, or download problems, contact the BTC dataset coordinator at harth@kit.edu.

Enjoy!

History

2012-06-15
Completed move from vmlion25.deri.ie to KIT server
2009-06-19
Fixed minor unicode and blank node syntax issues
2009-04-16
Dataset posted

General statistics (~1.14 billion statements)

We calculated statistics over a 1% random sample of the dataset and scaled the derived numbers up accordingly. Although sampling introduces some error into the final numbers, we consider them a sufficiently good estimate of the overall characteristics of the dataset.
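
The sample-and-scale idea can be sketched as follows. This is not the NxParser statistics code; it is an illustration that assumes each statement is kept independently with probability equal to the sampling rate, and that statements arrive as (subject, predicate, object, context) tuples. The function name estimate_class_counts is hypothetical.

```python
import random
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def estimate_class_counts(quads, rate=0.01, seed=None):
    """Estimate class frequencies from a random sample of (s, p, o, c) tuples."""
    rng = random.Random(seed)
    sample = Counter()
    for s, p, o, c in quads:
        # Keep each statement with probability `rate`, then count
        # the classes named in sampled rdf:type statements.
        if rng.random() < rate and p == RDF_TYPE:
            sample[o] += 1
    # Scale the sample counts back up to the full dataset.
    scale = 1 / rate
    return {cls: round(n * scale) for cls, n in sample.items()}
```

Under that independence assumption, a class with true count N has a relative standard error of roughly sqrt((1 - rate) / (N * rate)). For a count around 38.8 million at a 1% rate this is about 0.16%, and for a count around 243,000 it is about 2%.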

The stats were created with the NxParser library statistics package by Jürgen Umbrich.

Top-50 classes and their frequency

To resolve the prefixes to their namespaces, please use the service provided by prefix.cc.

foaf:Person 38,844,500
sioct:WikiArticle 14,133,700
rss:item 13,650,000
foaf:OnlineAccount 12,188,800
geonames:Feature 7,550,100
akt:Person 5,628,300
owl:Class 2,701,500
akt:Article-Reference 2,605,700
foaf:Document 2,337,400
http://xmlns.com/wordnet/1.6/Person 1,855,500
foaf:chatEvent 1,833,700
akt:Affiliated-Person 1,287,200
http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument 1,118,900
foaf:PersonalProfileDocument 961,100
akt:Book-Section-Reference 762,500
rdf:Seq 704,900
skos:Concept 702,100
sioc:User 701,100
owl:AnnotationProperty 623,500
http://www.daml.org/2002/02/telephone/1/areacodes-ont#Exchange 616,700
sioct:Category 603,000
http://www.kisti.re.kr/isrl/ResearchRefOntology#TopicAreaOfAccomplishment 561,600
rss:channel 544,000
owl:ObjectProperty 517,300
vcard:VCard 495,900
http://purl.oclc.org/net/rss_2.0/enc#Enclosure 492,500
vcard:Name 459,400
mo:Track 458,800
http://www.kisti.re.kr/isrl/ResearchRefOntology#Location 456,600
owl:Ontology 452,000
akts:Time-Interval 447,400
owl:Restriction 441,100
akt:Project 379,500
owl:Thing 377,300
akt:Funding 369,200
http://www.geneontology.org/formats/oboInOwl#DbXref 362,200
akt:Amount-Of-Money 361,600
http://blogs.yandex.ru/schema/foaf/Posts 345,100
rdf:Statement 340,400
http://owl.mindswap.org/2003/ont/owlweb.rdf#ProgramData 338,600
http://www.kisti.re.kr/isrl/ResearchRefOntology#CreatorInfo 325,000
foaf:Agent 324,900
rdf:Property 303,800
mo:Record 296,000
akt:Publication-Reference 274,200
http://dbpedia.org/class/yago/LivingPeople 271,400
owl:FunctionalProperty 266,700
mo:MusicalArtist 260,300
lfm:ScrobbleEvent 255,800
http://www.rdfabout.com/rdf/schema/usbill/LegislativeAction 243,300

Top-50 RDF properties and their frequency

To resolve the prefixes to their namespaces, please use the service provided by prefix.cc.

http://dbpedia.org/property/wikilink 156,434,900
rdf:type 143,479,200
rdfs:seeAlso 53,852,300
foaf:knows 35,786,400
foaf:nick 32,979,500
foaf:weblog 23,239,200
dc:title 22,356,700
akt:has-author 19,541,900
sioc:links_to 19,228,400
skos:subject 18,280,600
foaf:interest 16,786,400
foaf:member_name 14,799,800
rss:link 14,357,800
foaf:holdsAccount 14,038,900
foaf:image 13,871,800
rss:title 13,524,600
rdfs:label 13,515,900
foaf:name 13,179,000
geonames:nearbyFeatures 13,128,700
dc:date 12,519,700
foaf:accountName 12,133,000
foaf:accountServiceHomepage 12,068,600
geonames:parentFeature 11,466,300
foaf:tagLine 10,677,500
rss:description 9,844,700
content:encoded 9,794,800
foaf:accountProfilePage 9,483,700
sioc:has_container 9,171,900
rdfs:comment 9,109,600
akt:cites-publication-reference 8,944,600
geonames:name 7,600,800
geo:lat 7,399,900
geo:long 7,341,000
http://dbpedia.org/property/wikiPageUsesTemplate 7,102,100
akt:full-name 7,100,700
dc:creator 6,987,900
geonames:featureClass 6,962,200
geonames:inCountry 6,827,100
geonames:locationMap 6,822,700
geonames:featureCode 6,822,300
owl:sameAs 6,539,300
http://dbpedia.org/property/redirect 6,451,500
foaf:homepage 6,427,100
http://dbpedia.org/property/abstract 5,750,400
foaf:img 5,562,800
http://purl.org/rss/1.0/modules/rss091#pubDate 4,814,700
foaf:page 4,653,200
dc:description 4,651,700
akt:has-title 4,310,500
akt:has-date 3,923,200

Top-50 PLDs of sources and their frequency in the dataset

"A pay-level domain (PLD) is any domain that requires payment at a TLD or cc-TLD registrar. PLDs are usually one level below the corresponding TLD (e.g., amazon.com), with certain exceptions for cc-TLDs (e.g., ebay.co.uk, det.wa.edu.au)" [IRLBot]
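
To make the definition concrete, here is a minimal sketch of mapping a source URL to its PLD. The suffix set is a hypothetical, deliberately tiny list that only covers the examples in the quotation; real use would need a complete registry suffix list (for instance the Public Suffix List). It is not necessarily how the figures below were computed, and the function name pay_level_domain is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete suffix list for illustration only.
SUFFIXES = {"com", "org", "uk", "co.uk", "au", "edu.au", "wa.edu.au"}

def pay_level_domain(url):
    """Return the registrable domain of `url` under the SUFFIXES list above."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Scanning from the left finds the longest known suffix first;
    # the PLD is that suffix plus one more label to its left.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host  # unknown suffix: fall back to the full host name
```

With this list, http://www.amazon.com/ maps to amazon.com, http://cgi.ebay.co.uk/ to ebay.co.uk, and http://www.det.wa.edu.au/ to det.wa.edu.au.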

dbpedia.org 403,490,100
livejournal.com 177,194,000
rkbexplorer.com 155,367,100
geonames.org 131,639,700
mybloglog.com 101,977,700
sioc-project.org 82,271,100
qdos.com 35,620,700
kanzaki.com 35,259,400
hi5.com 33,224,700
dbtune.org 25,373,800
rdfabout.com 23,738,600
plode.us 23,379,800
opiumfield.com 17,348,600
l3s.de 16,640,100
craigslist.org 14,673,100
vox.com 13,577,000
opencyc.org 13,026,000
freebase.com 12,810,000
friendfeed.com 12,403,200
fu-berlin.de 11,259,500
cpan.org 10,778,800
opera.com 9,312,800
identi.ca 9,124,900
talkdigger.com 8,861,500
daml.org 7,684,300
mindswap.org 7,379,200
deri.org 7,370,700
ya.ru 7,293,100
loadaveragezero.com 6,786,500
musicbrainz.org 6,548,600
apassant.net 6,254,200
uniprot.org 5,864,900
live.com 5,841,200
cyc.com 5,787,200
berkeleybop.org 4,772,300
blogspot.com 4,734,900
blogspace.com 4,510,500
deadjournal.com 4,466,100
rambler.ru 4,436,000
webry.info 3,818,700
linuxcnc.org 3,763,900
w3.org 3,642,000
hyphen.info 3,593,100
feedburner.com 3,467,000
bio2rdf.org 3,361,200
ontologycentral.com 3,281,700
zitgist.com 3,067,500
twitter.com 2,970,400
linkedmdb.org 2,778,000
photobucket.com 2,723,300