The major part of the dataset was crawled during February/March 2009 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson using the MultiCrawler/SWSE framework. To ensure wide coverage, we also included a (bounded) breadth-first crawl of depth 50 starting from http://www.w3.org/People/Berners-Lee/card.
The downloaded content was parsed using the Redland toolkit with the rdfxml, rss-tag-soup, and rdfa parsers. We rewrote blank node identifiers to include the data source, so that blank nodes are unique to each data source, and appended the data source to the output file. The data is encoded in N-Quads format and split into chunks of 10 million statements each.
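The blank node rewriting and chunking described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: the function names, the hash-based label scheme, and the choice to write the data source as the fourth element of each statement are all assumptions.

```python
import hashlib
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# A triple whose terms are already serialized in N-Triples syntax,
# e.g. ("_:a", "<http://xmlns.com/foaf/0.1/knows>", "_:b").
Triple = Tuple[str, str, str]


def scope_bnode(term: str, source: str) -> str:
    """Make a blank node label specific to its data source.

    Assumed scheme: prefix the label with a short hash of the source URI.
    IRIs and literals are returned unchanged.
    """
    if not term.startswith("_:"):
        return term
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:8]
    return f"_:b{digest}x{term[2:]}"


def to_quad(triple: Triple, source: str) -> str:
    """Serialize one statement with the data source as fourth element."""
    s, p, o = (scope_bnode(term, source) for term in triple)
    return f"{s} {p} {o} <{source}> ."


def chunks(statements: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` statements."""
    it = iter(statements)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

With this scheme, the label `_:a` from two different documents maps to two different blank nodes, so merging the documents into one file does not accidentally identify them. Calling `chunks(quads, 10_000_000)` would produce pieces of the size mentioned above.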
@misc{btc-2009,
  author = {Andreas Harth},
  title = {{Billion Triples Challenge} data set},
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2009/},
  year = 2009
}
To fetch the content of the entire directory, download the 000-CONTENTS file and run:
$ wget -i 000-CONTENTS
Please note that the combined dataset is around 17 GB in size (gzipped). A smaller crawl useful for testing is available at btc-2009-small.nq.gz.
For questions about the data format, server issues, or download problems, please contact the BTC dataset coordinator at harth@kit.edu.
Enjoy!
We calculated statistics over a 1% random sample of the dataset and scaled the derived numbers up accordingly. While sampling introduces some error into the final numbers, we consider them a sufficiently good estimate of the overall characteristics of the dataset.
The statistics were created with the NxParser library's statistics package by Jürgen Umbrich.
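As a rough illustration of the sample-and-scale approach, the sketch below estimates class counts from N-Quads lines. It is not the NxParser statistics package: the function name, the per-line Bernoulli sampling, and the whitespace-based line splitting are assumptions made for this example.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Optional

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def estimate_class_counts(
    lines: Iterable[str], rate: float = 0.01, seed: Optional[int] = None
) -> Dict[str, int]:
    """Estimate instances per class from a random sample of statements.

    Each line is kept with probability `rate`; sampled counts are then
    multiplied by 1/rate to estimate counts for the full dataset.
    """
    rng = random.Random(seed)
    sampled: Counter = Counter()
    for line in lines:
        if rng.random() >= rate:
            continue  # line is not part of the sample
        # Subject and predicate never contain whitespace, and the object
        # of an rdf:type statement is an IRI or blank node, so a plain
        # whitespace split is enough for this case.
        parts = line.split(None, 3)
        if len(parts) < 3 or parts[1] != RDF_TYPE:
            continue
        sampled[parts[2]] += 1
    scale = 1.0 / rate
    return {cls: round(count * scale) for cls, count in sampled.items()}
```

With `rate=0.01` every estimate is a multiple of 100, which is consistent with the counts listed in this document all ending in 00. Rare classes or properties may be missed entirely by such a sample, so small counts carry the largest relative error.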
To resolve the prefixes to their namespaces, please use the service provided by prefix.cc.
foaf:Person | 38,844,500 |
sioct:WikiArticle | 14,133,700 |
rss:item | 13,650,000 |
foaf:OnlineAccount | 12,188,800 |
geonames:Feature | 7,550,100 |
akt:Person | 5,628,300 |
owl:Class | 2,701,500 |
akt:Article-Reference | 2,605,700 |
foaf:Document | 2,337,400 |
http://xmlns.com/wordnet/1.6/Person | 1,855,500 |
foaf:chatEvent | 1,833,700 |
akt:Affiliated-Person | 1,287,200 |
http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument | 1,118,900 |
foaf:PersonalProfileDocument | 961,100 |
akt:Book-Section-Reference | 762,500 |
rdf:Seq | 704,900 |
skos:Concept | 702,100 |
sioc:User | 701,100 |
owl:AnnotationProperty | 623,500 |
http://www.daml.org/2002/02/telephone/1/areacodes-ont#Exchange | 616,700 |
sioct:Category | 603,000 |
http://www.kisti.re.kr/isrl/ResearchRefOntology#TopicAreaOfAccomplishment | 561,600 |
rss:channel | 544,000 |
owl:ObjectProperty | 517,300 |
vcard:VCard | 495,900 |
http://purl.oclc.org/net/rss_2.0/enc#Enclosure | 492,500 |
vcard:Name | 459,400 |
mo:Track | 458,800 |
http://www.kisti.re.kr/isrl/ResearchRefOntology#Location | 456,600 |
owl:Ontology | 452,000 |
akts:Time-Interval | 447,400 |
owl:Restriction | 441,100 |
akt:Project | 379,500 |
owl:Thing | 377,300 |
akt:Funding | 369,200 |
http://www.geneontology.org/formats/oboInOwl#DbXref | 362,200 |
akt:Amount-Of-Money | 361,600 |
http://blogs.yandex.ru/schema/foaf/Posts | 345,100 |
rdf:Statement | 340,400 |
http://owl.mindswap.org/2003/ont/owlweb.rdf#ProgramData | 338,600 |
http://www.kisti.re.kr/isrl/ResearchRefOntology#CreatorInfo | 325,000 |
foaf:Agent | 324,900 |
rdf:Property | 303,800 |
mo:Record | 296,000 |
akt:Publication-Reference | 274,200 |
http://dbpedia.org/class/yago/LivingPeople | 271,400 |
owl:FunctionalProperty | 266,700 |
mo:MusicalArtist | 260,300 |
lfm:ScrobbleEvent | 255,800 |
http://www.rdfabout.com/rdf/schema/usbill/LegislativeAction | 243,300 |
To resolve the prefixes to their namespaces, please use the service provided by prefix.cc.
http://dbpedia.org/property/wikilink | 156,434,900 |
rdf:type | 143,479,200 |
rdfs:seeAlso | 53,852,300 |
foaf:knows | 35,786,400 |
foaf:nick | 32,979,500 |
foaf:weblog | 23,239,200 |
dc:title | 22,356,700 |
akt:has-author | 19,541,900 |
sioc:links_to | 19,228,400 |
skos:subject | 18,280,600 |
foaf:interest | 16,786,400 |
foaf:member_name | 14,799,800 |
rss:link | 14,357,800 |
foaf:holdsAccount | 14,038,900 |
foaf:image | 13,871,800 |
rss:title | 13,524,600 |
rdfs:label | 13,515,900 |
foaf:name | 13,179,000 |
geonames:nearbyFeatures | 13,128,700 |
dc:date | 12,519,700 |
foaf:accountName | 12,133,000 |
foaf:accountServiceHomepage | 12,068,600 |
geonames:parentFeature | 11,466,300 |
foaf:tagLine | 10,677,500 |
rss:description | 9,844,700 |
content:encoded | 9,794,800 |
foaf:accountProfilePage | 9,483,700 |
sioc:has_container | 9,171,900 |
rdfs:comment | 9,109,600 |
akt:cites-publication-reference | 8,944,600 |
geonames:name | 7,600,800 |
geo:lat | 7,399,900 |
geo:long | 7,341,000 |
http://dbpedia.org/property/wikiPageUsesTemplate | 7,102,100 |
akt:full-name | 7,100,700 |
dc:creator | 6,987,900 |
geonames:featureClass | 6,962,200 |
geonames:inCountry | 6,827,100 |
geonames:locationMap | 6,822,700 |
geonames:featureCode | 6,822,300 |
owl:sameAs | 6,539,300 |
http://dbpedia.org/property/redirect | 6,451,500 |
foaf:homepage | 6,427,100 |
http://dbpedia.org/property/abstract | 5,750,400 |
foaf:img | 5,562,800 |
http://purl.org/rss/1.0/modules/rss091#pubDate | 4,814,700 |
foaf:page | 4,653,200 |
dc:description | 4,651,700 |
akt:has-title | 4,310,500 |
akt:has-date | 3,923,200 |
"A pay-level domain (PLD)is any domain that requires payment at a TLD or cc-TLD registrar. PLDs are usually one level below the corresponding TLD (e.g., amazon.com), with certain exceptions for cc-TLDs (e.g., ebay.co.uk, det.wa.edu.au)"[IRLBot]
dbpedia.org | 403,490,100 |
livejournal.com | 177,194,000 |
rkbexplorer.com | 155,367,100 |
geonames.org | 131,639,700 |
mybloglog.com | 101,977,700 |
sioc-project.org | 82,271,100 |
qdos.com | 35,620,700 |
kanzaki.com | 35,259,400 |
hi5.com | 33,224,700 |
dbtune.org | 25,373,800 |
rdfabout.com | 23,738,600 |
plode.us | 23,379,800 |
opiumfield.com | 17,348,600 |
l3s.de | 16,640,100 |
craigslist.org | 14,673,100 |
vox.com | 13,577,000 |
opencyc.org | 13,026,000 |
freebase.com | 12,810,000 |
friendfeed.com | 12,403,200 |
fu-berlin.de | 11,259,500 |
cpan.org | 10,778,800 |
opera.com | 9,312,800 |
identi.ca | 9,124,900 |
talkdigger.com | 8,861,500 |
daml.org | 7,684,300 |
mindswap.org | 7,379,200 |
deri.org | 7,370,700 |
ya.ru | 7,293,100 |
loadaveragezero.com | 6,786,500 |
musicbrainz.org | 6,548,600 |
apassant.net | 6,254,200 |
uniprot.org | 5,864,900 |
live.com | 5,841,200 |
cyc.com | 5,787,200 |
berkeleybop.org | 4,772,300 |
blogspot.com | 4,734,900 |
blogspace.com | 4,510,500 |
deadjournal.com | 4,466,100 |
rambler.ru | 4,436,000 |
webry.info | 3,818,700 |
linuxcnc.org | 3,763,900 |
w3.org | 3,642,000 |
hyphen.info | 3,593,100 |
feedburner.com | 3,467,000 |
bio2rdf.org | 3,361,200 |
ontologycentral.com | 3,281,700 |
zitgist.com | 3,067,500 |
twitter.com | 2,970,400 |
linkedmdb.org | 2,778,000 |
photobucket.com | 2,723,300 |
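The pay-level domain definition quoted above can be illustrated with a small sketch that maps a URI to its PLD. It is not the procedure used to compute the table: the function name is hypothetical, and the suffix set contains only the examples from the quote, whereas a real implementation would need a complete list of registrable suffixes.

```python
from typing import Optional
from urllib.parse import urlsplit

# Illustrative only: multi-label suffixes under which domains are
# registered. This tiny set covers just the examples from the quote.
MULTI_LABEL_SUFFIXES = {"co.uk", "wa.edu.au"}


def pay_level_domain(url: str) -> Optional[str]:
    """Return the pay-level domain of an absolute URL, or None.

    The PLD is taken to be the registrable suffix plus one more label.
    The suffix is the longest match in MULTI_LABEL_SUFFIXES, or else
    the last label of the host name. IP addresses are not handled.
    """
    host = urlsplit(url).hostname
    if not host:
        return None
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in MULTI_LABEL_SUFFIXES:
            break
    else:
        i = len(labels) - 1  # default: the last label is the TLD
    if i == 0:
        return None  # the host is a bare suffix or a single label
    return ".".join(labels[i - 1:])
```

For example, `pay_level_domain("http://dbpedia.org/resource/Berlin")` yields `dbpedia.org`, and a host such as `www.det.wa.edu.au` maps to `det.wa.edu.au`. Counting statements per PLD then amounts to tallying this value over the data source of each statement.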