The BTC 2012 dataset serves as the basis for submissions to the Billion Triples Track of the Semantic Web Challenge.
The dataset was crawled during May/June 2012. Unlike previous years (in which we started from random URI samples), we used several seed sets collected from multiple sources.
We rewrote blank node identifiers to include the data source in order to provide unique blank nodes for each data source, and appended the data source to the output file. The data is encoded in NQuads format.
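As an illustration of working with the fourth element of each quad, the following minimal sketch counts quads per data source in one of the gzipped files. It assumes every line carries four terms and that the last term before the closing "." is the data source written as `<...>`; the function names are hypothetical and not part of the dataset tooling.

```python
import gzip
from collections import Counter
from typing import Optional


def context_of(line: str) -> Optional[str]:
    """Return the last term (assumed to be the data source) of one quad line."""
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    body = line[:-1].rstrip()        # drop the terminating "."
    parts = body.rsplit(None, 1)     # split off the last whitespace-separated term
    if len(parts) != 2:
        return None
    term = parts[1]
    # Anything not written as <...> is treated as malformed and skipped.
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    return None


def quads_per_source(path: str) -> Counter:
    """Count quads per data source in a gzipped quad file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            ctx = context_of(line)
            if ctx is not None:
                counts[ctx] += 1
    return counts
```

Splitting from the right avoids having to parse literals, which may contain spaces. Malformed lines are skipped rather than raising, since the data is noisy.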
The individual crawls contain a file "redirects.nx.gz" which consists of "source target ." tuples derived from 302 and 303 redirect HTTP response codes. In addition, we provide "access.log.gz" files in Squid access.log format.
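The redirect tuples can be loaded into a lookup table and followed to their final target. The sketch below assumes each line has the form `<source> <target> .` with URIs in angle brackets; `parse_redirects` and `resolve` are illustrative names, and `max_hops` is an arbitrary safety limit.

```python
def parse_redirects(lines):
    """Build a source -> target map from 'source target .' lines."""
    redirects = {}
    for line in lines:
        parts = line.split()
        # Expect exactly three tokens: source, target and the closing "."
        if len(parts) != 3 or parts[2] != ".":
            continue
        source, target = (p.strip("<>") for p in parts[:2])
        redirects[source] = target
    return redirects


def resolve(uri, redirects, max_hops=10):
    """Follow redirects from uri, stopping on a loop or after max_hops."""
    seen = {uri}
    for _ in range(max_hops):
        nxt = redirects.get(uri)
        if nxt is None or nxt in seen:
            break
        seen.add(nxt)
        uri = nxt
    return uri
```

To read an actual file, the lines could be supplied via `gzip.open(path, "rt")`.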
Please note that the BTC dataset is collected from the web and as such is of varying quality. Dealing with noisy data is part of the fun you'll have when working with web data.
@misc{btc-2012,
  author = {Andreas Harth},
  title = {{Billion Triples Challenge} data set},
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2012/},
  year = 2012,
}
We started the crawls with seed sets collected from several sources. DBpedia and Freebase URIs were crawled separately and thus excluded from the Datahub, Rest and Timbl datasets.
We performed the crawling in rounds.
For each round we provide data-{round}.nq.gz, redirects-{round}.nx.gz and access-{round}.log.gz files.
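The naming pattern can be expanded programmatically. This small sketch assumes that the redirects and access files sit in the same crawl directory as the data files (the table only lists the data files); `round_files` is a hypothetical helper.

```python
def round_files(crawl: str, round_no: int) -> dict:
    """Return the three per-round file names for a crawl directory."""
    return {
        "data": f"{crawl}/data-{round_no}.nq.gz",
        "redirects": f"{crawl}/redirects-{round_no}.nx.gz",  # location assumed
        "access": f"{crawl}/access-{round_no}.log.gz",        # location assumed
    }
```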
File | Quads | Size (gz) |
---|---|---|
datahub/data-0.nq.gz | 45595 | 450K |
datahub/data-1.nq.gz | 804375 | 7.5M |
datahub/data-2.nq.gz | 19655239 | 165M |
datahub/data-3.nq.gz | 80596583 | 1010M |
datahub/data-4.nq.gz | 808977190 | 7.1G |
dbpedia/data-0.nq.gz | 198090024 | 4.5G |
freebase/data-0.nq.gz | 101241556 | 981M |
rest/data-0.nq.gz | 1967224 | 32M |
rest/data-1.nq.gz | 6617276 | 74M |
rest/data-2.nq.gz | 13743742 | 164M |
timbl/data-0.nq.gz | 89 | 2.5K |
timbl/data-1.nq.gz | 16516 | 293K |
timbl/data-2.nq.gz | 87250 | 1.2M |
timbl/data-3.nq.gz | 388412 | 5.1M |
timbl/data-4.nq.gz | 9405528 | 113M |
timbl/data-5.nq.gz | 93898523 | 1017M |
timbl/data-6.nq.gz | 101010423 | 1.2G |
Total | 1436545545 | 17G |
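The quad counts in the table can be cross-checked with a few lines of Python. The numbers below are copied from the table; only the grouping by crawl is added.

```python
# Quad counts per round, copied from the table above.
QUADS = {
    "datahub": [45595, 804375, 19655239, 80596583, 808977190],
    "dbpedia": [198090024],
    "freebase": [101241556],
    "rest": [1967224, 6617276, 13743742],
    "timbl": [89, 16516, 87250, 388412, 9405528, 93898523, 101010423],
}

# Subtotal per crawl and overall total.
per_crawl = {name: sum(rounds) for name, rounds in QUADS.items()}
total = sum(per_crawl.values())

for name, count in per_crawl.items():
    print(f"{name:10s} {count:>13,d}")
print(f"{'total':10s} {total:>13,d}")
```

The computed total agrees with the Total row (1436545545 quads).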
The seed set for the Datahub crawl contained all example URIs marked example/* where the "*" is an RDF serialisation (thanks to Pablo Mendes for providing the URIs).
The crawl was breadth-first with hop 4 expansion. You can find the Datahub files at datahub/.
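To make "breadth-first with hop N expansion" concrete, here is a minimal sketch of a hop-limited breadth-first traversal. It is not the crawler used for BTC 2012; `crawl_rounds` and `get_links` are hypothetical, and the idea that round k corresponds to the URIs first reached at hop k is an assumption suggested by the file numbering.

```python
def crawl_rounds(seeds, get_links, max_hops):
    """Return a list of rounds; round k holds the URIs first reached at hop k.

    get_links(uri) is a caller-supplied function returning the URIs linked
    from uri (in a real crawler this would dereference the URI).
    """
    seen = set(seeds)
    frontier = sorted(seen)          # round 0 is the seed set
    rounds = [frontier]
    for _ in range(max_hops):
        next_frontier = []
        for uri in frontier:
            for link in get_links(uri):
                if link not in seen:  # visit each URI at most once
                    seen.add(link)
                    next_frontier.append(link)
        if not next_frontier:
            break
        rounds.append(next_frontier)
        frontier = next_frontier
    return rounds


# Toy link graph standing in for the web.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
rounds = crawl_rounds(["a"], lambda u: GRAPH.get(u, []), max_hops=2)
```

With a hop limit of 2 the traversal stops after the second expansion, so "e" is never visited in the toy graph.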
The seed set for the DBpedia crawl contained all DBpedia URIs from the DBpedia 3.7 dump.
No links were expanded. You can find the DBpedia files at dbpedia/.
The seed set for the Freebase crawl contained all Freebase URIs involved in an owl:sameAs relation in the DBpedia 3.7 dump.
No links were expanded.
You can find the Freebase files at freebase/.
Note that due to call limits some lookups resulted in a 403 Forbidden response.
The seed set for the Rest crawl contained all other URIs involved in an owl:sameAs relation in the DBpedia 3.7 dump.
The crawl was breadth-first with hop 2 expansion. You can find the Rest files at rest/.
The seed set for the Timbl crawl consisted of Tim Berners-Lee's FOAF file (http://www.w3.org/People/Berners-Lee/card.rdf).
The crawl was breadth-first with hop 6 expansion. You can find the Timbl files at timbl/.
To fetch the content of the entire directory, download the 000-CONTENTS file and run
$ wget -x -nH -i 000-CONTENTS
which will download the files while preserving the directory structure.
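For environments without wget, a rough Python equivalent is sketched below. It assumes 000-CONTENTS lists one absolute URL per line, and it mirrors the `-x -nH` behaviour by using the URL path (without the host) as the local path. `local_path` and `fetch_all` are illustrative names.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def local_path(url: str) -> str:
    """Local path corresponding to the URL path without the host name."""
    return urlsplit(url).path.lstrip("/")


def fetch_all(contents_file: str) -> None:
    """Download every URL listed in contents_file, recreating directories."""
    with open(contents_file, encoding="utf-8") as fh:
        for url in (line.strip() for line in fh):
            if not url:
                continue
            dest = local_path(url)
            os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
            urlretrieve(url, dest)
```

Calling `fetch_all("000-CONTENTS")` would then download the listed files one by one; unlike wget, this sketch does not retry or resume interrupted transfers.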
For questions about data format, server issues or download problems contact harth@kit.edu.
We acknowledge the support of the Steinbuch Centre for Computing (SCC) and the European Community's Seventh Framework Programme FP7/2007-2013 (PlanetData, Grant 257641).