Billion Triples Challenge 2012 Dataset

The BTC 2012 dataset serves as the basis for submissions to the Billion Triples Track of the Semantic Web Challenge.

Description

The dataset was crawled during May/June 2012. Unlike previous years (in which we started from random URI samples), we used several seed sets collected from multiple sources.

We rewrote blank node identifiers to include the data source in order to provide unique blank nodes for each data source, and appended the data source to the output file. The data is encoded in N-Quads format.
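As a rough illustration of working with the format, the sketch below reads a gzipped N-Quads file line by line and counts quads per data source. It takes the last whitespace-separated term before the terminating dot to be the data source (context). This shortcut assumes that every line carries four terms and that the context is an IRI in angle brackets; it is not a full N-Quads parser, and the function names are chosen for illustration only.

```python
import gzip
from collections import Counter


def context_of(line):
    """Return the context (fourth term) of one N-Quads line, or None."""
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    body = line[:-1].rstrip()        # drop the terminating dot
    parts = body.rsplit(None, 1)     # the last token is taken as the context
    if len(parts) != 2:
        return None
    context = parts[1]
    # Assumption: contexts are IRIs written as <...>.
    if context.startswith("<") and context.endswith(">"):
        return context[1:-1]
    return None


def count_quads_per_context(path):
    """Count quads per context in a gzipped N-Quads file."""
    counts = Counter()
    # errors="replace" keeps the loop going over badly encoded lines.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            context = context_of(line)
            if context is not None:
                counts[context] += 1
    return counts
```

Because only the end of each line is inspected, literals containing spaces in the object position do not disturb the result.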

The individual crawls contain a file "redirects.nx.gz" which consists of "source target ." tuples derived from 302 and 303 redirect HTTP response codes. In addition, we provide "access.log.gz" files in Squid access.log format.
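A minimal sketch of how the redirect tuples might be used: it loads the "source target ." lines into a dictionary and follows redirect chains to a final URI. It assumes that both terms are IRIs without whitespace, optionally wrapped in angle brackets; the function names and the max_steps limit are illustrative choices, not part of the dataset.

```python
import gzip


def parse_redirect(line):
    """Parse one 'source target .' line into a (source, target) pair."""
    tokens = line.split()
    if len(tokens) != 3 or tokens[2] != ".":
        return None
    # Assumption: terms may be wrapped in <...>; strip the brackets if so.
    source, target = (
        t[1:-1] if t.startswith("<") and t.endswith(">") else t
        for t in tokens[:2]
    )
    return source, target


def load_redirects(path):
    """Read a gzipped redirects file into a source -> target dictionary."""
    redirects = {}
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            pair = parse_redirect(line)
            if pair is not None:
                redirects[pair[0]] = pair[1]
    return redirects


def resolve(uri, redirects, max_steps=10):
    """Follow redirects from uri, stopping on cycles or after max_steps."""
    seen = {uri}
    for _ in range(max_steps):
        target = redirects.get(uri)
        if target is None or target in seen:
            break
        seen.add(target)
        uri = target
    return uri
```

Resolving a URI this way gives the document that was finally retrieved, which can help when matching a requested URI against the data source recorded in the data files.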

Please note that the BTC dataset is collected from the web and as such is of varying quality. Dealing with noisy data is part of the fun you'll have when working with web data.

Citation

Please cite as:

@misc{btc-2012,
  author = {Andreas Harth},
  title = {{Billion Triples Challenge} data set},
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2012/},
  year = 2012,
}

Crawls

We started the crawls with seed sets collected from several sources. DBpedia and Freebase URIs were crawled separately and thus excluded from the Datahub, Rest and Timbl datasets.

We performed the crawling in rounds. For each round we provide data-{round}.nq.gz, redirects-{round}.nx.gz and access-{round}.log.gz files.
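Crawling in rounds can be pictured as a breadth-first expansion in which round k holds the URIs first reached after k hops from the seed set. This reading is an assumption: it matches the number of data files per crawl (for instance data-0 through data-4 for the hop 4 Datahub crawl) but is not a description of the actual crawler. In the sketch, get_links is a hypothetical callback that returns the URIs linked from a document.

```python
def crawl_in_rounds(seeds, get_links, max_hops):
    """Breadth-first expansion; returns one list of URIs per round."""
    seen = set(seeds)
    frontier = list(dict.fromkeys(seeds))  # de-duplicate, keep seed order
    rounds = []
    for hop in range(max_hops + 1):
        if not frontier:
            break
        rounds.append(frontier)
        if hop == max_hops:
            break  # links found in the last round are not expanded
        next_frontier = []
        for uri in frontier:
            for link in get_links(uri):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return rounds
```

With max_hops set to 0 the function returns only the seed round, which corresponds to a crawl in which no links are expanded.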

File	Quads	Size (gz)
datahub/data-0.nq.gz	45595	450K
datahub/data-1.nq.gz	804375	7.5M
datahub/data-2.nq.gz	19655239	165M
datahub/data-3.nq.gz	80596583	1010M
datahub/data-4.nq.gz	808977190	7.1G
dbpedia/data-0.nq.gz	198090024	4.5G
freebase/data-0.nq.gz	101241556	981M
rest/data-0.nq.gz	1967224	32M
rest/data-1.nq.gz	6617276	74M
rest/data-2.nq.gz	13743742	164M
timbl/data-0.nq.gz	89	2.5K
timbl/data-1.nq.gz	16516	293K
timbl/data-2.nq.gz	87250	1.2M
timbl/data-3.nq.gz	388412	5.1M
timbl/data-4.nq.gz	9405528	113M
timbl/data-5.nq.gz	93898523	1017M
timbl/data-6.nq.gz	101010423	1.2G
Total	1436545545	17G
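The per-crawl subtotals are not listed in the table, but they follow directly from the quad counts. The short sketch below recomputes them and checks them against the stated total; the numbers are copied from the table and the function name is illustrative.

```python
from collections import defaultdict

# Quad counts copied from the table above.
QUADS = {
    "datahub/data-0.nq.gz": 45595,
    "datahub/data-1.nq.gz": 804375,
    "datahub/data-2.nq.gz": 19655239,
    "datahub/data-3.nq.gz": 80596583,
    "datahub/data-4.nq.gz": 808977190,
    "dbpedia/data-0.nq.gz": 198090024,
    "freebase/data-0.nq.gz": 101241556,
    "rest/data-0.nq.gz": 1967224,
    "rest/data-1.nq.gz": 6617276,
    "rest/data-2.nq.gz": 13743742,
    "timbl/data-0.nq.gz": 89,
    "timbl/data-1.nq.gz": 16516,
    "timbl/data-2.nq.gz": 87250,
    "timbl/data-3.nq.gz": 388412,
    "timbl/data-4.nq.gz": 9405528,
    "timbl/data-5.nq.gz": 93898523,
    "timbl/data-6.nq.gz": 101010423,
}


def subtotals(quads):
    """Sum quad counts per crawl, using the directory name as the key."""
    totals = defaultdict(int)
    for name, count in quads.items():
        crawl = name.split("/", 1)[0]
        totals[crawl] += count
    return dict(totals)


if __name__ == "__main__":
    for crawl, count in subtotals(QUADS).items():
        print(f"{crawl}\t{count}")
    print(f"Total\t{sum(QUADS.values())}")
```

The subtotals come to 910078982 quads for Datahub, 198090024 for DBpedia, 101241556 for Freebase, 22328242 for Rest and 204806741 for Timbl, which add up to the 1436545545 quads given in the Total row.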

Datahub

The seed set for the Datahub crawl contained all example URIs marked example/* where the "*" is an RDF serialisation (thanks to Pablo Mendes for providing the URIs).

The crawl was breadth-first with hop 4 expansion. You can find the Datahub files at datahub/.

DBpedia

The seed set for the DBpedia crawl contained all DBpedia URIs from the DBpedia 3.7 dump.

No links were expanded. You can find the DBpedia files at dbpedia/.

Freebase

The seed set for the Freebase crawl contained all Freebase URIs involved in an owl:sameAs relation in the DBpedia 3.7 dump.

No links were expanded. You can find the Freebase files at freebase/. Note that, due to call limits, some lookups resulted in a 403 Forbidden response.

Rest

The seed set for the Rest crawl contained all other URIs involved in an owl:sameAs relation in the DBpedia 3.7 dump.

The crawl was breadth-first with hop 2 expansion. You can find the Rest files at rest/.

Timbl

The seed set for the Timbl crawl consisted of Tim Berners-Lee's FOAF file (http://www.w3.org/People/Berners-Lee/card.rdf).

The crawl was breadth-first with hop 6 expansion. You can find the Timbl files at timbl/.

Download

To fetch the content of the entire directory, download the 000-CONTENTS file and run

  $ wget -x -nH -i 000-CONTENTS

which will download the files while preserving the directory structure.
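For environments without wget, the same idea can be sketched in Python. The sketch assumes that 000-CONTENTS lists one URL per line (the input format that wget -i expects) and that the URLs carry no query strings. The helper names and the example URL in the comment are illustrative; the actual file URLs should be taken from 000-CONTENTS.

```python
import os
import urllib.request
from urllib.parse import urlparse


def read_contents(lines):
    """Return the non-empty, stripped lines of a 000-CONTENTS listing."""
    return [line.strip() for line in lines if line.strip()]


def local_path(url):
    """Relative path for a URL: the URL path without the host name,
    similar to the layout produced by `wget -x -nH` for simple URLs."""
    # Hypothetical example: http://host/a/b/file.gz -> a/b/file.gz
    return urlparse(url).path.lstrip("/")


def download_all(urls, root="."):
    """Download each URL below root, creating directories as needed."""
    for url in urls:
        target = os.path.join(root, *local_path(url).split("/"))
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, target)


if __name__ == "__main__":
    with open("000-CONTENTS", encoding="utf-8") as handle:
        download_all(read_contents(handle))
```

Unlike wget, this sketch does not resume interrupted transfers or retry failed requests, so for files of several gigabytes the wget command is likely the more robust choice.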

Contact

For questions about data format, server issues or download problems contact harth@kit.edu.

Previous BTC Datasets

Acknowledgement

We acknowledge the support of the Steinbuch Centre for Computing (SCC) and the European Community's Seventh Framework Programme FP7/2007-2013 (PlanetData, Grant 257641).

History

2012-11-08: Jesse Weaver's blog post on the validation of the BTC 2012 dataset
2012-11-07: Gunnar Grimnes' excellent visualisation of links and some other BTC 2012 stats
2012-09-02: Second attempt at fixing bnodes, which now include a unique context string
2012-08-24: Fixed bnode syntax issues, character encoding issues
2012-07-01: Dataset posted

Andreas Harth