Billion Triples Challenge 2011 Dataset

Description

The dataset was crawled during May/June 2011 using a random sample of URIs from the BTC 2010 dataset as seed URIs.

We rewrote blank node identifiers to include the data source, so that blank nodes are unique per data source, and appended the data source to the output file. The data is encoded in N-Quads format and split into chunks of 10 million statements each.
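
As a rough illustration of working with one chunk, the following Python sketch counts statements per data source. It assumes that the data source is the fourth (context) term of each statement and is always an IRI in angle brackets. It is a minimal line-based approach, not a full N-Quads parser, and the function names are made up for this example.

```python
import gzip
from collections import Counter


def context_of(line):
    """Return the fourth (context) term of an N-Quads line, or None.

    Assumption: every statement ends with an IRI context followed by " .".
    """
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    body = line[:-1].rstrip()
    # IRIs cannot contain whitespace, so the context is the last token.
    parts = body.rsplit(None, 1)
    if len(parts) != 2:
        return None
    context = parts[1]
    if context.startswith("<") and context.endswith(">"):
        return context[1:-1]
    return None


def count_by_source(lines):
    """Count statements per data source over an iterable of lines."""
    counts = Counter()
    for line in lines:
        context = context_of(line)
        if context is not None:
            counts[context] += 1
    return counts


def count_chunk(path):
    """Count statements per data source in one gzipped chunk file."""
    # errors="replace" keeps the scan going over badly encoded lines.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_by_source(handle)
```

Calling count_chunk on a downloaded chunk and then most_common(10) on the result lists the ten largest data sources in that chunk. Lines that do not fit the assumed shape are skipped rather than reported.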

The BTC dataset also contains a file "redirects.nx.gz", which consists of "source target ." tuples derived from HTTP 302 and 303 redirect responses.
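
The redirects file can be used to map a URI to the location it finally redirects to. The Python sketch below assumes each line holds exactly two whitespace-separated terms followed by a dot, with the URIs optionally wrapped in angle brackets. The function names and the hop limit are choices made for this example only.

```python
import gzip


def parse_redirect(line):
    """Parse one "source target ." line into a (source, target) pair."""
    tokens = line.split()
    if len(tokens) != 3 or tokens[2] != ".":
        return None

    def strip_brackets(term):
        # Assumption: URIs may be written as <uri>; drop the brackets.
        if term.startswith("<") and term.endswith(">"):
            return term[1:-1]
        return term

    return strip_brackets(tokens[0]), strip_brackets(tokens[1])


def load_redirects(lines):
    """Build a source -> target dictionary from an iterable of lines."""
    redirects = {}
    for line in lines:
        pair = parse_redirect(line)
        if pair is not None:
            redirects[pair[0]] = pair[1]
    return redirects


def resolve(uri, redirects, max_hops=10):
    """Follow redirects from uri, stopping at cycles or after max_hops."""
    seen = {uri}
    for _ in range(max_hops):
        target = redirects.get(uri)
        if target is None or target in seen:
            return uri
        seen.add(target)
        uri = target
    return uri


def load_redirect_file(path="redirects.nx.gz"):
    """Load the gzipped redirects file into a dictionary."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return load_redirects(handle)
```

If a source appears more than once, the last target seen wins. Whether that is the right policy depends on the use case.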

Please note that the BTC dataset is collected from the web and as such is of varying quality. Dealing with noisy data is part of the fun you'll have when working with web data.

Citation

Please cite as:

@misc{btc-2011,
  author = {Andreas Harth},
  title = {{Billion Triples Challenge} data set},
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2011/},
  year = 2011,
}

Download Dataset

To fetch the content of the entire directory, download the 000-CONTENTS file and run:

$ wget -i 000-CONTENTS

Please note that the size of the combined dataset (gzipped) is around 20 GB. The entire dataset consists of approximately 2 billion statements.

We also provide an access.log file in Squid access.log format.
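
For a first look at the log, the Python sketch below tallies HTTP status codes. It assumes the file uses Squid's native log layout (timestamp, elapsed milliseconds, client, result/status, bytes, method, URL, followed by further fields); if the file was written with a custom log format, the field positions will differ. The function names are made up for this example.

```python
from collections import Counter


def parse_squid_line(line):
    """Parse one line in Squid's native access.log layout into a dict.

    Returns None for lines that do not fit the assumed layout.
    """
    fields = line.split()
    if len(fields) < 7:
        return None
    # The fourth field looks like "TCP_MISS/200": cache result / HTTP status.
    result, _, status = fields[3].partition("/")
    try:
        return {
            "timestamp": float(fields[0]),
            "elapsed_ms": int(fields[1]),
            "client": fields[2],
            "result": result,
            "status": status,
            "bytes": int(fields[4]),
            "method": fields[5],
            "url": fields[6],
        }
    except ValueError:
        return None


def status_histogram(lines):
    """Count HTTP status codes over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        entry = parse_squid_line(line)
        if entry is not None:
            counts[entry["status"]] += 1
    return counts
```

Passing an open file handle for access.log to status_histogram gives a quick view of how many lookups succeeded, redirected, or failed during the crawl.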

Contact

For questions about the data format, server issues, or download problems, contact harth@kit.edu.

Previous BTC Datasets

Acknowledgement

We acknowledge the support of the European Community's Seventh Framework Programme FP7/2007-2013 (PlanetData, Grant 257641).

History

2011-06-16: Dataset posted

Andreas Harth