Billion Triples Challenge 2010 Dataset

Description

The major part of the dataset was crawled during March/April 2010 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson using the MultiCrawler/SWSE framework. We also included partial data from data.gov and data.gov.uk.

The downloaded content was parsed using the Redland toolkit with the rdfxml parser. We rewrote blank node identifiers to include the data source, so that blank nodes are unique to each data source, and appended the data source to the output file. The data is encoded in N-Quads format and split into chunks of 10 million statements each.
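As a rough illustration of working with this format, the following Python sketch streams statements from one gzipped chunk. It assumes that each line holds four terms followed by a period and that the fourth term carries the data source. The regular expression is a deliberately simplified approximation of the N-Quads grammar, not a full parser, and the names parse_quad, iter_quads, and iter_chunk are hypothetical helpers, not part of any toolkit mentioned above.

```python
import gzip
import re

# Simplified term pattern: IRI, blank node, or literal with an optional
# language tag or datatype. This is an approximation, not the full grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, context), or None if unparsable."""
    match = QUAD.match(line)
    return match.groups() if match else None


def iter_quads(lines):
    """Yield quads from an iterable of text lines, skipping noisy lines."""
    for line in lines:
        if not line.strip():
            continue
        quad = parse_quad(line)
        if quad is not None:
            yield quad


def iter_chunk(path):
    """Stream quads from one gzipped chunk; path is a local file name."""
    # errors='replace' keeps the stream going on invalid byte sequences.
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as handle:
        yield from iter_quads(handle)
```

Processing line by line keeps memory use flat regardless of chunk size, and silently skipping unparsable lines is one simple way to cope with noisy input; counting or logging the skipped lines may be preferable in practice.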

The BTC dataset also contains a file "redirects.nx.gz" which consists of "source target ." tuples derived from 302 and 303 redirect HTTP response codes.
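One possible use of the redirects file is to map a URI to the location it finally redirects to. The Python sketch below assumes that each line consists of exactly three whitespace-separated tokens (source, target, and a closing period) and that the URIs may be wrapped in angle brackets; both points are assumptions about the file layout. The function names and the max_hops limit are hypothetical.

```python
import gzip


def parse_redirect(line):
    """Parse one 'source target .' line into (source, target), or None."""
    parts = line.split()
    if len(parts) != 3 or parts[2] != '.':
        return None
    # Strip angle brackets if the URIs are written as <...> (an assumption).
    return tuple(
        p[1:-1] if p.startswith('<') and p.endswith('>') else p
        for p in parts[:2]
    )


def load_redirects(lines):
    """Build a source -> target dictionary from an iterable of lines."""
    redirects = {}
    for line in lines:
        pair = parse_redirect(line)
        if pair:
            redirects[pair[0]] = pair[1]
    return redirects


def resolve(uri, redirects, max_hops=10):
    """Follow redirects from uri; stop at a dead end, a cycle, or max_hops."""
    seen = {uri}
    for _ in range(max_hops):
        target = redirects.get(uri)
        if target is None or target in seen:
            break
        seen.add(target)
        uri = target
    return uri


def load_redirects_file(path='redirects.nx.gz'):
    """Load the redirect map from the gzipped file."""
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as handle:
        return load_redirects(handle)
```

For example, resolve(uri, load_redirects_file()) returns the last URI in the redirect chain that starts at uri, or uri itself if the file has no entry for it.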

Please note that the BTC dataset is collected from the web and, as such, is of varying quality. Dealing with noisy data is part of the fun you'll have when using web data.

The entire dataset consists of ~3.2bn statements.

Citation

Please cite as:

@misc{btc-2010,
  author = {Andreas Harth},
  title = {{Billion Triples Challenge} data set},
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2010/},
  year = 2010,
}

Download

To fetch the content of the entire directory, download the 000-CONTENTS file and run:

  $ wget -i 000-CONTENTS

Please note that the size of the combined dataset (gzipped) is around 27GB.
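Where wget is not available, the same download can be approximated with a short Python script. This is only a sketch: it assumes that 000-CONTENTS lists one absolute URL per line, and the function names are hypothetical. Each file is written to a temporary ".part" name first, so an interrupted transfer is not mistaken for a complete file on the next run.

```python
import os
import urllib.request
from urllib.parse import urlparse


def read_url_list(lines):
    """Return the non-empty entries of a URL list, stripped of whitespace."""
    return [entry for entry in (line.strip() for line in lines) if entry]


def local_name(url):
    """Derive a local file name from the last path segment of the URL."""
    return os.path.basename(urlparse(url).path)


def fetch_all(contents_path='000-CONTENTS', dest='.'):
    """Download every listed file into dest, skipping files already there."""
    with open(contents_path, encoding='utf-8') as handle:
        urls = read_url_list(handle)
    for url in urls:
        target = os.path.join(dest, local_name(url))
        if os.path.exists(target):
            continue  # finished in an earlier run
        partial = target + '.part'
        urllib.request.urlretrieve(url, partial)
        os.replace(partial, target)  # rename only after a full download
```

Calling fetch_all() in the directory that holds 000-CONTENTS downloads the files one after another; given the combined size, make sure enough disk space is available first.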

For questions about the data format, server issues, or download problems, contact the BTC dataset coordinator at harth@kit.edu.

Enjoy!

Previous BTC Datasets

BTC 2009

History

2010-12-10: Check out Gunnar Aastrand Grimnes' excellent analysis of the contents of the dataset.
2010-05-30: Syntax issues (invalid blank node IDs, unparsable literals) fixed; redirects.nx.gz added.
2010-05-19: Dataset posted