The major part of the dataset was crawled during March/April 2010 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson using the MultiCrawler/SWSE framework. We also included partial data from data.gov and data.gov.uk.
The downloaded content was parsed using the Redland toolkit with the rdfxml parser.
We rewrote blank node identifiers to include the data source in order to provide unique blank nodes for each data source, and appended the data source to the output file.
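Why the rewriting is needed can be shown with a small sketch: two documents may both use the label _:b1 for unrelated nodes, so the label alone is ambiguous once the data is merged. The scheme below (a hash of the source URI prefixed to the original label) and the function name rewrite_bnode are assumptions made for illustration only; the encoding actually used in the BTC files may differ.

```python
import hashlib

def rewrite_bnode(label, source):
    """Return a blank node label that is unique per (source, label) pair.

    Hypothetical scheme, not necessarily the one used in the dataset:
    a short hash of the source URI is prefixed to the original label.
    """
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:16]
    # Drop the leading "_:" of the original label if it is present.
    local = label[2:] if label.startswith("_:") else label
    return "_:b%sx%s" % (digest, local)

# The same label from two different sources yields two different nodes.
print(rewrite_bnode("_:b1", "http://example.org/a.rdf"))
print(rewrite_bnode("_:b1", "http://example.org/b.rdf"))
```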
The data is encoded in N-Quads format and split into chunks of 10 million statements each.
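A chunk can be read line by line without loading it into memory. The following is a minimal sketch, assuming each line holds one statement of the form "subject predicate object context ." and that the chunk is gzipped; the function names and the file name in the usage note are hypothetical, and the regular expression is a simplification of the full N-Quads grammar. Malformed lines are skipped rather than treated as fatal, since the data is crawled from the web.

```python
import gzip
import re

# One term: an IRI in <...>, a blank node _:label, or a quoted literal
# with an optional language tag or datatype IRI.
TERM = re.compile(
    r'\s*(<[^>]*>|_:[^\s]+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
)

def parse_quad(line):
    """Split one line into [subject, predicate, object] plus an optional context."""
    terms, pos = [], 0
    for _ in range(4):
        m = TERM.match(line, pos)
        if m is None:
            break
        terms.append(m.group(1))
        pos = m.end()
    # A statement needs at least three terms and must end with a single dot.
    if len(terms) < 3 or line[pos:].strip() != ".":
        raise ValueError("malformed statement: %r" % line)
    return terms

def read_quads(path):
    """Yield parsed statements from a gzipped chunk, skipping malformed lines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.strip():
                continue
            try:
                yield parse_quad(line)
            except ValueError:
                continue  # noisy web data: ignore lines that do not parse
```

For example, `sum(1 for _ in read_quads("chunk.gz"))` counts the parseable statements in a chunk, where chunk.gz stands for any downloaded chunk file.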
The BTC dataset also contains a file "redirects.nx.gz" which consists of "source target ." tuples derived from 302 and 303 redirect HTTP response codes.
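The redirect file can be loaded into a lookup table to map a URI to the document it finally resolves to. This is a sketch under assumptions: each line is taken to be two whitespace-separated terms followed by a dot, the terms may or may not be wrapped in angle brackets, and the helper names are hypothetical. Chains of redirects are followed with a hop limit and cycle check, because crawled redirect data can contain loops.

```python
import gzip

def parse_redirect(line):
    """Return (source, target) from one 'source target .' line, or None."""
    parts = line.split()
    if len(parts) != 3 or parts[2] != ".":
        return None
    # Assumption: terms may be wrapped in <...>; strip the brackets if so.
    return tuple(
        p[1:-1] if p.startswith("<") and p.endswith(">") else p
        for p in parts[:2]
    )

def load_redirects(path):
    """Read a gzipped redirect file into a dict mapping source to target."""
    redirects = {}
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            pair = parse_redirect(line)
            if pair is not None:
                redirects[pair[0]] = pair[1]
    return redirects

def resolve(uri, redirects, max_hops=10):
    """Follow redirects from uri; stop at a cycle or after max_hops."""
    seen = {uri}
    for _ in range(max_hops):
        target = redirects.get(uri)
        if target is None or target in seen:
            break
        seen.add(target)
        uri = target
    return uri
```

Typical use would be `redirects = load_redirects("redirects.nx.gz")` followed by `resolve(some_uri, redirects)` for each URI of interest.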
Please note that the BTC dataset is collected from the web and as such is of varying quality. Dealing with noisy data is part of the fun you'll have when using web data.
The entire dataset consists of about 3.2 billion statements.
@misc{btc-2010,
  author = {Andreas Harth},
  title = {{Billion Triples Challenge} data set},
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2010/},
  year = 2010,
}
To fetch the content of the entire directory, download the 000-CONTENTS file and run
$ wget -i 000-CONTENTS
Please note that the size of the combined dataset (gzipped) is around 27GB.
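Where wget is not available, the same download can be scripted. The sketch below assumes that 000-CONTENTS lists one absolute URL per line (which is what wget -i expects when no base URL is given); the function names are hypothetical. Files that already exist locally are skipped, which helps when a download of this size is interrupted and restarted.

```python
import os
import urllib.request
from urllib.parse import urlsplit

def read_url_list(list_path):
    """Return the non-empty lines of the listing file as URLs."""
    with open(list_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def local_name(url):
    """Use the last path segment of the URL as the local file name."""
    return os.path.basename(urlsplit(url).path)

def download_all(list_path, dest_dir="."):
    """Fetch every listed file, skipping files that already exist locally."""
    for url in read_url_list(list_path):
        name = local_name(url)
        if not name:
            continue  # URL without a file name, e.g. a directory
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            continue  # assumed complete; delete partial files before re-running
        urllib.request.urlretrieve(url, target)
```

Calling `download_all("000-CONTENTS")` from the directory that holds the listing file would then fetch the files into the current directory.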
For questions about data format, server issues, or download problems, contact the BTC dataset coordinator at harth@kit.edu.
Enjoy!