Billion Triples Challenge 2014 Dataset

The BTC 2014 dataset may be used as a basis for submissions to the Big Data (formerly: Billion Triples) Track of the Semantic Web Challenge.

Description

The dataset was crawled from February to June 2014. We used several seed sets collected from multiple sources, including VoID descriptions and all example URIs from CKAN; see the seed files in the data root directory.

We rewrote blank node identifiers to include the data source, in order to provide unique blank nodes for each data source, and appended the data source to the output. The data is encoded in N-Quads format, i.e., each statement carries its data source as the fourth element (context).
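
To illustrate the layout, here is a minimal Python sketch that counts quads per data source by reading the context (fourth element) of each statement. It assumes a per-round file named data-01.nq.gz (the actual files may be split into bites, see the Crawls section) and uses simplified whitespace parsing rather than a full N-Quads parser:

    import gzip
    from collections import Counter

    quads_per_source = Counter()
    with gzip.open("data-01.nq.gz", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # The last term before the closing " ." is the context, i.e. the data source.
            # (Simplified parsing; use a real N-Quads parser for anything serious.)
            source = line.rstrip(" .").rsplit(" ", 1)[-1]
            quads_per_source[source] += 1

    for source, count in quads_per_source.most_common(10):
        print(count, source)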

We provide a file redirects.nx.gz, which consists of <source> <target> . tuples derived from 30x HTTP redirect response codes encountered during crawling; Content-Location: declarations from the HTTP headers also go into that file. In addition, we provide access.log.gz files in Squid access.log format for each hop of each crawl.
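
As a sketch of how the redirect tuples could be used, the following Python snippet loads redirects.nx.gz into a dictionary and follows redirect chains; the resolve helper is hypothetical (not part of any dataset tooling) and the parsing is simplified:

    import gzip

    # Map each <source> to its <target>.
    redirects = {}
    with gzip.open("redirects.nx.gz", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            parts = line.rstrip(" .").split(" ")
            if len(parts) >= 2:
                redirects[parts[0]] = parts[1]

    def resolve(uri, max_hops=10):
        """Follow redirect tuples until the URI no longer redirects.
        The hop bound guards against redirect cycles in the crawled data."""
        seen = set()
        while uri in redirects and uri not in seen and len(seen) < max_hops:
            seen.add(uri)
            uri = redirects[uri]
        return uri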

Please note that the BTC dataset is collected from the web and is, as such, of varying quality. For example, we saw during crawling that some URIs redirected to different targets at different points in time; e.g., at one point the URIs in a certain set all redirected to an HTTP 500 error page. Dealing with noisy data is part of the fun you'll have when working with web data.

Citation

Please cite as:
@misc{btc-2014,
  author = {Tobias K{\"a}fer and Andreas Harth},
  title = {{Billion Triples Challenge} data set},
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2014/},
  year = 2014,
}

Crawls

We crawled breadth-first and politely. For each hop, we maintained a queue per pay-level domain (PLD) and took URIs from those queues in a round-robin fashion. If the number of PLDs still having unvisited URIs fell below a certain threshold, we discarded the remaining URIs and moved on to the next hop. Per PLD, we ordered the URIs in the queue by in-link count as a measure of importance, so that important URIs were not discarded. We motivated this way of crawling in our LDOW2012 paper.
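
The following Python sketch illustrates this per-PLD round-robin scheduling; it is a simplified model rather than the crawler actually used, and the threshold value and helper names are made up:

    from collections import deque
    from urllib.parse import urlparse

    MIN_ACTIVE_PLDS = 100  # illustrative threshold; not the value used for BTC 2014

    def pld(uri):
        # Simplification: use the host name as a stand-in for the pay-level domain.
        return urlparse(uri).netloc

    def schedule_hop(frontier, inlink_count):
        """Yield URIs for one hop: one queue per PLD, URIs ordered by in-link count,
        queues drained round-robin until too few PLDs still have unvisited URIs."""
        queues = {}
        for uri in frontier:
            queues.setdefault(pld(uri), []).append(uri)
        # Within each PLD, schedule the most referenced (most important) URIs first.
        queues = {d: deque(sorted(uris, key=lambda u: -inlink_count.get(u, 0)))
                  for d, uris in queues.items()}
        while len(queues) >= MIN_ACTIVE_PLDS:
            for domain in list(queues):
                yield queues[domain].popleft()
                if not queues[domain]:
                    del queues[domain]
        # URIs left in the remaining queues are discarded for this hop.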

To counteract the discarding, and because we know from experience that some servers face temporary outages, we fed URIs that answered with HTTP 500, or that had not been retrieved despite having been scheduled, back into the crawl at a later point.

For each hop, we provide data-{round}.nq.gz, redirects-{round}.nx.gz and access-{round}.log.gz files. We split the largest files from rounds into smaller bites, i.e., data-{round}-{bite}.nq.gz.

Crawl No.   Size (gzipped)   Size (unzipped)   Triple/Quad Count   Document Count   PLD Count
01          3.7G             46G                   211,918,262
02          4.0G             59G                   247,881,388
03          4.4M             86M                       412,327
04          19M              203M                      868,219
05          7.6G             114G                  501,136,390
06          0.3M             5.1M                       27,623
07          5.3G             137G                  563,477,028
08          0.5G             12G                    40,061,653
09          7.5G             157G                  565,906,367
10          2.1G             44G                   164,488,613
11          0.6G             11G                    40,688,530
12          28M              610M                    2,371,959
13          15G              421G                1,399,507,594
14          5.3G             93G                   352,012,643
Sum         52G              1.1T                4,090,758,596     43,598,858       47,560

Download

The files are listed in the data/ directory. To fetch the content of the entire directory, download the 000-CONTENTS file and run $ wget -x -nH -i 000-CONTENTS, which downloads the files while preserving the directory structure.

Contact

For questions about the data format, server issues, or download problems, contact tobias.kaefer(ät)kit.edu.

Acknowledgements

We would like to thank Max Schmachtenberg for helping with the seed list. We acknowledge the support of our chair's sysadmins, the Steinbuch Centre for Computing (SCC) and the European Community's Seventh Framework Programme FP7/2007-2013 (PlanetData, Grant 257641).

History

2014-08-07
Dataset posted

Tobias Käfer and Andreas Harth