The BTC 2014 dataset may be used as a basis for submissions to the Big Data (formerly: Billion Triples) Track of the Semantic Web Challenge.
The dataset was crawled between February and June 2014. We used several seed sets collected from multiple sources, including voID descriptions and all example URIs from CKAN; see the seed files in the data root directory.
We rewrote blank node identifiers to include the data source in order to provide unique blank nodes for each data source, and appended the data source to the output file. The data is encoded in NQuads format.
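As a rough illustration of working with the N-Quads encoding, the following sketch counts statements per data source by reading the last term of each line. It assumes that every line carries four terms and that the fourth term (the context) is an IRI; the function names are ours and not part of the dataset tooling.

```python
import gzip
from collections import Counter


def context_of(line):
    """Return the context IRI of one N-Quads line, or None if it cannot be read."""
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    body = line[:-1].rstrip()
    parts = body.rsplit(None, 1)
    if len(parts) != 2:
        return None
    last = parts[1]
    # An IRI term contains no whitespace, so the last token is the whole term.
    # Assumption: the line has four terms; for a plain triple this would
    # return the object instead.
    if last.startswith("<") and last.endswith(">"):
        return last[1:-1]
    return None


def count_quads_per_context(lines):
    """Count statements per context over an iterable of N-Quads lines."""
    counts = Counter()
    for line in lines:
        ctx = context_of(line)
        if ctx is not None:
            counts[ctx] += 1
    return counts


def count_file(path):
    """Stream a gzipped N-Quads file without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return count_quads_per_context(fh)
```

Streaming through `gzip.open` keeps memory use flat apart from the counter itself, which matters given the unzipped sizes of the larger files.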
We provide a file redirects.nx.gz, which consists of <source> <target> . tuples derived from 30x redirect HTTP response codes encountered during crawling. Also, Content-Location: declarations from the HTTP headers go into that file.
In addition, we provide access.log.gz files in Squid access.log format for each hop of each crawl.
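A minimal sketch of how the redirects file could be used to map a URI to its final target. It assumes that both terms on a line are IRIs in angle brackets; the function names and the `max_hops` limit are illustrative choices, not something the dataset prescribes.

```python
import gzip
import re

# One "<source> <target> ." tuple per line; lines that do not match are skipped.
REDIRECT_LINE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def load_redirects(lines):
    """Build a source -> target mapping from an iterable of lines."""
    redirects = {}
    for line in lines:
        m = REDIRECT_LINE.match(line)
        if m:
            # If a source occurs more than once, the last target wins.
            redirects[m.group(1)] = m.group(2)
    return redirects


def load_redirects_file(path):
    """Read a gzipped redirects file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return load_redirects(fh)


def resolve(uri, redirects, max_hops=10):
    """Follow redirects from uri, stopping at loops or after max_hops steps."""
    seen = {uri}
    for _ in range(max_hops):
        target = redirects.get(uri)
        if target is None or target in seen:
            break
        seen.add(target)
        uri = target
    return uri
```

The loop guard is there because redirect chains observed on the web can be cyclic. Keeping only one target per source is a simplification; a list per source would preserve every observed target.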
Please note that the BTC dataset is collected from the web and as such is of varying quality. For example, we saw during crawling that some URIs redirected to different targets at different points in time; e.g., at some point all URIs in a certain set redirected to an HTTP 500 error page. Dealing with noisy data is part of the fun you'll have when working with web data.
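One way to get a feeling for this noise is to tally the HTTP status codes recorded in an access log. The sketch below assumes the native Squid layout, in which the fourth whitespace-separated field is `action/status` and the seventh is the URL; the actual log files may deviate from this, and the function names are ours.

```python
from collections import Counter


def status_counts(lines):
    """Count HTTP status codes over an iterable of access.log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip lines that do not look like log entries
        # Field 4 looks like "TCP_MISS/200"; keep the part after the slash.
        _, _, status = fields[3].rpartition("/")
        counts[status] += 1
    return counts


def server_errors(lines):
    """Return the URLs of all entries with a 5xx status code."""
    urls = []
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue
        _, _, status = fields[3].rpartition("/")
        if status.startswith("5"):
            urls.append(fields[6])
    return urls
```

Both functions accept any iterable of text lines, so a handle from `gzip.open(path, "rt")` can be passed in directly.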
@misc{btc-2014,
  author = {Tobias K{\"a}fer and Andreas Harth},
  title = {{Billion Triples Challenge} data set},
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2014/},
  year = 2014,
}
We crawled breadth-first and politely. Thus, for each hop, we maintained a queue per pay-level domain (PLD) and took URIs from those queues in a round-robin fashion. If the number of PLDs that still had unvisited URIs fell below a certain threshold, we discarded the rest of the URIs and moved on to the next hop. Per PLD, we ordered the URIs in the queue by in-link count as a measure of importance, so that we did not discard important URIs. We motivated this way of crawling in our LDOW2012 paper.
To counteract the discarding, and because we know from experience that some servers face temporary outages, we fed URIs that answered with HTTP 500, or that had not been visited despite having been scheduled, back into the crawl at a later point.
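The scheduling of a single hop can be sketched roughly as follows. This is not the crawler we used: `naive_pld` is a crude stand-in that takes the last two host labels (a real PLD computation needs the public suffix list), `min_active_plds` is a hypothetical name for the threshold, and politeness delays between requests are left out.

```python
from collections import deque
from urllib.parse import urlsplit


def naive_pld(uri):
    """Crude PLD approximation: the last two labels of the host name."""
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def schedule_hop(uris, inlinks, min_active_plds=2):
    """Return (visit_order, discarded) for the URIs of one hop."""
    # One queue per PLD, with the most linked-to URIs at the front.
    by_pld = {}
    for uri in uris:
        by_pld.setdefault(naive_pld(uri), []).append(uri)
    queues = {
        pld: deque(sorted(group, key=lambda u: -inlinks.get(u, 0)))
        for pld, group in by_pld.items()
    }

    order = []
    while True:
        active = [pld for pld, queue in queues.items() if queue]
        # Stop the hop once too few PLDs still have unvisited URIs.
        if len(active) < min_active_plds:
            break
        # Round-robin: take one URI from each PLD that still has some.
        for pld in active:
            order.append(queues[pld].popleft())

    discarded = [uri for queue in queues.values() for uri in queue]
    return order, discarded
```

Because each queue is sorted by in-link count, the URIs left over when the hop ends are the least linked-to ones of their PLD.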
For each hop, we provide data-{round}.nq.gz, redirects-{round}.nx.gz and access-{round}.log.gz files. We split the largest files from rounds into smaller bites, i.e., data-{round}-{bite}.nq.gz.
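To process a hop as a whole, the data files can be grouped by round so that all bites of a round end up together. The sketch below only relies on the naming pattern given above; the concrete round and bite values used in the usage note (such as `13` or `a`) are made up for illustration.

```python
import re

# data-{round}.nq.gz or data-{round}-{bite}.nq.gz
DATA_FILE = re.compile(r'^data-([^-.]+)(?:-([^-.]+))?\.nq\.gz$')


def group_by_round(filenames):
    """Map each round to the sorted list of its data files."""
    rounds = {}
    for name in filenames:
        # Match on the base name so that paths like "data/..." work too.
        m = DATA_FILE.match(name.rsplit("/", 1)[-1])
        if m:
            rounds.setdefault(m.group(1), []).append(name)
    for files in rounds.values():
        files.sort()
    return rounds
```

For instance, `group_by_round(["data/data-13-b.nq.gz", "data/data-13-a.nq.gz"])` would put both names under the key `"13"`, while redirects and access log files are ignored.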
Crawl No. | Size (gzipped) | Size (unzipped) | Triple/Quad Count | Document Count | PLD Count |
01 | 3.7G | 46G | 211'918'262 | ||
02 | 4.0G | 59G | 247'881'388 | ||
03 | 4.4M | 86M | 412'327 | ||
04 | 19M | 203M | 868'219 | ||
05 | 7.6G | 114G | 501'136'390 | ||
06 | 0.3M | 5.1M | 27'623 | ||
07 | 5.3G | 137G | 563'477'028 | ||
08 | 0.5G | 12G | 40'061'653 | ||
09 | 7.5G | 157G | 565'906'367 | ||
10 | 2.1G | 44G | 164'488'613 | ||
11 | 0.6G | 11G | 40'688'530 | ||
12 | 28M | 610M | 2'371'959 | ||
13 | 15G | 421G | 1'399'507'594 | ||
14 | 5.3G | 93G | 352'012'643 | ||
Sum | 52G | 1.1T | 4'090'758'596 | 43'598'858 | 47'560 |
The files are listed in the data/ directory.
To fetch the content of the entire directory, download the 000-CONTENTS file and do
$ wget -x -nH -i 000-CONTENTS
which will download the files while preserving the directory structure.
For questions about the data format, server issues or download problems, contact tobias.kaefer(ät)kit.edu.
We would like to thank Max Schmachtenberg for helping with the seed list. We acknowledge the support of our chair's sysadmins, the Steinbuch Centre for Computing (SCC) and the European Community's Seventh Framework Programme FP7/2007-2013 (PlanetData, Grant 257641).