Billion Triples Challenge 2014 Dataset

The BTC 2014 dataset may be used as a basis for submissions to the Big Data (formerly: Billion Triples) Track of the Semantic Web Challenge.

Description

The dataset was crawled between February and June 2014. We used several seed sets collected from multiple sources, including voID descriptions and all example URIs from CKAN; see the seed files in the data root directory.

We rewrote blank node identifiers to include the data source, so that blank nodes are unique per data source, and appended the data source to each statement in the output file. The data is encoded in N-Quads format.
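As a rough illustration of how the appended data source can be used, the following Python sketch streams a gzipped N-Quads file and counts statements per data source. It assumes that every line carries the data source as its last term in angle brackets; the function names context_of and count_quads_per_source are hypothetical and not part of the dataset tooling. It is not a full N-Quads parser.

```python
import gzip
from collections import Counter


def context_of(line):
    """Return the last term of an N-Quads line if it is an IRI, else None.

    Assumption: the data source is the final <...> term before the
    terminating dot. IRIs contain no whitespace, so splitting once from
    the right is enough to isolate it.
    """
    body = line.strip()
    if not body or body.startswith("#") or not body.endswith("."):
        return None
    body = body[:-1].rstrip()          # drop the terminating dot
    parts = body.rsplit(None, 1)       # split off the last term
    if len(parts) != 2:
        return None
    last = parts[1]
    if last.startswith("<") and last.endswith(">"):
        return last[1:-1]
    return None


def count_quads_per_source(path):
    """Count statements per data source in one gzipped N-Quads file."""
    counts = Counter()
    # errors="replace" keeps the scan going on badly encoded lines
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            source = context_of(line)
            if source is not None:
                counts[source] += 1
    return counts
```

Lines that do not end in an IRI are skipped rather than repaired, which seems a reasonable default for noisy web data.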

We provide a file redirects.nx.gz, which consists of <source> <target> . tuples derived from HTTP 30x redirect responses encountered during crawling. Content-Location: declarations from the HTTP headers are also recorded in that file. In addition, we provide access.log.gz files in Squid access.log format for each hop of each crawl.
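The redirect tuples can be loaded into a lookup table and followed to a final target. The Python sketch below assumes one <source> <target> . tuple per line, as described above; the names parse_redirects, resolve, load_redirects and the max_hops limit are hypothetical choices for illustration.

```python
import gzip
import re

# One "<source> <target> ." tuple per line
PAIR = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_redirects(lines):
    """Build a source -> target mapping; malformed lines are skipped.

    If a source appears several times (targets can change over time),
    the last tuple read wins.
    """
    redirects = {}
    for line in lines:
        match = PAIR.match(line.strip())
        if match:
            redirects[match.group(1)] = match.group(2)
    return redirects


def resolve(uri, redirects, max_hops=10):
    """Follow redirects from uri, stopping at loops or after max_hops."""
    seen = {uri}
    for _ in range(max_hops):
        target = redirects.get(uri)
        if target is None or target in seen:
            return uri
        seen.add(target)
        uri = target
    return uri


def load_redirects(path):
    """Read a gzipped redirects file into a mapping."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return parse_redirects(handle)
```

Keeping only the last target per source is a simplification; an analysis that cares about changing targets would need to store all of them.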

Please note that the BTC dataset is collected from the web and as such is of varying quality. For example, during crawling we saw some URIs redirect to different targets at different points in time; e.g., at some point the URIs in a certain set all redirected to an HTTP 500 error page. Dealing with noisy data is part of the fun you'll have when working with web data.

Citation

Please cite as:

@misc{btc-2014,
  author = {Tobias K{\"a}fer and Andreas Harth},
  title = {{Billion Triples Challenge} data set},
  howpublished = {Downloaded from http://km.aifb.kit.edu/projects/btc-2014/},
  year = 2014,
}

Crawls

We crawled breadth-first and politely. Thus, for each hop, we maintained a queue per pay-level domain (PLD) and took URIs from those queues in a round-robin fashion. If the number of PLDs still having unvisited URIs fell below a certain threshold, we discarded the rest of the URIs and moved on to the next hop. Per PLD, we ordered the URIs in the queue according to in-link count as a measure of importance, so that we did not discard important URIs. We motivated this way of crawling in our LDOW2012 paper.
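The scheduling of a single hop described above can be sketched as follows. This is a minimal Python illustration, not the crawler's actual code: the function crawl_hop, its parameters, and the choice to check the threshold once per round-robin pass are assumptions, and fetching is reduced to recording the visit order.

```python
from collections import deque


def crawl_hop(uris_by_pld, inlink_count, min_active_plds):
    """Schedule one hop: round-robin over per-PLD queues.

    uris_by_pld: PLD -> list of URIs discovered for this hop
    inlink_count: URI -> number of in-links (missing means 0)
    min_active_plds: hypothetical threshold; once fewer PLDs than this
        still have unvisited URIs, the remaining URIs are discarded
    Returns (visited, discarded).
    """
    # One queue per PLD, most-linked URIs first
    queues = {
        pld: deque(sorted(uris, key=lambda u: -inlink_count.get(u, 0)))
        for pld, uris in uris_by_pld.items()
        if uris
    }
    visited = []
    while queues and len(queues) >= min_active_plds:
        # One pass takes a single URI from every PLD that has work left
        for pld in list(queues):
            visited.append(queues[pld].popleft())
            if not queues[pld]:
                del queues[pld]
    discarded = [uri for queue in queues.values() for uri in queue]
    return visited, discarded
```

Because each queue is sorted by in-link count, the URIs that end up discarded are the least-linked ones of the PLDs that still had work left.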

To counteract the discarding, and because we know from experience that some servers face temporary outages, we fed URIs that answered with HTTP 500, or that had not been visited despite having been scheduled, back into the crawl at a later point.

For each hop, we provide data-{round}.nq.gz, redirects-{round}.nx.gz and access-{round}.log.gz files. We split the largest of these per-round files into smaller bites, i.e., data-{round}-{bite}.nq.gz.

Crawl No.	Size (gzipped)	Size (unzipped)	Triple/Quad Count	Document Count	PLD Count
01	3.7G	46G	211'918'262
02	4.0G	59G	247'881'388
03	4.4M	86M	412'327
04	19M	203M	868'219
05	7.6G	114G	501'136'390
06	0.3M	5.1M	27'623
07	5.3G	137G	563'477'028
08	0.5G	12G	40'061'653
09	7.5G	157G	565'906'367
10	2.1G	44G	164'488'613
11	0.6G	11G	40'688'530
12	28M	610M	2'371'959
13	15G	421G	1'399'507'594
14	5.3G	93G	352'012'643
Sum	52G	1.1T	4'090'758'596	43'598'858	47'560

Download

The files are listed in the data/ directory. To fetch the content of the entire directory, download the 000-CONTENTS file and run $ wget -x -nH -i 000-CONTENTS, which downloads the files while preserving the directory structure.
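Where wget is not available, a similar download can be approximated in Python. The sketch below assumes that 000-CONTENTS lists one URL per line (as wget -i expects) and mirrors the URL path below the current directory without a host directory; the helper names local_path and fetch_all are hypothetical, and there is no retry or resume logic.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def local_path(url):
    """Map a URL to a relative path that mirrors its directory structure."""
    # Drop empty and dot segments so the path stays below the current directory
    parts = [p for p in urlsplit(url).path.split("/") if p not in ("", ".", "..")]
    return os.path.join(*parts) if parts else "index.html"


def fetch_all(contents_file):
    """Download every URL listed in contents_file, one URL per line."""
    with open(contents_file, encoding="utf-8") as handle:
        urls = [line.strip() for line in handle if line.strip()]
    for url in urls:
        target = local_path(url)
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, target)
```

Given the size of the dataset, wget is likely the more robust choice for the full download.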

Contact

For questions about the data format, server issues or download problems, contact tobias.kaefer(ät)kit.edu.

The Subsequent BTC Dataset

BTC 2019

Previous BTC Datasets

Acknowledgements

We would like to thank Max Schmachtenberg for helping with the seed list. We acknowledge the support of our chair's sysadmins, the Steinbuch Centre for Computing (SCC) and the European Community's Seventh Framework Programme FP7/2007-2013 (PlanetData, Grant 257641).

History

2014-08-07: Dataset posted

Tobias Käfer and Andreas Harth