Web Data Commons - August 2012 Corpus - Download Instructions

This document contains instructions on how to access the August 2012 version of the Web Data Commons corpus as well as detailed statistics about the vocabularies that are used within the corpus.

Contents

Download the Extracted RDF Data
Download the Website-Class-Property Matrices
Download the Class-Property Co-occurrence Matrices
Download the Raw Extraction Statistics
Get the Code
Get Support

Download the Extracted RDF Data

The extracted RDFa, Microdata, and Microformats data is provided for download as N-Quads. The files are compressed using GZIP, and each file is around 100 MB in size. Overall, 1,416 files with a total size of 101 GB are provided.

List of download URLs for RDF from the August 2012 corpus (Example Content)

The extracted RDF data can be downloaded using wget with the following command:

    wget -i http://webdatacommons.org/downloads/2012-08/nquads/files.list

In order to make it easier to find the data from a specific website or top-level-domain, the N-Quads within the files are ordered by top-level-domain and pay-level-domain and listed in index files. The indexes consist of tab-separated values with the following structure: tld pld quad-file-name first-line last-line. The first column contains the top-level-domain, the second column the pay-level-domain, and the third column the name of the quad file that contains the data extracted from that pay-level-domain. The fourth and fifth columns specify the first and last line in the file that contain the data belonging to the pay-level-domain (a usage sketch follows the table below).

Format              Number of Files    Number of Quads    Index file
html-rdfa                       256      1,079,175,202    html-rdfa.nq.index.gz (~7 MB)
html-microdata                  401      1,488,063,426    html-microdata.nq.index.gz (~2 MB)
html-mf-geo                       6         32,722,603    html-mf-geo.nq.index.gz (~0.6 MB)
html-mf-hcalendar                75        142,975,309    html-mf-hcalendar.nq.index.gz (~0.5 MB)
html-mf-hcard                   501      3,547,824,107    html-mf-hcard.nq.index.gz (~20 MB)
html-mf-hrecipe                  10         50,898,293    html-mf-hrecipe.nq.index.gz (~37 KB)
html-mf-hlisting                 15         97,711,757    html-mf-hlisting.nq.index.gz (~60 KB)
html-mf-hresume                   1            678,097    html-mf-hresume.nq.index.gz (~18 KB)
html-mf-hreview                  65        207,589,518    html-mf-hreview.nq.index.gz (~300 KB)
html-mf-species                   1            127,568    html-mf-species.nq.index.gz (~1 KB)
html-mf-xfn                      75        703,188,115    html-mf-xfn.nq.index.gz (~6 MB)
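
A minimal usage sketch in Python, assuming the index file and the quad file it references have already been downloaded, that the quad file is still gzipped under the name given in the index, and that the line numbers in the index are 1-based; the function and example file names are our own illustration, not part of the corpus:

    import gzip

    def quads_for_pld(index_path, pld):
        """Yield all quads extracted from a given pay-level-domain."""
        # Collect the index entries for the pay-level-domain:
        # tld, pld, quad-file-name, first-line, last-line.
        entries = []
        with gzip.open(index_path, "rt", encoding="utf-8") as index:
            for line in index:
                fields = line.rstrip("\n").split("\t")
                if len(fields) == 5 and fields[1] == pld:
                    entries.append((fields[2], int(fields[3]), int(fields[4])))
        # Read the corresponding line ranges from the gzipped quad files.
        for file_name, first, last in entries:
            with gzip.open(file_name, "rt", encoding="utf-8") as quads:
                for number, quad in enumerate(quads, start=1):
                    if first <= number <= last:
                        yield quad
                    elif number > last:
                        break

    for quad in quads_for_pld("html-mf-geo.nq.index.gz", "example.com"):
        print(quad, end="")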

Download the Website-Class-Property Matrices

In order to make it easy for third parties to investigate the usage of different vocabularies and to generate seed lists for focused crawling endeavors, we have generated a Website-Class-Property matrix for each format. The matrices show which vocabulary term (class/property) is used by which website. The matrices are provided as ARFF files. Within the files, each website is represented by a single line containing the pay-level-domain name as well as binary values indicating whether the website uses a specific vocabulary term: if a website uses a class or property, the value is 1, and 0 otherwise. SampleMatrix.arff shows the structure of the Website-Class-Property matrices. The matrices cover all classes that are used by at least 5 different websites as well as all properties that are used by at least 10 different websites. For processing reasons, we replaced each property/class identifier by its rank in the PLD list (e.g. schema.org/Product is replaced by type-4 as it is the fourth most used Microdata class based on PLDs). The ranking can be found in the Co-Occurrence Matrix of the corresponding format. A parsing sketch follows the download links below.


RDFa Matrix (12 MB - arff)
Microdata Matrix (4.4 MB - arff)
Microformats Matrix (29 MB - arff)
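
Reading the ARFF files does not require any special library. The following is a minimal sketch, assuming dense data rows whose first attribute is the pay-level-domain name as described above; the file name is a placeholder for one of the matrices listed here:

    def terms_used_by(arff_path, pld):
        """Return the vocabulary-term attributes a website sets to 1."""
        attributes, in_data = [], False
        with open(arff_path, "r", encoding="utf-8") as arff:
            for line in arff:
                line = line.strip()
                if not line or line.startswith("%"):
                    continue  # skip blank lines and ARFF comments
                if line.lower().startswith("@attribute"):
                    attributes.append(line.split()[1])
                elif line.lower().startswith("@data"):
                    in_data = True
                elif in_data:
                    values = line.split(",")
                    if values[0].strip("'\"") == pld:
                        # Pair each term attribute with its binary value.
                        return [a for a, v in zip(attributes[1:], values[1:])
                                if v.strip() == "1"]
        return []

    # The terms are reported as rank identifiers such as type-4;
    # resolve them via the Co-Occurrence Matrix of the corresponding format.
    print(terms_used_by("html-microdata-matrix.arff", "example.com"))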

Download the Class-Property Co-occurrence Matrices

For each class, we have generated a vector indicating how many websites use specific properties together with this class. These vectors give an impression of the richness of the published data. The vectors are provided as Microsoft Excel files. The files cover all classes that are used by at least 5 websites and all properties that are used by at least 10 websites. A loading sketch follows the download links below.


RDFa Co-Occurrence Matrix (0.5 MB - xlsx)
Microdata Co-Occurrence Matrix (1.7 MB - xlsx)
Microformats Co-Occurrence Matrix (32 KB - xlsx)
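
The Excel files can also be inspected programmatically. A minimal sketch using the third-party openpyxl library; the file name is a placeholder, and since we make no assumption about the exact sheet layout, the sketch simply prints the first rows of each worksheet:

    from openpyxl import load_workbook  # pip install openpyxl

    workbook = load_workbook("rdfa-co-occurrence-matrix.xlsx", read_only=True)
    for sheet in workbook.worksheets:
        print("Sheet:", sheet.title)
        # Print the first five rows to discover the layout.
        for i, row in enumerate(sheet.iter_rows(values_only=True)):
            print(row)
            if i >= 4:
                break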

Download the Raw Extraction Statistics

In addition to the aggregated statistics described above, we also provide the raw extraction statistics, which indicate how many triples were found in each HTML page of the Common Crawl. The raw statistics also enable you to locate the Common Crawl .arc file that contains a specific HTML page. Be warned that the raw statistics files are rather large; we advise using a parser that is able to skip invalid lines, since such lines may be present in the files (see the sketch after the file links below). The files contain the following tab-separated columns (not in this order):

Source Data Columns

Result Data Columns

Sample Extraction Statistic File (tab-separated file)
Extraction Statistic File Header (tab-separated file)
Extraction Statistic File (17 GB) (tab-separated file)
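
Following the advice above, a minimal parsing sketch in Python that skips invalid lines; the local file names are placeholders, and we assume the header file consists of a single tab-separated line naming the columns:

    def read_statistics(header_path, stats_path):
        """Yield one dict per valid row, silently skipping malformed lines."""
        with open(header_path, "r", encoding="utf-8") as header:
            columns = header.readline().rstrip("\n").split("\t")
        with open(stats_path, "r", encoding="utf-8", errors="replace") as stats:
            for line in stats:
                fields = line.rstrip("\n").split("\t")
                if len(fields) != len(columns):
                    continue  # skip invalid lines instead of failing
                yield dict(zip(columns, fields))

    for row in read_statistics("extraction-stats-header.tsv",
                               "extraction-stats.tsv"):
        pass  # e.g. locate the Common Crawl .arc file for a given page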

Get the Code

The source code can be checked out from our Subversion repository. The extraction of the August 2012 corpus was done with version 0.0.1 of the extractor. After checking out the source code, create your own configuration by copying src/main/resources/ccrdf.properties.dist to src/main/resources/ccrdf.properties, then fill in your AWS authentication information and bucket names. Compilation is performed using Maven: changing into the source root directory and typing mvn install should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. More information about running the extractor is provided in the file readme.txt.

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.