Web Data Commons - Winter 2013 Corpus

This document contains instructions on how to download the Winter 2013 version of the Web Data Commons RDFa, Microdata, and Microformats data sets.

Download the Extracted Data

The extracted RDFa, Microdata, and Microformats data is provided for download as N-Quads. The files are compressed with GZIP, and each file is around 100 MB in size. In total, 3,398 files with a combined size of 332 GB are provided.
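
As a minimal sketch of how one of the downloaded files could be read, the following Python snippet streams a GZIP-compressed N-Quads file and counts the quads per graph label. It assumes that each statement sits on its own line and carries a fourth term (the graph label) without whitespace. The function name count_quads_per_graph and the simple string splitting are illustrative only; this is not a full N-Quads parser.

```python
import gzip
from collections import Counter


def count_quads_per_graph(path):
    """Count quads per graph label in a gzip-compressed N-Quads file.

    Assumption: every statement has a fourth term (the graph label) that
    contains no whitespace, so it is the last token before the final '.'.
    """
    counts = Counter()
    # "rt" decompresses on the fly and yields decoded text lines.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and lines without a terminator.
            if not line or line.startswith("#") or not line.endswith("."):
                continue
            body = line[:-1].rstrip()
            # The graph label is assumed to be the last whitespace-separated token.
            parts = body.rsplit(None, 1)
            if len(parts) == 2:
                counts[parts[1]] += 1
    return counts
```

Because the file is read line by line, memory use stays small even though each compressed file is around 100 MB.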

List of download URLs for RDF from the Winter 2013 corpus (Example Content)

The extracted RDF data can be downloaded using wget with the command wget -i http://webdatacommons.org/structureddata/2013-11/files/file.list. File lists for the individual formats can be found in the table below, together with the number of files and the total file size for each format.

Format	Number Of Files	Total File Size	File List
html-rdfa	685	66 GB	html-rdfa.list
html-microdata	1937	189 GB	html-microdata.list
html-mf-geo	37	4 GB	html-mf-geo.list
html-mf-hcalendar	28	2 GB	html-mf-hcalendar.list
html-mf-hcard	615	60 GB	html-mf-hcard.list
html-mf-hrecipe	9	890 MB	html-mf-hrecipe.list
html-mf-hlisting	9	890 MB	html-mf-hlisting.list
html-mf-hresume	1	4 MB	html-mf-hresume.list
html-mf-hreview	50	4 GB	html-mf-hreview.list
html-mf-species	1	6 MB	html-mf-species.list
html-mf-xfn	26	2 GB	html-mf-xfn.list
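
For readers who prefer a script over wget, the download step can be sketched in Python as follows. The sketch assumes that the file list contains one download URL per line (the format wget -i expects). The helper names and the target directory wdc-2013-11 are hypothetical, and only the file.list URL given above is taken from this page.

```python
import os
import urllib.request
from urllib.parse import urlparse

# URL of the complete file list, as given in the wget command above.
FILE_LIST_URL = "http://webdatacommons.org/structureddata/2013-11/files/file.list"


def parse_file_list(text):
    """Return the non-empty lines of a file list as download URLs."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def local_name(url):
    """Derive a local file name from the last path segment of a URL."""
    return os.path.basename(urlparse(url).path)


def download_all(list_url=FILE_LIST_URL, target_dir="wdc-2013-11"):
    """Fetch the file list, then download every file not already present.

    Note: a file left incomplete by an interrupted run is also skipped,
    so partial files should be deleted before restarting.
    """
    os.makedirs(target_dir, exist_ok=True)
    with urllib.request.urlopen(list_url) as response:
        urls = parse_file_list(response.read().decode("utf-8"))
    for url in urls:
        destination = os.path.join(target_dir, local_name(url))
        if os.path.exists(destination):
            continue  # skip files kept from an earlier run
        urllib.request.urlretrieve(url, destination)
```

Calling download_all() with the default list would fetch the complete corpus of roughly 332 GB, so it may be preferable to pass the URL of one of the smaller per-format lists instead.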

Get the Code

The source code can be checked out from our Subversion repository. The Winter 2013 extraction was done with version 0.0.2 of the extractor. After checking out the source code, create your own configuration by copying src/main/resources/ccrdf.properties.dist to src/main/resources/ccrdf.properties, then fill in your AWS authentication information and bucket names. The project is built with Maven, so changing into the source root directory and running mvn install should be sufficient to create a build. To run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. More information about running the extractor is provided in the file readme.txt.
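
The configuration and build steps described above could be scripted roughly as shown below. This is only a sketch: the function names prepare_config and build are hypothetical, the script assumes that Maven is available on the PATH as mvn, and the AWS credentials and bucket names still have to be entered into the copied file by hand.

```python
import shutil
import subprocess
from pathlib import Path


def prepare_config(source_root):
    """Copy ccrdf.properties.dist to ccrdf.properties unless it already exists."""
    resources = Path(source_root) / "src" / "main" / "resources"
    template = resources / "ccrdf.properties.dist"
    config = resources / "ccrdf.properties"
    if not config.exists():
        # Keep an existing configuration so edited credentials are not overwritten.
        shutil.copyfile(template, config)
    return config


def build(source_root):
    """Run 'mvn install' in the source root; raises if the build fails."""
    subprocess.run(["mvn", "install"], cwd=source_root, check=True)
```

A typical sequence would be to call prepare_config on the checked-out directory, edit the resulting ccrdf.properties, and then call build on the same directory.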

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

Download Instructions for the WDC RDFa, Microdata, and Microformats Data Sets (November 2013)
