Extracting Structured Data from the Common Web Crawl
Christian Bizer
Hannes Mühleisen
Andreas Harth
Steffen Stadtmüller
Robert Meusel
Michael Schuhmacher
Johanna Völker
Kai Eckert
Petar Petrovski

More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as RDFa, Microdata, and Microformats.
The Web Data Commons project extracts this data from several billion web pages. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.

Contents

1. About Web Data Commons
2. Extracted Data Formats
3. Extraction Results
4. Example Data
5. Extraction Process
6. Source Code
7. License
8. Feedback
9. Credits
10. References

1. About Web Data Commons

More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata, and Microformats. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF quads. In addition, we calculate and publish statistics about the deployment of the different formats as well as the vocabularies that are used together with each format.

Up to now, we have extracted all RDFa, Microdata and Microformats data from the November 2013, August 2012, February 2012, and 2009/2010 releases of the Common Crawl web corpus (see Section 3).

In the future, we plan to rerun our extraction on a regular basis as new Common Crawl corpora become available.

2. Extracted Data Formats

The table below provides an overview of the different structured data formats that we extract from the Common Crawl. It contains references to the specifications of the formats as well as short descriptions. Web Data Commons packages the extracted data for each format separately for download. The table also defines the format identifiers that are used in the remainder of this page.

Format | Description | Identifier
RDFa | RDFa is a specification for attributes to express structured data in any markup language, e.g. HTML. The underlying abstract representation is RDF, which lets publishers build their own vocabulary, extend others, and evolve their vocabulary with maximal interoperability over time. | html-rdfa
HTML Microdata | Microdata allows nested groups of name-value pairs to be added to HTML documents, in parallel with the existing content. | html-microdata
hCalendar Microformat | hCalendar is a calendaring and events format, using a 1:1 representation of standard iCalendar (RFC2445) VEVENT properties and values in HTML. | html-mf-hcalendar
hCard Microformat | hCard is a format for representing people, companies, organizations, and places, using a 1:1 representation of vCard (RFC2426) properties and values in HTML. | html-mf-hcard
Geo Microformat | Geo is a 1:1 representation of the "geo" property from the vCard standard, reusing the geo property and sub-properties as-is from the hCard microformat. It can be used to mark up latitude/longitude coordinates in HTML. | html-mf-geo
hListing Microformat | hListing is a proposal for a listings (UK English: small-ads; classifieds) format suitable for embedding in HTML. | html-mf-hlisting
hResume Microformat | The hResume format is based on a set of fields common to numerous resumes published today on the web, embedded in HTML. | html-mf-hresume
hReview Microformat | hReview is a format suitable for embedding reviews (of products, services, businesses, events, etc.) in HTML. | html-mf-hreview
hRecipe Microformat | hRecipe is a format suitable for embedding information about cooking recipes in HTML. | html-mf-recipe
Species Microformat | The Species proposal enables marking up taxonomic names for species in HTML. | html-mf-species
XFN Microformat | XFN (XHTML Friends Network) is a simple format to represent human relationships using hyperlinks. | html-mf-xfn
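
To give a concrete impression of what such embedded data looks like and what the extraction produces, the following minimal sketch (not taken from the Web Data Commons codebase) feeds a small, made-up schema.org Microdata snippet to the Apache Any23 library, which our extraction framework also uses (see Section 5), and prints the extracted triples to standard output. The page URL and product values are purely illustrative.

    import org.apache.any23.Any23;
    import org.apache.any23.source.DocumentSource;
    import org.apache.any23.source.StringDocumentSource;
    import org.apache.any23.writer.NTriplesWriter;
    import org.apache.any23.writer.TripleHandler;

    public class MicrodataExample {
        public static void main(String[] args) throws Exception {
            // A tiny HTML fragment with schema.org Microdata markup (made up).
            String html = "<div itemscope itemtype=\"http://schema.org/Product\">"
                        + "<span itemprop=\"name\">Example Widget</span>"
                        + "<span itemprop=\"price\">9.99</span>"
                        + "</div>";

            // Run only the Microdata extractor; the name corresponds to the
            // html-microdata identifier from the table above.
            Any23 runner = new Any23("html-microdata");
            DocumentSource source =
                    new StringDocumentSource(html, "http://example.com/product.html");
            TripleHandler handler = new NTriplesWriter(System.out);
            try {
                runner.extract(source, handler); // prints the extracted triples
            } finally {
                handler.close();
            }
        }
    }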

3. Extraction Results

3.1. Extraction Results from the November 2013 Common Crawl Corpus

The November 2013 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/crawl-data/CC-MAIN-2013-48/ .
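
As an illustration of how to access the corpus, the following sketch lists the first few files under this key prefix. It assumes the AWS SDK for Java is on the classpath and that credentials and a region are configured; note that the actual S3 keys carry no leading slash.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ListObjectsRequest;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class ListCrawlFiles {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            // List the first ten objects of the November 2013 crawl.
            ObjectListing listing = s3.listObjects(new ListObjectsRequest()
                    .withBucketName("aws-publicdatasets")
                    .withPrefix("common-crawl/crawl-data/CC-MAIN-2013-48/")
                    .withMaxKeys(10));
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
            }
        }
    }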

Extraction Statistics


Crawl Date | Winter 2013
Total Data | 44 Terabytes (compressed)
Parsed HTML URLs | 2,224,829,946
URLs with Triples | 585,792,337
Domains in Crawl | 12,831,509
Domains with Triples | 1,779,935
Typed Entities | 4,264,562,758
Triples | 17,241,313,916

Format Breakdown



3.2. Extraction Results from the August 2012 Common Crawl Corpus

The August 2012 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/parse-output/segment/ .

Extraction Statistics


Crawl Date | January-June 2012
Total Data | 40.1 Terabytes (compressed)
Parsed HTML URLs | 3,005,629,093
URLs with Triples | 369,254,196
Domains in Crawl | 40,600,000
Domains with Triples | 2,286,277
Typed Entities | 1,811,471,956
Triples | 7,350,953,995

Format Breakdown



Extraction Costs

The costs for parsing the 40.1 Terabytes of compressed input data of the August 2012 Common Crawl corpus, extracting the RDF data, and storing the extracted data on S3 totaled 398 USD in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge for the extraction, which altogether required 5,636 machine hours.

3.3. Extraction Results from the February 2012 Common Crawl Corpus

Common Crawl published a pre-release version of its 2012 corpus in February 2012. The pages contained in the pre-release are a subset of the pages contained in the August 2012 Common Crawl corpus. We also extracted the structured data from this pre-release. The resulting statistics can be found here, but are superseded by the August 2012 statistics.

3.4. Extraction Results from the 2009/2010 Common Crawl Corpus

The 2009/2010 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/crawl-002/ .

Extraction Statistics


Crawl Dates | Sept 2009 (4 TB), Jan 2010 (6.9 TB), Feb 2010 (4.3 TB), Apr 2010 (4.4 TB), Aug 2010 (3.6 TB), Sept 2010 (6 TB)
Total Data | 28.9 Terabytes (compressed)
Total URLs | 2,804,054,789
Parsed HTML URLs | 2,565,741,671
Domains with Triples | 19,113,929
URLs with Triples | 147,871,837
Typed Entities | 1,546,905,880
Triples | 5,193,276,058

Format Breakdown


Extraction Costs

The costs for parsing the 28.9 Terabytes of compressed input data of the 2009/2010 Common Crawl corpus, extracting the RDF data, and storing the extracted data on S3 totaled 576 EUR (excluding VAT) in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge for the extraction, which altogether required 3,537 machine hours.

4. Example Data

For each data format, we provide a small subset of the extracted data below for testing purposes. The data is encoded as N-Quads, with the fourth element used to represent the provenance of each triple (the URL of the page the triple was extracted from). Be advised to use a parser which is able to skip invalid lines, since such lines may be present in the data files.
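
The following sketch shows one way to skip invalid lines, assuming Apache Jena is used as the parser; the Web Data Commons framework itself is not tied to Jena, and any N-Quads parser that tolerates or reports bad lines will do. Each line of the dumps is parsed in isolation, and lines that raise an exception are simply counted and dropped.

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFParser;
    import org.apache.jena.riot.system.StreamRDFLib;

    public class NQuadsLineCheck {
        public static void main(String[] args) throws Exception {
            long valid = 0, skipped = 0;
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    try {
                        // Parse the single line as N-Quads; we only validate here,
                        // so the parser output is discarded.
                        RDFParser.create()
                                 .fromString(line)
                                 .lang(Lang.NQUADS)
                                 .parse(StreamRDFLib.sinkNull());
                        valid++;
                    } catch (Exception e) {
                        skipped++; // invalid line, skip it
                    }
                }
            }
            System.out.println(valid + " valid lines, " + skipped + " lines skipped");
        }
    }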

5. Extraction Process

Since the Common Crawl data sets are stored in the AWS Simple Storage Service (S3), it made sense to perform the extraction in the Amazon cloud (EC2). The main criterion here is the cost of achieving a certain task. Instead of using the ubiquitous Hadoop framework, we found that using the Simple Queue Service (SQS) for our extraction process increased efficiency. SQS provides a message queue implementation, which we use to coordinate the extraction nodes. The Common Crawl dataset is readily partitioned into compressed files of around 100MB each. We add the identifiers of each of these files as messages to the queue. A number of EC2 nodes monitor this queue and take file identifiers from it. The corresponding file is then downloaded from S3. Using the ARC file parser from the Common Crawl codebase, the file is split into individual web pages. On each page, we run our RDF extractor based on the Anything To Triples (Any23) library. The resulting RDF triples are then written back to S3 together with the extraction statistics, which are later collected. The advantage of this queue-based approach is that messages have to be explicitly marked as processed, which we only do after the entire file has been extracted. Should any error occur, the message is requeued after some time and processed again. A simplified sketch of this worker loop is shown below.
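
In the sketch, the queue URL, bucket name and the file-processing step are placeholders and the error handling is reduced to a minimum; it assumes the AWS SDK for Java and is not the actual Web Data Commons worker code.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3Object;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;

    public class ExtractionWorker {
        // Placeholder values, not the actual queue or bucket used by the project.
        private static final String QUEUE_URL =
                "https://sqs.us-east-1.amazonaws.com/000000000000/extraction-files";
        private static final String BUCKET = "aws-publicdatasets";

        public static void main(String[] args) {
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            while (true) {
                for (Message msg : sqs.receiveMessage(QUEUE_URL).getMessages()) {
                    String key = msg.getBody(); // S3 key of one ~100MB file
                    try (S3Object file = s3.getObject(BUCKET, key)) {
                        // Split the file into individual pages, run the
                        // Any23-based extractor on each page, and upload the
                        // resulting quads and statistics back to S3 (omitted).
                        process(file);
                        // Acknowledge the message only after the whole file
                        // has been processed.
                        sqs.deleteMessage(QUEUE_URL, msg.getReceiptHandle());
                    } catch (Exception e) {
                        // Leave the message unacknowledged; SQS makes it
                        // visible again after a timeout and it is retried.
                    }
                }
            }
        }

        private static void process(S3Object file) {
            // Placeholder for ARC parsing and Any23 extraction.
        }
    }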

Any23 parses web pages for structured data by building a DOM tree and then evaluating XPath expressions to find structured data. While profiling, we found this tree generation to account for much of the parsing cost, and we therefore searched for a way to reduce the number of times this tree is built. Our solution is to run (Java) regular expressions against each web page prior to extraction in order to detect whether a format might be present, and to only run the Any23 extractor when the regular expressions find potential matches. The formats html-mf-hcard, html-mf-hcalendar, html-mf-hlisting, html-mf-hresume, html-mf-hreview and html-mf-recipe define sufficiently unique class names, so that the presence of such a class name in the HTML document is ample indication of the Microformat being present. For the remaining formats, the following table shows the regular expressions used; a minimal sketch of the pre-filtering follows the table.

Format | Regular Expression
html-rdfa | (property|typeof|about|resource)\\s*=
html-microdata | (itemscope|itemprop\\s*=)
html-mf-xfn | <a[^>]*rel\\s*=\\s*(\"|')[^\"']*(contact|acquaintance|friend|met|co-worker|colleague|co-resident|neighbor|child|parent|sibling|spouse|kin|muse|crush|date|sweetheart|me)
html-mf-geo | class\\s*=\\s*(\"|')[^\"']*geo
html-mf-species | class\\s*=\\s*(\"|')[^\"']*species
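
In code, the pre-filter amounts to a handful of pre-compiled java.util.regex patterns that are matched against the raw HTML before Any23 is invoked; the class and method names below are illustrative, not the actual extractor code.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    public class FormatPreFilter {

        // Patterns taken from the table above; the Microformats with unique
        // class names (hCard, hCalendar, hListing, hResume, hReview, hRecipe)
        // are detected by simply searching for those class names instead.
        // The html-mf-xfn pattern from the table is omitted here for brevity.
        private static final Map<String, Pattern> FILTERS = new LinkedHashMap<>();
        static {
            FILTERS.put("html-rdfa",
                    Pattern.compile("(property|typeof|about|resource)\\s*="));
            FILTERS.put("html-microdata",
                    Pattern.compile("(itemscope|itemprop\\s*=)"));
            FILTERS.put("html-mf-geo",
                    Pattern.compile("class\\s*=\\s*(\"|')[^\"']*geo"));
            FILTERS.put("html-mf-species",
                    Pattern.compile("class\\s*=\\s*(\"|')[^\"']*species"));
        }

        /** Returns true if the page possibly contains data in the given format. */
        public static boolean mightContain(String format, String html) {
            Pattern pattern = FILTERS.get(format);
            return pattern != null && pattern.matcher(html).find();
        }
    }

Only pages for which such a check (or the class-name lookup for the remaining Microformats) succeeds are handed to Any23, so the DOM tree is only built when a match is likely.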

6. Source Code

The source code can be checked out from our Subversion repository. Afterwards, create your own configuration by copying src/main/resources/ccrdf.properties.dist to src/main/resources/ccrdf.properties, then fill in your AWS authentication information and bucket names. Compilation is performed using Maven; changing into the source root directory and typing mvn install should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. More information about running the extractor is provided in the file readme.txt.

7. License

The extracted data is provided according to the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus.

The Web Data Commons extraction framework can be used under the terms of the Apache Software License.

8. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons is found here.

9. Credits

Web Data Commons is a joint effort of the Research Group Data and Web Science at the University of Mannheim (Christian Bizer, Robert Meusel, Petar Petrovski) and the Institute AIFB at the Karlsruhe Institute of Technology (Andreas Harth, Steffen Stadtmüller). The initial version of the extraction code was written by Hannes Mühleisen, now working at CWI in Amsterdam.

Lots of thanks to

Web Data Commons is supported by the PlanetData and LOD2 research projects.


10. References