Extracting Structured Data from the Common Crawl

The Web Data Commons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download, supporting researchers and companies in exploiting the wealth of information available on the Web.

Available Data Sets

RDFa, Microdata, and Microformat Data Sets

More and more websites embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as RDFa, Microdata, and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far, the project provides three data sets, extracted from the 2013, 2012, and 2010 Common Crawl corpora. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
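
For a sense of what such markup looks like in practice, the short sketch below (plain Python standard library, not the WDC extraction framework) pulls (type, property, value) triples out of a flat Microdata snippet; nested items and attribute-valued properties are ignored for brevity:

    from html.parser import HTMLParser

    SAMPLE = """
    <div itemscope itemtype="http://schema.org/Product">
      <span itemprop="name">Acme Anvil</span>
      <span itemprop="price">99.00</span>
    </div>
    """

    class MicrodataParser(HTMLParser):
        """Collects (itemtype, itemprop, text) triples from flat Microdata markup."""

        def __init__(self):
            super().__init__()
            self.itemtype = None   # itemtype of the enclosing itemscope
            self.prop = None       # itemprop waiting for its text content
            self.triples = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if "itemscope" in attrs:
                self.itemtype = attrs.get("itemtype")
            if "itemprop" in attrs:
                self.prop = attrs["itemprop"]

        def handle_data(self, data):
            if self.prop and data.strip():
                self.triples.append((self.itemtype, self.prop, data.strip()))
                self.prop = None

    parser = MicrodataParser()
    parser.feed(SAMPLE)
    print(parser.triples)
    # [('http://schema.org/Product', 'name', 'Acme Anvil'),
    #  ('http://schema.org/Product', 'price', '99.00')]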

Web Tables

The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a fraction are quasi-relational, meaning that they contain structured data describing a set of entities. Such tables are useful in application contexts such as data search, table augmentation, knowledge base construction, and various NLP tasks. The WDC Web Tables data set consists of the 147 million relational Web tables identified within the overall set of 11 billion HTML tables found in the Common Crawl.
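
Separating layout tables from relational ones is done by a classifier in the WDC pipeline; the toy heuristic below merely illustrates the kind of signals involved (row count, consistent column width, cell density) and is not the project's actual method:

    def looks_relational(rows, min_rows=3, min_cols=2):
        # Toy heuristic, not the WDC classifier: enough rows, a single
        # consistent column count, and mostly non-empty cells.
        if len(rows) < min_rows:
            return False
        widths = {len(row) for row in rows}
        if len(widths) != 1 or widths.pop() < min_cols:
            return False
        cells = [cell for row in rows for cell in row]
        return sum(1 for c in cells if c.strip()) / len(cells) > 0.8

    print(looks_relational([["City", "Country"],
                            ["Berlin", "DE"],
                            ["Paris", "FR"]]))   # True: dense, consistent columns
    print(looks_relational([["logo", ""],
                            ["", "nav"]]))       # False: too few rows, sparse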

Hyperlink Graph

We offer a large hyperlink graph extracted from the 2012 version of the Common Crawl. The WDC Hyperlink Graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. The graph can help researchers improve search algorithms, develop spam-detection methods, and evaluate graph-analysis algorithms. To the best of our knowledge, it is the largest hyperlink graph available to the public.
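
As a sketch of how such a graph might be processed, the snippet below streams an edge list and reports the pages with the highest in-degree. The assumed input format (one gzipped "source TAB target" pair of node ids per line) is illustrative; consult the data set documentation for the actual file layout:

    from collections import Counter
    import gzip

    def top_indegree(arc_path, k=10):
        # Streams an edge list of "source<TAB>target" numeric node ids and
        # returns the k most-linked-to pages. The tab-separated-pair format
        # is an assumption; check the layout of the actual WDC graph files.
        indegree = Counter()
        with gzip.open(arc_path, "rt") as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    indegree[parts[1]] += 1
        return indegree.most_common(k)

    # e.g. top_indegree("arcs.gz") -> [(node_id, in_links), ...]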

License

The extracted data is provided according to the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus.

The Web Data Commons extraction framework can be used under the terms of the Apache Software License.

Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Google Group.

About the Web Data Commons Project

The Web Data Commons project was started in 2012 by researchers from Freie Universität Berlin and the Karlsruhe Institute of Technology (KIT). The goal of the project is to facilitate research and support companies in exploiting the wealth of information on the Web by extracting structured data from web crawls and providing this data for public download. Today the WDC project is mainly maintained by the Data and Web Science Research Group at the University of Mannheim. The project is coordinated by Christian Bizer, who has moved from Berlin to Mannheim.

Credits

Web Data Commons is supported by the EU FP7 projects PlanetData and LOD2, by an Amazon Web Services in Education Grant Award, and by the German Research Foundation (DFG).
