A Corpus of Web Tables extracted from the Common Crawl
Petar Ristoski
Oliver Lehmberg
Heiko Paulheim
Robert Meusel
Christian Bizer
Alexander Diete
Nicolas Heist
Sascha Krstanovic
Thorsten Andre Knöller


This page provides a large corpus of HTML tables for public download. The corpus has been extracted from the 2012 version of the Common Crawl and contains 147 million relational Web tables. Below, we provide instructions for downloading the corpus as well as basic statistics about the tables' content.

Contents

1. Application Potential of Web Tables
2. The WDC Web Tables Extraction
3. Web Tables Statistics
4. Data Formats and Download
5. Feedback
6. Credits
7. Other Web Table Corpora
8. References

1. Application Potential of Web Tables

The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a fraction of them is quasi-relational, meaning that they contain structured data describing a set of entities. An initial study by Cafarella et al. [Cafarella2008] showed that out of 14 billion HTML tables in the Google crawl, 154 million tables contained relational data (1.1%). In [Crestan2011], the authors provide a fine-grained classification of the different types of HTML tables found on the Web, based on the Bing web crawl.

Recently, a body of interesting research has developed around exploiting relational Web tables for the following applications:

  1. Table Search: A user in need of specific data is likely to find a table containing the desired data somewhere on the Web. [Cafarella2009], [Venetis2010], and [Pimplikar2012] propose methods for finding matching tables given a set of keywords describing the information need. An example of a table search service is Google Table Search, which allows users to search for Web tables as well as public Google Fusion Tables.
  2. Table Extension: [Cafarella2009], [Yakout2012], [Bhagavatula2013], [DasSarma2012], and [Bizer2014] propose methods for extending a local table with additional columns based on the corpus of tables published on the Web. A user having a local table describing countries could, for instance, use these methods to add columns containing the population, total area, and capital of each country to their table.
  3. Knowledge Base Augmentation: The data contained in Web tables can also be used to augment and extend general-purpose knowledge bases such as DBpedia, Freebase, or YAGO. [Zhang2013] propose a method for extending the DBpedia knowledge base with data from Web tables. [Wang2012] propose a method for building a comprehensive knowledge base that uses linguistic patterns to extract a probabilistic taxonomy of classes and entities from Web texts and afterwards adds facts about the attributes of the entities using Web tables.
  4. NLP Tasks: The table corpus could also be useful for various natural language processing tasks, as the values that appear in the same column are often closely related to each other, which can, for example, be exploited for disambiguation.

2. The WDC Web Tables Extraction

In order to facilitate research in the areas mentioned above, we have extracted a corpus of relational Web tables from the 2012 version of the Common Crawl Web corpus. The Common Crawl corpus contains 3.5 billion HTML documents which originate from over 40 million pay-level domains.

[Cafarella2008], [Crestan2011], and [Wang2002] propose classification methods for distinguishing between relational Web tables and other HTML tables. Based on the features proposed in these papers, we implemented a classifier for detecting relational Web tables. The classifier works in two steps. First, it filters out all HTML tables that are not innermost tables (i.e., that contain other tables in their cells) as well as all tables that contain fewer than 5 cells or fewer than 3 rows. This step already discards around 90% of all HTML tables. Second, we parse the remaining tables and classify them as relational or non-relational using a mixture of layout and content features. The layout features include the average and standard deviation of the column count, row count, and cell length, as well as the length consistency of the rows and columns. The content features are the percentages of link, form, image, empty, digit, and text cells, as well as the content type consistency for rows and columns.
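The following is a minimal sketch of the filtering step and a subset of the features, assuming each table has already been parsed into a list of rows of cell strings; the consistency features and the trained classification model itself are omitted, and the helper names are illustrative:

```python
from statistics import mean, pstdev

def passes_step_one(rows, contains_nested_table):
    """Step 1: keep only innermost tables with at least 5 cells and 3 rows."""
    cell_count = sum(len(row) for row in rows)
    return not contains_nested_table and cell_count >= 5 and len(rows) >= 3

def layout_features(rows):
    """Step 2 (excerpt): average and standard deviation of the column count
    and of the cell content length."""
    col_counts = [len(row) for row in rows]
    cell_lens = [len(cell) for row in rows for cell in row]
    return {"avg_cols": mean(col_counts), "std_cols": pstdev(col_counts),
            "avg_cell_len": mean(cell_lens), "std_cell_len": pstdev(cell_lens)}

def content_features(rows):
    """Step 2 (excerpt): fractions of empty, digit-only, and text cells.
    Link, form, and image cells would require the original HTML markup."""
    cells = [cell.strip() for row in rows for cell in row]
    n = len(cells) or 1
    empty = sum(cell == "" for cell in cells)
    digit = sum(cell.isdigit() for cell in cells)
    return {"pct_empty": empty / n, "pct_digit": digit / n,
            "pct_text": (n - empty - digit) / n}
```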

Altogether, we parsed 3,301,089,681 HTML pages from the Common Crawl 2012 web corpus. In these pages we found a total of 11,245,583,889 innermost tables, i.e., an average of 3.4 innermost tables per page. Out of these tables, 147,636,113 were classified as relational tables (1.3%). The tables originate from 101,177,544 different pages.

In order not to miss relational tables, we tuned our classifier for recall at the cost of precision. An evaluation on a test set of several thousand tables showed that only around 60% of the tables classified as relational are actually relational tables (compared to the 80% precision reported in [Cafarella2008]).

We ran the table extraction on Amazon EC2 using 100 AWS c1.xlarge machines, which ran an adapted version of the extraction framework that we also used to extract the WebDataCommons RDFa, Microdata, and Microformat Data Set from the Common Crawl. The extraction and table classification together took around 1,500 machine hours, which cost around 150 USD.

3. Web Tables Statistics

Below, we provide the results of some initial profiling of the WDC Web Tables data set. In addition, we provide statistics about a reduced subset of the corpus that focuses on larger tables on likely English-language webpages in a separate document.

3.1 TLDs Distribution

Figure 1 shows the distribution of extracted Web tables per top-level domain.

Fig. 1 - Number of tables per TLD

The complete distribution of tables per top-level domain can be found here. The file contains two tab-separated fields per line, TLD and #tables. E.g., the first entry of the file, com 75229798, means that 75,229,798 tables were extracted from the "com" domain. All distribution files referenced below share this two-field layout.
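A minimal sketch for reading such a file (the file name is a placeholder for the downloaded distribution file):

```python
import csv

def read_distribution(path):
    """Read a two-field distribution file, e.g. 'com<TAB>75229798',
    and return a dict mapping key -> number of tables."""
    with open(path, encoding="utf-8") as f:
        return {key: int(count) for key, count in csv.reader(f, delimiter="\t")}

# Example: print the five TLDs with the most extracted tables.
tlds = read_distribution("tld_distribution.tsv")  # placeholder file name
for tld, count in sorted(tlds.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(tld, count)
```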

3.2 Number of Columns and Rows Distribution

The table below provides basic statistics on the tables' size in the complete corpus. The row count excludes the header row and thus refers to the data rows of the table.

|         | min. | max.   | average | median |
|---------|------|--------|---------|--------|
| columns | 2    | 2,368  | 3.49    | 3      |
| rows    | 1    | 70,068 | 12.41   | 6      |

3.2.1 Number of Columns Distribution

Figure 2 shows the distribution of the number of columns per table.

Fig. 2 - Distribution of Number of Columns per Table


The complete distribution of the number of columns per table can be found here. The file contains two tab-separated fields per line, #columns and #tables. E.g., the first entry of the file, 2 70147349, means that 70,147,349 tables have exactly two columns.

3.2.2 Number of Rows Distribution

Figure 3 shows the distribution of the number of data rows per table. Data rows are all rows that are positioned below the header row and contain at least one non-empty cell.

Fig. 3 - Distribution of Number of Rows per Table

The complete distribution of the number of rows per table can be found here. The file contains two tab-separated fields per line, #rows and #tables. E.g., the first entry of the file, 1 426104, means that 426,104 tables have exactly one data row.

3.3 Headers Distribution

In order to get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. Our heuristic assumes that the column headers are located in the first row of a table whose number of non-empty cells is at least 80% of the maximum number of non-empty cells found in any row of that table. The heuristic will fail on vertical tables [Crestan2011], on tables that require more sophisticated header unfolding [Chen2013], as well as on tables that do not have headers at all (20% of all tables according to [Pimplikar2012]). We also did not take column name synonyms like 'population' and 'number of inhabitants' into account. Thus, the numbers presented below should be understood as lower bounds.
With this approach, we were able to identify a total of 509,351,189 column headers, out of which 28,072,596 are distinct.
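A minimal sketch of the header heuristic, assuming a non-empty table given as a list of rows of cell strings:

```python
def detect_header_row(rows):
    """Return the index of the first row whose number of non-empty cells is
    at least 80% of the maximum number of non-empty cells in any row."""
    non_empty = [sum(1 for cell in row if cell.strip()) for row in rows]
    threshold = 0.8 * max(non_empty)
    # The densest row itself always qualifies, so next() cannot fail.
    return next(i for i, count in enumerate(non_empty) if count >= threshold)
```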

Figure 4 shows the number of tables in the corpus that contain some popular column headers.

Fig. 4 - Popular Column Headers

The complete distribution of headers can be found here. The file contains two tab-separated fields per line, header and #tables. E.g., the first entry of the file, name 4653155, means that 4,653,155 tables contain a column with the header 'name'.

To get a better understanding of which topics are covered in the corpus, we performed a rough matching against the cross-domain knowledge base DBpedia, which is a structured-data version of a subset of Wikipedia. We scanned the tables for DBpedia properties that also appear as table headers in our dataset. The complete list can be found here. The file contains two tab-separated fields per line, DBpediaProperty and #tables. E.g., the entry title 2121028 means that 2,121,028 tables contain a column with the header 'title'.
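A minimal sketch of such a matching, assuming the DBpedia property labels are available as a set of strings; the trimming and case-folding shown are assumed normalizations, as the exact matching procedure is not described here:

```python
def match_headers(table_headers, dbpedia_property_labels):
    """Return the table headers that equal a DBpedia property label
    after trimming and case-folding (assumed normalization)."""
    labels = {label.strip().lower() for label in dbpedia_property_labels}
    return {header for header in table_headers
            if header.strip().lower() in labels}

# Toy example with a hand-picked property list:
print(match_headers({"Title", "Population", "Foo"}, {"title", "population"}))
# -> {'Title', 'Population'}
```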

3.4 Label Distribution

Most applications working with Web tables assume that the tables are entity-attribute tables containing a string column that provides the name of the described entity (the label column). To get an initial insight into the entity coverage of the corpus, we determined the label column of each table using a simple heuristic and counted the value occurrences in the label columns of all Web tables. Our heuristic assumes the left-most column that contains neither numbers nor dates and has almost unique values to be the label column. [Venetis2010] report an accuracy of 83% using a similarly simple heuristic.
Before counting, all values are normalized and stop-words are removed; e.g., the music album name The Dark Side of the Moon is normalized to dark side moon. While counting the value occurrences, we do not take surface form synonyms (like 'New York' and 'New York City') into account. Thus, the reported numbers should be understood as lower bounds. In the corpus of Web tables, we were able to identify a total of 1,742,015,870 label column values, of which 253,001,795 are distinct.
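A minimal sketch of the label column heuristic and the value normalization; the stop-word list, the type patterns, and the 95% uniqueness threshold are assumptions, as the exact parameters are not specified above:

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "and"}  # illustrative subset

def normalize(value):
    """Lower-case, drop punctuation and stop-words, e.g.
    'The Dark Side of the Moon' -> 'dark side moon'."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

NUM_OR_DATE = re.compile(r"-?\d+([.,]\d+)?|\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}")

def mostly_numbers_or_dates(values):
    """Rough check whether a column is numeric or date-typed."""
    hits = sum(bool(NUM_OR_DATE.fullmatch(v.strip())) for v in values)
    return hits > len(values) / 2

def detect_label_column(columns):
    """Return the index of the left-most column that is neither numeric nor
    a date and whose values are almost unique (assumed 95% threshold)."""
    for index, values in enumerate(columns):
        cells = [v for v in values if v.strip()]
        if not cells or mostly_numbers_or_dates(cells):
            continue
        if len(set(cells)) / len(cells) >= 0.95:
            return index
    return None
```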

Table 1 shows the value coverage for a number of different topics.

| Country | #Tables | City | #Tables | River | #Tables | Movie | #Tables | Camera Model | #Tables | Music Album | #Tables | Footballer | #Tables |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| usa | 135688 | new york | 59398 | mississippi | 87367 | avatar | 11080 | nikon d 200 | 1390 | thriller | 4268 | robin van persie | 7439 |
| germany | 91170 | luxembourg | 47722 | lena | 8717 | inception | 8121 | canon eos 20 d | 480 | aftermath | 2466 | david beckham | 3041 |
| japan | 76512 | berlin | 46850 | don | 6504 | taxi | 6292 | canon eos 40 d | 355 | twist shout | 2017 | cristiano ronaldo | 2927 |
| united states | 73169 | london | 37541 | mackenzie | 3346 | titanic | 4270 | nikon d 5000 | 351 | true blue | 1737 | lionel messi | 1748 |
| italy | 71129 | amsterdam | 31548 | yangtze | 2241 | fantastic four | 2113 | canon eos 30 d | 346 | like prayer | 1616 | ronaldo | 1716 |
| austria | 56622 | madrid | 30486 | oka | 1708 | moulin rouge | 1616 | nikon d 80 | 339 | like virgin | 1414 | gareth bale | 1708 |
| netherlands | 56533 | andorra | 21075 | loire | 1096 | black knight | 1298 | canon eos 50 d | 304 | yellow submarine | 1405 | fernando torres | 1641 |
| mexico | 55267 | dublin | 19790 | tigris | 946 | deception | 1286 | nikon d 90 | 274 | dark side moon | 1201 | frank lampard | 1461 |
| belgium | 53175 | athens | 12228 | volga | 904 | minority report | 1201 | canon eos 10 d | 248 | abbey road | 971 | thierry henry | 1332 |
| ireland | 48543 | budapest | 9702 | sava | 873 | ice age | 1201 | nikon d 60 | 233 | something new | 919 | ronaldinho | 1195 |
| denmark | 48389 | helsinki | 7761 | volta | 710 | unfaithful | 1179 | nikon d 100 | 191 | please please me | 886 | roberto carlos | 817 |
| finland | 45156 | bern | 5839 | vardar | 595 | glitter | 943 | canon eos d 30 | 172 | shine light | 833 | xabi alonso | 735 |
| greece | 42314 | new york city | 5611 | kama | 582 | joy ride | 674 | sony cybershot dsc w120 | 104 | some girls | 801 | oliver kahn | 710 |
| russia | 41729 | brussels | 5305 | tisa | 552 | from hell | 520 | canon eos d 60 | 93 | sticky fingers | 740 | sergio ramos | 647 |
| hungary | 38536 | copenhagen | 4949 | ural | 437 | just married | 459 | sony cybershot dsc s3000 | 67 | one day your life | 711 | paolo maldini | 638 |
| malta | 37009 | bratislava | 4938 | indus | 420 | shallow hal | 265 | sony cybershot dsc w520 | 64 | exciter | 543 | zinedine zidane | 517 |
| bulgaria | 36523 | belgrade | 4460 | elbe | 382 | highn crimes | 247 | sony cybershot dsc w510 | 62 | let bleed | 492 | fabio cannavaro | 348 |
| croatia | 29022 | lisbon | 4194 | danube | 365 | monkeybone | 228 | olympus e 500 | 53 | rubber soul | 464 | rivaldo | 331 |
| egypt | 27725 | kiev | 2406 | rhine | 352 | like mike | 175 | sony cybershot dsc w570 | 45 | blood dance floor | 382 | roberto baggio | 251 |
| cyprus | 25828 | bucharest | 2180 | seine | 225 | joe somebody | 160 | olympus e 30 | 38 | black celebration | 338 | marco van basten | 243 |

Table 1 - Value Coverage

3.5 Column Data Types Distribution

We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each cell value was detected using five pre-defined data types: string, numeric, date, boolean, and list. Afterwards, the most frequent data type among a column's values was chosen as the data type of the column.
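A minimal sketch of this majority vote; the per-type detection patterns are illustrative assumptions, as the exact rules are not specified above:

```python
import re
from collections import Counter

def guess_cell_type(value):
    """Assign one of the five pre-defined data types to a single cell value."""
    v = value.strip().lower()
    if v in ("true", "false", "yes", "no"):
        return "boolean"
    if re.fullmatch(r"-?\d+([.,]\d+)?", v):
        return "numeric"
    if re.fullmatch(r"\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}", v):
        return "date"
    if "," in v or ";" in v:
        return "list"
    return "string"

def guess_column_type(values):
    """Choose the most frequent cell type as the data type of the column."""
    types = Counter(guess_cell_type(v) for v in values if v.strip())
    return types.most_common(1)[0][0] if types else "string"
```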

Figure 5 shows the distribution of column data types.

Fig. 5 - Column Data Types Distribution

4. Data Formats and Download

The main corpus of Web tables is divided into 854,083 gzip files. Each gzip file contains the Web tables extracted from a few thousand Web pages. For each Web page that contains at least one extracted Web table, we provide the corresponding HTML file, the set of extracted Web tables in CSV format, and a JSON file that contains metadata about the extraction of the Web tables. Each JSON file contains the URL of the Web page, a reference to the corresponding HTML file in the gzip file, and information about each of the extracted Web tables. All files that refer to the same Web page share the same file name prefix, e.g. a JSON file with the name 71657325_XXXXXXX.json refers to the HTML file 71657325_YYYYYY and a list of CSV files 71657325_0_ZZZZZZZ.csv, 71657325_1_ZZZZZZZ.csv, etc. For each of the extracted Web tables, the JSON file contains the position of the table inside the HTML file as well as basic statistics about the data in the table. The complete JSON Schema can be found here.
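A minimal sketch for grouping the files of one Web page via this naming convention, assuming an archive has already been unpacked into a local directory (the directory name is a placeholder):

```python
import os
from collections import defaultdict

def group_files_by_page(directory):
    """Group unpacked files by the numeric prefix shared by a page's
    HTML file, JSON metadata file, and CSV tables."""
    pages = defaultdict(list)
    for name in os.listdir(directory):
        prefix = name.split("_", 1)[0]  # e.g. '71657325' in '71657325_0_ZZZ.csv'
        pages[prefix].append(name)
    return pages

# Example: count the extracted tables per page.
for page, files in group_files_by_page("unpacked").items():  # placeholder dir
    tables = [f for f in files if f.endswith(".csv")]
    print(page, "->", len(tables), "tables")
```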

Additionally, we split the main corpus into two smaller corpora based on the top-level domain. The first corpus contains only Web tables that were extracted from English top-level domains, such as "com", "org", "net", "eu", and "uk". The second corpus contains the remaining tables, which were extracted from non-English top-level domains. Each of the corpora consists of 773,883 tar files, where each tar file contains a couple of thousand Web tables in CSV format. The metadata files and HTML files are not included in these corpora. The statistics for the corpus of English Web tables can be found in a separate document.

The corpora of Web tables are hosted on the Amazon Simple Storage Service (S3). Due to the large size of the corpora and the Amazon data transfer fees, we offer the files within a Requester Pays bucket. In a nutshell, you will need an AWS account to download the files using the s3cmd command line tool (version 1.5.0-alpha1 or later, as earlier versions do not properly support the additional header information; see link). Please note: the costs for downloading will be charged by Amazon to your AWS account. Based on the S3 pricing, the estimated download costs for the complete corpus are around 175 USD. When operating on the data within the same Amazon region (US Oregon), access is free of charge.
S3cmd is available as a standard package for almost all UNIX systems (e.g. on Ubuntu: apt-get install s3cmd). You can request all files at once using the command s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://WebTablesExtraction/c*

The following table contains the S3 paths for downloading the datasets.

| Data Set | Amazon S3 Path | Size | #Files |
|---|---|---|---|
| Complete Corpus | s3://WebTablesExtraction/c* | 1019 GB | 854,083 (.gz) |
| Web Tables from English TLDs | s3://SearchJoin-tables/englishTar/ | 29.7 GB | 773,883 (.tar) |
| Web Tables from non-English TLDs | s3://SearchJoin-tables/nonEnglishTar/ | 26.3 GB | 773,883 (.tar) |
You can download free data samples via the following links:

5. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons can be found here.

6. Credits

The extraction of the Web tables corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), by an Amazon Web Services in Education Grant award, and by the EU FP7 research project PlanetData.


7. Other Web Table Corpora

Besides the WDC Web Tables Corpus, there are several other publicly accessible Web table corpora:

  1. DBpedia as Tables: A tabular representation of the DBpedia knowledge base containing Wikipedia infobox data. The instances of each of the 530 classes in the knowledge base are provided as a separate table. Altogether, the tables cover 4 million entities.
  2. WikiTables: A corpus consisting of 15 million non-infobox tables extracted from Wikipedia.
  3. publicdata.eu: A data portal containing references to 46,000 government data sets, which are mostly tabular.
  4. data.gov: A data portal containing 85,000 data sets provided by the US government. The data sets are mostly tabular.
  5. WebDataCommons RDFa, Microdata and Microformat Data Set: An RDF data set consisting of 7 billion RDFa, Microdata, and Microformat triples which originate from 40 million different websites and have been extracted from the 2012 version of the Common Crawl. The data can easily be translated into tables (one table per class and website).
  6. Billion Triples Challenge 2012 Data Set: A corpus consisting of 1.4 billion Linked Data triples that were crawled from around 1,000 different Linked Data sources on the Web. The data can easily be translated into tables (one table per class and data source).

8. References

  1. [Cafarella2008] Michael J. Cafarella, Eugene Wu, Alon Halevy, Yang Zhang, Daisy Zhe Wang: WebTables: exploring the power of tables on the web. VLDB 2008.
  2. [Crestan2011] Eric Crestan and Patrick Pantel: Web-scale table census and classification. WSDM 2011.
  3. [Cafarella2009] Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova: Data integration for the relational web. Proc. VLDB Endow. 2009.
  4. [Venetis2010] Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu: Table Search Using Recovered Semantics. 2010.
  5. [Yakout2012] Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri: InfoGather: entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12), 2012.
  6. [Bhagavatula2013] Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey: Methods for exploring and mining tables on Wikipedia. In Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA '13), 2013.
  7. [Pimplikar2012] Rakesh Pimplikar and Sunita Sarawagi: Answering table queries on the web using column keywords. Proc. VLDB Endow. 5:10, 2012.
  8. [DasSarma2012] Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu: Finding related tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12), 2012.
  9. [Zhang2013] Xiaolu Zhang et al.: Mapping entity-attribute web tables to web-scale knowledge bases. In: Database Systems for Advanced Applications. Springer, 2013.
  10. [Wang2012] Jingjing Wang, Haixun Wang, Zhongyuan Wang, and Kenny Q. Zhu: Understanding tables on the web. In Proceedings of the 31st international conference on Conceptual Modeling (ER'12), 2012.
  11. [Wang2002] Yalin Wang and Jianying Hu: Detecting tables in HTML documents. In: Document Analysis Systems V. Springer Berlin Heidelberg, 2002.
  12. [Chen2013] Zhe Chen and Michael Cafarella: Automatic web spreadsheet data extraction. In Proceedings of the 3rd International Workshop on Semantic Search Over the Web. 2013.
  13. [Bizer2014] Christian Bizer: Search Joins with the Web. Invited Lecture at the International Conference on Database Theory (ICDT2014). 2014.