This page provides a large corpus of HTML tables for public download. The corpus has been extracted from the 2012 version of the Common Crawl and contains 147 million relational Web tables. Below, we give instructions on how to download the corpus and provide basic statistics about the tables' content.
Contents
1. Application Potential of Web Tables
The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a fraction of the tables is also quasi-relational, meaning that they contain structured data describing a set of entities. Initial studies by Cafarella et al. [Cafarella2008] showed that out of 14 billion HTML tables in the Google crawl, 154 million tables contained relational data (1.1%). In [Crestan2011], the authors provide a fine-grained classification of the different types of HTML tables that are found on the Web based on the Bing web crawl.
Recently, a body of interesting research has developed around exploiting relational Web tables for the following applications:
- Table Search: A user in need of some data is likely to find a table containing the desired data somewhere on the Web. [Cafarella2009], [Venetis2010], and [Pimplikar2012] propose methods for finding matching tables given a set of keywords describing the information need. An example of a table search service is Google Table Search, which allows users to search for Web tables as well as public Google Fusion Tables.
- Table Extension: [Cafarella2009], [Yakout2012], [Bhagavatula2013], [DasSarma2012], and [Bizer2014] propose methods for extending a local table with additional columns based on the corpus of tables that are published on the Web. A user having a local table describing countries could, for instance, use these methods to have columns containing the population, total area, and capital of each country added to their table.
- Knowledge Base Augmentation: The data contained in Web tables can also be used to augment and extend general-purpose knowledge bases such as DBpedia, Freebase, or YAGO. [Zhang2013] propose a method for extending the DBpedia knowledge base with data from Web tables. [Wang2012] propose a method for building a comprehensive knowledge base by using linguistic patterns to extract a probabilistic taxonomy of classes and entities from Web texts and afterwards adding facts about the attributes of the entities using Web tables.
- NLP Tasks: The table corpus could also be useful for various natural language processing tasks as the things that appear in the same column are often closely related to each other, which for example can be exploited for disambiguation.
2. The WDC Web Tables Extraction
In order to facilitate research in the areas mentioned above, we have extracted a corpus of relational Web tables from the 2012 version of the Common Crawl Web corpus. The Common Crawl corpus contains 3.5 billion HTML documents which originate from over 40 million pay-level domains.
[Cafarella2008], [Crestan2011], and [Wang2002] propose classification methods for distinguishing relational Web tables from other HTML tables. Based on the features proposed in these papers, we implemented a classifier for detecting relational Web tables. The classifier works in two steps. First, it filters out all HTML tables that are not innermost tables, i.e., that contain other tables in their cells, as well as all tables with fewer than 5 cells or fewer than 3 rows. This step already discards around 90% of all HTML tables. Second, we parse the remaining tables and classify them as relational or non-relational using a mixture of layout and content features. The layout features include the average and the standard deviation of the column count, row count, and cell length. Additionally, we compute the length consistency of the rows and columns. The content features are the percentages of link, form, image, empty, digit, and text cells. We also compute the content type consistency for rows and columns.

Altogether, we parsed 3,301,089,681 HTML pages from the Common Crawl 2012 web corpus. In these pages, we found a total of 11,245,583,889 innermost tables, i.e., an average of 3.4 innermost tables per page. Out of these tables, 147,636,113 were classified as relational tables (1.3%). The tables originate from 101,177,544 different pages.
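The first filtering step can be sketched in a few lines of Python. The snippet below is only an illustration using BeautifulSoup, not the code of our extraction framework; the second, feature-based classification step is omitted.

```python
from bs4 import BeautifulSoup

def is_candidate_table(table):
    """First filtering step: keep only innermost tables that have
    at least 5 cells and at least 3 rows (thresholds as stated above)."""
    # discard tables that contain nested tables in their cells
    if table.find("table") is not None:
        return False
    rows = table.find_all("tr")
    cells = table.find_all(["td", "th"])
    return len(rows) >= 3 and len(cells) >= 5

def candidate_tables(html):
    """Return all tables of an HTML page that survive the first filtering step."""
    soup = BeautifulSoup(html, "html.parser")
    return [t for t in soup.find_all("table") if is_candidate_table(t)]
```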
In order not to miss relational tables, we tuned our classifier for recall at the cost of precision. An evaluation on a test set of several thousand tables showed that only around 60% of the tables classified as relational are actually relational tables (compared to the 80% precision reported in [Cafarella2008]).
We ran the table extraction on Amazon EC2 using 100 AWS c1.xlarge machines, which ran an adapted version of the extraction framework that we also used to extract the WebDataCommons RDFa, Microdata, and Microformat Data Set from the Common Crawl. The extraction and table classification together took around 1,500 machine hours, which cost around 150 USD.
3. Web Tables Statistics
Below, we provide the results of some initial profiling of the WDC Web Tables data set. In addition, we provide statistics about a reduced subset of the corpus that focuses on larger tables on likely English-language webpages in a separate document.
3.1 TLDs Distribution
Figure 1 shows the distribution of extracted Web tables per top-level domain.
Fig. 1 - Number of tables per TLD
The complete distribution of tables per top-level domain can be found here.
The file contains a list of two tab-separated fields, TLD and #tables. E.g. the first entry of the file, "com 75229798", means that 75,229,798 tables were extracted from the "com" domain.
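Since this file and the distribution files in the following sections share the same simple tab-separated format, a few lines of Python suffice to load them. The file name below is a placeholder for the file downloaded via the link above.

```python
import csv

def load_distribution(path):
    """Load a tab-separated distribution file (e.g. TLD and #tables)
    into a dictionary mapping keys to table counts."""
    with open(path, newline="", encoding="utf-8") as f:
        return {key: int(count) for key, count in csv.reader(f, delimiter="\t")}

# placeholder file name; use the file downloaded via the link above
tld_counts = load_distribution("tables_per_tld.tsv")
print(tld_counts.get("com"))  # 75229798 according to the first entry described above
```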
3.2 Number of Columns and Rows Distribution
The table below provides basic statistics for the tables' size in the complete corpus. The row count excludes the header row and thus refers only to the data rows of the table.
| | min. | max. | average | median |
|---|---|---|---|---|
| columns | 2 | 2,368 | 3.49 | 3 |
| rows | 1 | 70,068 | 12.41 | 6 |
3.2.1 Number of Columns Distribution
Figure 2 shows the distribution of number of columns per table.
Fig. 2 - Distribution of Number of Columns per Table
The complete distribution of number of columns per table can be found here.
The file contains a list of two tab-separated fields, #columns and #tables. E.g. the first entry of the file, "2 70147349", means that there are 70,147,349 tables that have exactly two columns.
3.2.2 Number of Rows Distribution
Figure 3 shows the distribution of number of data rows per table. Data rows are all rows of the table that are positioned under the header row and contain at least one non-empty cell.
Fig. 3 - Distribution of Number of Rows per Table
The complete distribution of number of rows per table can be found here.
The file contains a list of two tab-separated fields, #rows and #tables. E.g. the first entry of the file, "1 426104", means that there are 426,104 tables that have exactly one data row.
3.3 Headers Distribution
In order to get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. Our heuristic assumes that the column headers are located in the first row of a Web table whose number of non-empty cells reaches at least 80% of the number of non-empty cells in the row with the most non-empty cells. The heuristic fails on vertical tables [Crestan2011], on tables that require more sophisticated header unfolding [Chen2013], as well as on tables that do not have headers at all (20% of all tables according to [Pimplikar2012]). We also did not take column name synonyms like 'population' and 'number of inhabitants' into account. Thus, the numbers presented below should be understood as lower bounds.
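The following Python sketch is an illustrative re-implementation of the 80% rule described above; it operates on a table given as a list of rows and is not the original extraction code.

```python
def detect_header_row(rows):
    """Return the index of the first row whose number of non-empty cells
    reaches at least 80% of the maximum number of non-empty cells
    found in any row of the table, or None for an empty table."""
    non_empty = [sum(1 for cell in row if cell and cell.strip()) for row in rows]
    if not non_empty or max(non_empty) == 0:
        return None
    threshold = 0.8 * max(non_empty)
    for i, count in enumerate(non_empty):
        if count >= threshold:
            return i
    return None

# example: the header is detected in the first row
table = [["Country", "Population", "Capital"],
         ["Germany", "80,000,000", "Berlin"]]
print(detect_header_row(table))  # 0
```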
With this approach, we were able to identify a total of 509,351,189 column headers, out of which 28,072,596 are distinct.
Fig. 4 - Popular Column Headers
The complete distribution of headers can be found here.
The file contains a list of two tab-separated fields, header and #tables. E.g. the first entry of the file, "name 4653155", means that there are 4,653,155 tables that contain a column with the header name.
To get a better understanding of which topics are covered in the corpus, we performed a rough matching against the cross-domain knowledge base DBpedia, which is a structured-data version of a subset of Wikipedia. We scanned the tables for DBpedia property names that also occur as table headers in our dataset.
The complete list can be found here. The file contains a list of two tab-separated fields, DBpediaProperty and #tables. E.g. the entry "title 2121028" means that there are 2,121,028 tables that contain a column with the header title.
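The matching itself boils down to a simple lookup: for every DBpedia property name, count the tables that use the same string as a column header. Below is a minimal sketch, assuming the property names and the header distribution from above are already loaded into Python data structures.

```python
def match_dbpedia_properties(dbpedia_properties, header_counts):
    """Return, for every DBpedia property name that also occurs as a
    table header, the number of tables using that header.
    dbpedia_properties: set of lower-cased property names;
    header_counts: dict mapping headers to table counts."""
    return {prop: header_counts[prop]
            for prop in dbpedia_properties
            if prop in header_counts}

# illustrative call with values taken from the statistics above
print(match_dbpedia_properties({"title", "genre"}, {"title": 2121028, "name": 4653155}))
# {'title': 2121028}
```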
3.4 Label Distribution
Most applications working with Web tables assume that the tables are entity-attribute tables and that they contain a string column that provides the name of the described entity (label column).
To get an initial insight into the entity coverage of the corpus, we determined the label column of each table using a simple heuristic and counted the value occurrences in the label columns of all Web tables. Our heuristic takes the left-most column that is neither numeric nor a date and whose values are almost unique to be the label column. [Venetis2010] report an accuracy of 83% using a similarly simple heuristic.
Before counting, all values are normalized and stop-words are removed. E.g. the music album name "The Dark Side of the Moon" is normalized to "dark side moon". While counting the value occurrences, we do not take surface form synonyms (like 'New York' and 'New York City') into account. Thus, the reported numbers should be understood as lower bounds.
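A minimal sketch of the label column heuristic and the value normalization follows; the stop-word list is an illustrative subset and the uniqueness threshold is ours, so the original implementation may differ in detail.

```python
import re

STOP_WORDS = {"the", "of", "a", "an"}  # illustrative subset only

def normalize_value(value):
    """Lower-case, strip punctuation and stop-words,
    e.g. 'The Dark Side of the Moon' -> 'dark side moon'."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def detect_label_column(columns, column_types):
    """Return the index of the left-most column that is neither numeric
    nor a date and whose values are almost unique, or None."""
    for i, (values, col_type) in enumerate(zip(columns, column_types)):
        if col_type in ("numeric", "date"):
            continue
        if values and len(set(values)) >= 0.9 * len(values):  # "almost unique"
            return i
    return None
```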
In the corpus of Web tables, we were able to identify a total of 1,742,015,870 label column values, of which 253,001,795 are distinct.
Table 1 shows the value coverage for different topics.
Countries | | Cities | | Rivers | | Movies | | Camera Models | | Music Albums | | Footballers | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Name | #Tables | Name | #Tables | Name | #Tables | Name | #Tables | Name | #Tables | Name | #Tables | Name | #Tables |
usa | 135688 | new york | 59398 | mississippi | 87367 | avatar | 11080 | nikon d 200 | 1390 | thriller | 4268 | robin van persie | 7439 |
germany | 91170 | luxembourg | 47722 | lena | 8717 | inception | 8121 | canon eos 20 d | 480 | aftermath | 2466 | david beckham | 3041 |
japan | 76512 | berlin | 46850 | don | 6504 | taxi | 6292 | canon eos 40 d | 355 | twist shout | 2017 | cristiano ronaldo | 2927 |
united states | 73169 | london | 37541 | mackenzie | 3346 | titanic | 4270 | nikon d 5000 | 351 | true blue | 1737 | lionel messi | 1748 |
italy | 71129 | amsterdam | 31548 | yangtze | 2241 | fantastic four | 2113 | canon eos 30 d | 346 | like prayer | 1616 | ronaldo | 1716 |
austria | 56622 | madrid | 30486 | oka | 1708 | moulin rouge | 1616 | nikon d 80 | 339 | like virgin | 1414 | gareth bale | 1708 |
netherlands | 56533 | andorra | 21075 | loire | 1096 | black knight | 1298 | canon eos 50 d | 304 | yellow submarine | 1405 | fernando torres | 1641 |
mexico | 55267 | dublin | 19790 | tigris | 946 | deception | 1286 | nikon d 90 | 274 | dark side moon | 1201 | frank lampard | 1461 |
belgium | 53175 | athens | 12228 | volga | 904 | minority report | 1201 | canon eos 10 d | 248 | abbey road | 971 | thierry henry | 1332 |
ireland | 48543 | budapest | 9702 | sava | 873 | ice age | 1201 | nikon d 60 | 233 | something new | 919 | ronaldinho | 1195 |
denmark | 48389 | helsinki | 7761 | volta | 710 | unfaithful | 1179 | nikon d 100 | 191 | please please me | 886 | roberto carlos | 817 |
finland | 45156 | bern | 5839 | vardar | 595 | glitter | 943 | canon eos d 30 | 172 | shine light | 833 | xabi alonso | 735 |
greece | 42314 | new york city | 5611 | kama | 582 | joy ride | 674 | sony cybershot dsc w120 | 104 | some girls | 801 | oliver kahn | 710 |
russia | 41729 | brussels | 5305 | tisa | 552 | from hell | 520 | canon eos d 60 | 93 | sticky fingers | 740 | sergio ramos | 647 |
hungary | 38536 | copenhagen | 4949 | ural | 437 | just married | 459 | sony cybershot dsc s3000 | 67 | one day your life | 711 | paolo maldini | 638 |
malta | 37009 | bratislava | 4938 | indus | 420 | shallow hal | 265 | sony cybershot dsc w520 | 64 | exciter | 543 | zinedine zidane | 517 |
bulgaria | 36523 | belgrade | 4460 | elbe | 382 | highn crimes | 247 | sony cybershot dsc w510 | 62 | let bleed | 492 | fabio cannavaro | 348 |
croatia | 29022 | lisbon | 4194 | danube | 365 | monkeybone | 228 | olympus e 500 | 53 | rubber soul | 464 | rivaldo | 331 |
egypt | 27725 | kiev | 2406 | rhine | 352 | like mike | 175 | sony cybershot dsc w570 | 45 | blood dance floor | 382 | roberto baggio | 251 |
cyprus | 25828 | bucharest | 2180 | seine | 225 | joe somebody | 160 | olympus e 30 | 38 | black celebration | 338 | marco van basten | 243 |
Table 1 - Value Coverage
3.5 Column Data Types Distribution
We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using 5 pre-defined data types: string, numeric, date, boolean, and list. Afterwards, the most frequent data type among the column's values was chosen as the data type of the column.
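A sketch of the majority vote in Python is shown below; the per-value detection rules are deliberately simplified and not the rules of the actual implementation.

```python
import re
from collections import Counter

def guess_value_type(value):
    """Very rough per-value type detection for the five pre-defined types."""
    v = value.strip().lower()
    if v in ("true", "false", "yes", "no"):
        return "boolean"
    if re.fullmatch(r"-?\d+([.,]\d+)?", v):
        return "numeric"
    if re.fullmatch(r"\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}", v):
        return "date"
    if "," in v or ";" in v:
        return "list"
    return "string"

def guess_column_type(values):
    """Choose the most frequent value type as the type of the column."""
    counts = Counter(guess_value_type(v) for v in values if v.strip())
    return counts.most_common(1)[0][0] if counts else "string"

print(guess_column_type(["1", "2", "3", "n/a"]))  # numeric
```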
Figure 5 shows the distribution of column data types.
Fig. 5 - Column Data Types Distribution
4. Data Formats and Download
The main corpus of Web tables is divided into 854,083 gzip files. Each gzip file contains the Web tables extracted from a few thousand Web pages. For each Web page that contains at least one content Web table, we provide the corresponding HTML file, the set of extracted Web tables in CSV format, and a JSON file that contains meta data about the extraction of the Web tables. Each JSON file contains the URL of the Web page, a reference to the corresponding HTML file in the gzip file, and information about each of the extracted Web tables. All files that refer to the same Web page share the same file name prefix, e.g. a JSON file with the name 71657325_XXXXXXX.json refers to the HTML file 71657325_YYYYYY and a list of CSV files 71657325_0_ZZZZZZZ.csv, 71657325_1_ZZZZZZZ.csv, etc.
For each of the extracted Web tables, the JSON file contains the position of the table inside the HTML file, and basic statistics for the data in the Web tables. The complete JSON Schema can be found here.
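A minimal sketch for inspecting one of the meta data files is given below; no specific field names are assumed here, as the authoritative structure is defined by the JSON Schema linked above.

```python
import json

def inspect_metadata(path):
    """Load a meta data JSON file and print its top-level keys and value
    types; consult the JSON Schema for the exact field structure."""
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    for key, value in meta.items():
        print(key, type(value).__name__)
    return meta

# placeholder file name following the naming scheme described above
meta = inspect_metadata("71657325_XXXXXXX.json")
```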
Additionally, we split the main corpus into two smaller corpora based on the top-level domain. The first corpus contains only Web tables that were extracted from English top-level domains, such as "com", "org", "net", "eu", and "uk". The second corpus contains the remaining tables, which were extracted from non-English top-level domains. Each of the corpora consists of 773,883 tar files, where each tar file contains a few thousand Web tables in CSV format. The meta data files and the HTML files are not included in these corpora. The statistics for the corpus of English Web tables can be found in a separate document.
The corpora of Web tables are hosted using the Amazon Simple Storage Service (S3). Due to the large size of the corpora and the Amazon data transfer fees, we offer the files within a Requester Pays bucket. In a nutshell, you will need an AWS account to download the files using the s3cmd command line tool (version 1.5.0-alpha1 or later, as earlier versions do not properly support the additional header information, see link). Please note: the costs for downloading will be charged by Amazon to your AWS account. Based on the S3 pricing, the estimated download costs for the complete corpus are around 175 USD. When operating on the data within the same Amazon region (US Oregon), access is free of charge.

s3cmd is available as a standard package for almost all UNIX systems (e.g. on Ubuntu: apt-get install s3cmd). You can request all files at once using the command s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://WebTablesExtraction/c*
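Alternatively, individual files can be fetched programmatically, for example with the boto3 library; the object key below is a placeholder, and, as noted above, downloads from the Requester Pays bucket are billed to your own AWS account.

```python
import boto3

s3 = boto3.client("s3")  # uses the credentials of your AWS account

# placeholder key; list the bucket contents first to obtain the real file names
response = s3.get_object(
    Bucket="WebTablesExtraction",
    Key="placeholder-file-name.gz",
    RequestPayer="requester",  # required for Requester Pays buckets
)
with open("table-archive.gz", "wb") as f:
    f.write(response["Body"].read())
```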
The following table contains the S3 paths for downloading the datasets.
Data Set | Amazon S3 Path | Size | #Files |
---|---|---|---|
Complete Corpus | s3://WebTablesExtraction/c* | 1019 GB | 854,083 (.gz) |
Web Tables from English TLDs | s3://SearchJoin-tables/englishTar/ | 29.7 GB | 773,883 (.tar) |
Web Tables from non-English TLDs | s3://SearchJoin-tables/nonEnglishTar/ | 26.3 GB | 773,883 (.tar) |
5. Feedback
Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.
More information about Web Data Commons can be found here.
6. Credits
The extraction of the Web tables corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), an Amazon Web Services in Education Grant award and by the EU FP7 research project PlanetData.
7. Other Web Table Corpora
Besides the WDC Web Tables Corpus, there are several other publicly accessible Web table corpora:
- DBpedia as Tables: Tabular representation of the DBpedia knowledge base containing Wikipedia infobox data. The instances of each of the 530 classes in the knowledge base are provided as a separate table. Altogether the tables cover 4 million entities.
- WikiTables: Corpus consisting of 15 million non-infobox tables extracted from Wikipedia.
- publicdata.eu: Data portal containing references to 46,000 government data sets which are mostly tabular.
- data.gov: Data portal containing 85,000 data sets provided by the US government. The data sets are mostly tabular.
- WebDataCommons RDFa, Microdata and Microformat Data Set: RDF data set consisting of 7 billion RDFa, Microdata, and Microformat triples which originate from 40 million different websites and have been extracted from the 2012 version of the Common Crawl. The data can easily be translated into tables (one table per class and website).
- Billion Triples Challenge 2012 Data Set: Corpus consisting of 1.4 billion Linked Data triples that were crawled from around 1000 different Linked Data sources on the Web. The data can be easily translated into tables (one table per class and data source).
8. References
- [Cafarella2008] Michael J. Cafarella, Eugene Wu, Alon Halevy, Yang Zhang, Daisy Zhe Wang: WebTables: exploring the power of tables on the web. VLDB 2008.
- [Crestan2011] Eric Crestan and Patrick Pantel: Web-scale table census and classification. WSDM 2011.
- [Cafarella2009] Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova: Data integration for the relational web. Proc. VLDB Endow. 2009.
- [Venetis2010] Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu: Table Search Using Recovered Semantics. 2010.
- [Yakout2012] Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri: InfoGather: entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12), 2012.
- [Bhagavatula2013] Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey: Methods for exploring and mining tables on Wikipedia. In Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA '13), 2013.
- [Pimplikar2012] Rakesh Pimplikar and Sunita Sarawagi: Answering table queries on the web using column keywords. Proc. VLDB Endow. 5:10, 2012.
- [DasSarma2012] Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu: Finding related tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12), 2012.
- [Zhang2013] Xiaolu Zhang, et al.: Mapping entity-attribute web tables to web-scale knowledge bases. In: Database Systems for Advanced Applications. Springer, 2013.
- [Wang2012] Jingjing Wang, Haixun Wang, Zhongyuan Wang, and Kenny Q. Zhu: Understanding tables on the web. In Proceedings of the 31st international conference on Conceptual Modeling (ER'12), 2012.
- [Wang2002] Yalin Wang and Jianying Hu: Detecting tables in HTML documents. In: Document Analysis Systems V. Springer Berlin Heidelberg, 2002.
- [Chen2013] Zhe Chen and Michael Cafarella: Automatic web spreadsheet data extraction. In Proceedings of the 3rd International Workshop on Semantic Search Over the Web. 2013.
- [Bizer2014] Christian Bizer: Search Joins with the Web. Invited Lecture at the International Conference on Database Theory (ICDT2014). 2014.