A Corpus of Web Tables extracted from the Common Crawl
Petar Ristoski
Oliver Lehmberg
Heiko Paulheim
Robert Meusel
Christian Bizer
Alexander Diete
Nicolas Heist
Sascha Krstanovic
Thorsten Andre Knöller


This page provides a large corpus of HTML tables for public download. The corpus has been extracted from the 2012 version of the Common Crawl and contains 147 million relational Web tables. Below, we provide instructions for downloading the corpus as well as basic statistics about the tables' content.

Contents

1. Application Potential of Web Tables
2. The WDC Web Tables Extraction
3. Web Tables Statistics
4. Data Formats and Download
5. Feedback
6. Credits
7. Other Web Table Corpora
8. References

1. Application Potential of Web Tables

The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a fraction of them is quasi-relational, meaning that they contain structured data describing a set of entities. An initial study by Cafarella et al. [Cafarella2008] showed that out of 14 billion HTML tables in the Google crawl, 154 million tables contained relational data (1.1%). In [Crestan2011], the authors provide a fine-grained classification of the different types of HTML tables found on the Web, based on the Bing web crawl.

Recently, a body of interesting research has developed around exploiting relational Web tables for the following applications:

  1. Table Search: A user in need of specific data is likely to find a table containing the desired data somewhere on the Web. [Cafarella2009], [Venetis2010], and [Pimplikar2012] propose methods for finding matching tables given a set of keywords describing the information need. An example of a table search service is Google Table Search, which allows users to search for Web tables as well as public Google Fusion Tables.
  2. Table Extension: [Cafarella2009], [Yakout2012], [Bhagavatula2013], [DasSarma2012], and [Bizer2014] propose methods for extending a local table with additional columns based on the corpus of tables published on the Web. A user having a local table describing countries could, for instance, use these methods to add columns containing the population, total area, and capital of each country to their table.
  3. Knowledge Base Augmentation: The data contained in Web tables can also be used to augment and extend general-purpose knowledge bases such as DBpedia, Freebase, or YAGO. [Zhang2013] propose a method for extending the DBpedia knowledge base with data from Web tables. [Wang2012] propose a method for building a comprehensive knowledge base that uses linguistic patterns to extract a probabilistic taxonomy of classes and entities from Web texts and afterwards adds facts about the attributes of the entities using Web tables.
  4. NLP Tasks: The table corpus could also be useful for various natural language processing tasks, as the values that appear in the same column are often closely related to each other, which can, for example, be exploited for disambiguation.

2. The WDC Web Tables Extraction

In order to facilitate research in the areas mentioned above, we have extracted a corpus of relational Web tables from the 2012 version of the Common Crawl Web corpus. The Common Crawl corpus contains 3.5 billion HTML documents which originate from over 40 million pay-level domains.

[Cafarella2008], [Crestan2011], and [Wang2002] propose classification methods for distinguishing between relational Web tables and other HTML tables. Based on the features proposed in these papers, we implemented a classifier for detecting relational Web tables. The classifier works in two steps. First, it filters out all HTML tables that are not innermost tables (i.e., that contain other tables in their cells) as well as all tables that contain fewer than 5 cells or fewer than 3 rows. This step already discards around 90% of all HTML tables. Second, we parse the remaining tables and classify them as relational or non-relational using a mixture of layout and content features. The layout features include the average and standard deviation of the column count, row count, and cell length, as well as the length consistency of the rows and columns. The content features are the percentages of link, form, image, empty, digit, and text cells, as well as the content type consistency for rows and columns.
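The following is a minimal sketch of the filtering step and a subset of the features, assuming each table has already been parsed into a list of rows of cell strings; the consistency features and the trained classification model itself are omitted, and the helper names are illustrative:

```python
from statistics import mean, pstdev

def passes_step_one(rows, contains_nested_table):
    """Step 1: keep only innermost tables with at least 5 cells and 3 rows."""
    cell_count = sum(len(row) for row in rows)
    return not contains_nested_table and cell_count >= 5 and len(rows) >= 3

def layout_features(rows):
    """Step 2 (excerpt): average and standard deviation of the column count
    and of the cell content length."""
    col_counts = [len(row) for row in rows]
    cell_lens = [len(cell) for row in rows for cell in row]
    return {"avg_cols": mean(col_counts), "std_cols": pstdev(col_counts),
            "avg_cell_len": mean(cell_lens), "std_cell_len": pstdev(cell_lens)}

def content_features(rows):
    """Step 2 (excerpt): fractions of empty, digit-only, and text cells.
    Link, form, and image cells would require the original HTML markup."""
    cells = [cell.strip() for row in rows for cell in row]
    n = len(cells) or 1
    empty = sum(cell == "" for cell in cells)
    digit = sum(cell.isdigit() for cell in cells)
    return {"pct_empty": empty / n, "pct_digit": digit / n,
            "pct_text": (n - empty - digit) / n}
```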

Altogether, we parsed 3,301,089,681 HTML pages from the Common Crawl 2012 web corpus. In these pages we found a total of 11,245,583,889 innermost tables, i.e., an average of 3.4 innermost tables per page. Out of these tables, 147,636,113 were classified as relational tables (1.3%). The tables originate from 101,177,544 different pages.

In order not to miss relational tables, we tuned our classifier for recall at the cost of precision. An evaluation on a test set of several thousand tables showed that only around 60% of the tables classified as relational are actually relational tables (compared to the 80% precision reported in [Cafarella2008]).

We ran the table extraction on Amazon EC2 using 100 AWS c1.xlarge machines, which ran an adapted version of the extraction framework that we also used to extract the WebDataCommons RDFa, Microdata, and Microformat Data Set from the Common Crawl. The extraction and table classification together took around 1,500 machine hours, which cost around 150 USD.

3. Web Tables Statistics

Below, we provide the results of some initial profiling of the WDC Web Tables data set. In addition, we provide statistics about a reduced subset of the corpus that focuses on larger tables on likely English-language webpages in a separate document.

3.1 TLDs Distribution

Figure 1 shows the distribution of extracted Web tables per top-level domain.

Fig. 1 - Number of tables per TLD

The complete distribution of tables per top-level domain can be found here. The file contains two tab-separated fields per line, TLD and #tables. E.g., the first entry of the file, com 75229798, means that 75,229,798 tables were extracted from the "com" domain. All distribution files referenced below share this two-field layout.
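A minimal sketch for reading such a file (the file name is a placeholder for the downloaded distribution file):

```python
import csv

def read_distribution(path):
    """Read a two-field distribution file, e.g. 'com<TAB>75229798',
    and return a dict mapping key -> number of tables."""
    with open(path, encoding="utf-8") as f:
        return {key: int(count) for key, count in csv.reader(f, delimiter="\t")}

# Example: print the five TLDs with the most extracted tables.
tlds = read_distribution("tld_distribution.tsv")  # placeholder file name
for tld, count in sorted(tlds.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(tld, count)
```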

3.2 Number of Columns and Rows Distribution

The table below provides basic statistics on the tables' size in the complete corpus. The row count excludes the header row and thus refers to the data rows of the table.

|         | min. | max.   | average | median |
|---------|------|--------|---------|--------|
| columns | 2    | 2,368  | 3.49    | 3      |
| rows    | 1    | 70,068 | 12.41   | 6      |

3.2.1 Number of Columns Distribution

Figure 2 shows the distribution of the number of columns per table.

Fig. 2 - Distribution of Number of Columns per Table


The complete distribution of the number of columns per table can be found here. The file contains two tab-separated fields per line, #columns and #tables. E.g., the first entry of the file, 2 70147349, means that 70,147,349 tables have exactly two columns.

3.2.2 Number of Rows Distribution

Figure 3 shows the distribution of the number of data rows per table. Data rows are all rows that are positioned below the header row and contain at least one non-empty cell.

Fig. 3 - Distribution of Number of Rows per Table

The complete distribution of the number of rows per table can be found here. The file contains two tab-separated fields per line, #rows and #tables. E.g., the first entry of the file, 1 426104, means that 426,104 tables have exactly one data row.

3.3 Headers Distribution

In order to get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. Our heuristic assumes that the column headers are located in the first row of a table whose number of non-empty cells is at least 80% of the maximum number of non-empty cells found in any row of that table. The heuristic will fail on vertical tables [Crestan2011], on tables that require more sophisticated header unfolding [Chen2013], as well as on tables that do not have headers at all (20% of all tables according to [Pimplikar2012]). We also did not take column name synonyms like 'population' and 'number of inhabitants' into account. Thus, the numbers presented below should be understood as lower bounds.
With this approach, we were able to identify a total of 509,351,189 column headers, out of which 28,072,596 are distinct.
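A minimal sketch of the header heuristic, assuming a non-empty table given as a list of rows of cell strings:

```python
def detect_header_row(rows):
    """Return the index of the first row whose number of non-empty cells is
    at least 80% of the maximum number of non-empty cells in any row."""
    non_empty = [sum(1 for cell in row if cell.strip()) for row in rows]
    threshold = 0.8 * max(non_empty)
    # The densest row itself always qualifies, so next() cannot fail.
    return next(i for i, count in enumerate(non_empty) if count >= threshold)
```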

Figure 4 shows the number of tables in the corpus that contain some popular column headers.

Fig. 4 - Popular Column Headers

The complete distribution of headers can be found here. The file contains two tab-separated fields per line, header and #tables. E.g., the first entry of the file, name 4653155, means that 4,653,155 tables contain a column with the header 'name'.

To get a better understanding of which topics are covered in the corpus, we performed a rough matching against the cross-domain knowledge base DBpedia, which is a structured-data version of a subset of Wikipedia. We scanned the tables for DBpedia properties that also appear as table headers in our dataset. The complete list can be found here. The file contains two tab-separated fields per line, DBpediaProperty and #tables. E.g., the entry title 2121028 means that 2,121,028 tables contain a column with the header 'title'.
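A minimal sketch of such a matching, assuming the DBpedia property labels are available as a set of strings; the trimming and case-folding shown are assumed normalizations, as the exact matching procedure is not described here:

```python
def match_headers(table_headers, dbpedia_property_labels):
    """Return the table headers that equal a DBpedia property label
    after trimming and case-folding (assumed normalization)."""
    labels = {label.strip().lower() for label in dbpedia_property_labels}
    return {header for header in table_headers
            if header.strip().lower() in labels}

# Toy example with a hand-picked property list:
print(match_headers({"Title", "Population", "Foo"}, {"title", "population"}))
# -> {'Title', 'Population'}
```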

3.4 Label Distribution

Most applications working with Web tables assume that the tables are entity-attribute tables containing a string column that provides the name of the described entity (the label column). To get an initial insight into the entity coverage of the corpus, we determined the label column of each table using a simple heuristic and counted the value occurrences in the label columns of all Web tables. Our heuristic assumes the left-most column that contains neither numbers nor dates and has almost unique values to be the label column. [Venetis2010] report an accuracy of 83% using a similarly simple heuristic.
Before counting, all values are normalized and stop-words are removed; e.g., the music album name The Dark Side of the Moon is normalized to dark side moon. While counting the value occurrences, we do not take surface form synonyms (like 'New York' and 'New York City') into account. Thus, the reported numbers should be understood as lower bounds. In the corpus of Web tables, we were able to identify a total of 1,742,015,870 label column values, of which 253,001,795 are distinct.
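A minimal sketch of the label column heuristic and the value normalization; the stop-word list, the type patterns, and the 95% uniqueness threshold are assumptions, as the exact parameters are not specified above:

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "and"}  # illustrative subset

def normalize(value):
    """Lower-case, drop punctuation and stop-words, e.g.
    'The Dark Side of the Moon' -> 'dark side moon'."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

NUM_OR_DATE = re.compile(r"-?\d+([.,]\d+)?|\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}")

def mostly_numbers_or_dates(values):
    """Rough check whether a column is numeric or date-typed."""
    hits = sum(bool(NUM_OR_DATE.fullmatch(v.strip())) for v in values)
    return hits > len(values) / 2

def detect_label_column(columns):
    """Return the index of the left-most column that is neither numeric nor
    a date and whose values are almost unique (assumed 95% threshold)."""
    for index, values in enumerate(columns):
        cells = [v for v in values if v.strip()]
        if not cells or mostly_numbers_or_dates(cells):
            continue
        if len(set(cells)) / len(cells) >= 0.95:
            return index
    return None
```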

Table 1 shows the value coverage for a number of different topics.

| Country | #Tables | City | #Tables | River | #Tables | Movie | #Tables | Camera Model | #Tables | Music Album | #Tables | Footballer | #Tables |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| usa | 135688 | new york | 59398 | mississippi | 87367 | avatar | 11080 | nikon d 200 | 1390 | thriller | 4268 | robin van persie | 7439 |
| germany | 91170 | luxembourg | 47722 | lena | 8717 | inception | 8121 | canon eos 20 d | 480 | aftermath | 2466 | david beckham | 3041 |
| japan | 76512 | berlin | 46850 | don | 6504 | taxi | 6292 | canon eos 40 d | 355 | twist shout | 2017 | cristiano ronaldo | 2927 |
| united states | 73169 | london | 37541 | mackenzie | 3346 | titanic | 4270 | nikon d 5000 | 351 | true blue | 1737 | lionel messi | 1748 |
| italy | 71129 | amsterdam | 31548 | yangtze | 2241 | fantastic four | 2113 | canon eos 30 d | 346 | like prayer | 1616 | ronaldo | 1716 |
| austria | 56622 | madrid | 30486 | oka | 1708 | moulin rouge | 1616 | nikon d 80 | 339 | like virgin | 1414 | gareth bale | 1708 |
| netherlands | 56533 | andorra | 21075 | loire | 1096 | black knight | 1298 | canon eos 50 d | 304 | yellow submarine | 1405 | fernando torres | 1641 |
| mexico | 55267 | dublin | 19790 | tigris | 946 | deception | 1286 | nikon d 90 | 274 | dark side moon | 1201 | frank lampard | 1461 |
| belgium | 53175 | athens | 12228 | volga | 904 | minority report | 1201 | canon eos 10 d | 248 | abbey road | 971 | thierry henry | 1332 |
| ireland | 48543 | budapest | 9702 | sava | 873 | ice age | 1201 | nikon d 60 | 233 | something new | 919 | ronaldinho | 1195 |
| denmark | 48389 | helsinki | 7761 | volta | 710 | unfaithful | 1179 | nikon d 100 | 191 | please please me | 886 | roberto carlos | 817 |
| finland | 45156 | bern | 5839 | vardar | 595 | glitter | 943 | canon eos d 30 | 172 | shine light | 833 | xabi alonso | 735 |
| greece | 42314 | new york city | 5611 | kama | 582 | joy ride | 674 | sony cybershot dsc w120 | 104 | some girls | 801 | oliver kahn | 710 |
| russia | 41729 | brussels | 5305 | tisa | 552 | from hell | 520 | canon eos d 60 | 93 | sticky fingers | 740 | sergio ramos | 647 |
| hungary | 38536 | copenhagen | 4949 | ural | 437 | just married | 459 | sony cybershot dsc s3000 | 67 | one day your life | 711 | paolo maldini | 638 |
| malta | 37009 | bratislava | 4938 | indus | 420 | shallow hal | 265 | sony cybershot dsc w520 | 64 | exciter | 543 | zinedine zidane | 517 |
| bulgaria | 36523 | belgrade | 4460 | elbe | 382 | highn crimes | 247 | sony cybershot dsc w510 | 62 | let bleed | 492 | fabio cannavaro | 348 |
| croatia | 29022 | lisbon | 4194 | danube | 365 | monkeybone | 228 | olympus e 500 | 53 | rubber soul | 464 | rivaldo | 331 |
| egypt | 27725 | kiev | 2406 | rhine | 352 | like mike | 175 | sony cybershot dsc w570 | 45 | blood dance floor | 382 | roberto baggio | 251 |
| cyprus | 25828 | bucharest | 2180 | seine | 225 | joe somebody | 160 | olympus e 30 | 38 | black celebration | 338 | marco van basten | 243 |

Table 1 - Value Coverage

3.5 Column Data Types Distribution

We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each cell value was detected using five pre-defined data types: string, numeric, date, boolean, and list. Afterwards, the most frequent data type among a column's values was chosen as the data type of the column.
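A minimal sketch of this majority vote; the per-type detection patterns are illustrative assumptions, as the exact rules are not specified above:

```python
import re
from collections import Counter

def guess_cell_type(value):
    """Assign one of the five pre-defined data types to a single cell value."""
    v = value.strip().lower()
    if v in ("true", "false", "yes", "no"):
        return "boolean"
    if re.fullmatch(r"-?\d+([.,]\d+)?", v):
        return "numeric"
    if re.fullmatch(r"\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}", v):
        return "date"
    if "," in v or ";" in v:
        return "list"
    return "string"

def guess_column_type(values):
    """Choose the most frequent cell type as the data type of the column."""
    types = Counter(guess_cell_type(v) for v in values if v.strip())
    return types.most_common(1)[0][0] if types else "string"
```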

Figure 5 shows the distribution of column data types.

Fig. 5 - Column Data Types Distribution

4. Data Formats and Download

The main corpus of Web tables is divided into 854,083 gzip files. Each gzip file contains the Web tables extracted from a few thousand Web pages. For each Web page that contains at least one extracted Web table, we provide the corresponding HTML file, the set of extracted Web tables in CSV format, and a JSON file that contains metadata about the extraction of the Web tables. Each JSON file contains the URL of the Web page, a reference to the corresponding HTML file in the gzip file, and information about each of the extracted Web tables. All files that refer to the same Web page share the same file name prefix, e.g. a JSON file with the name 71657325_XXXXXXX.json refers to the HTML file 71657325_YYYYYY and a list of CSV files 71657325_0_ZZZZZZZ.csv, 71657325_1_ZZZZZZZ.csv, etc. For each of the extracted Web tables, the JSON file contains the position of the table inside the HTML file as well as basic statistics about the data in the table. The complete JSON Schema can be found here.
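A minimal sketch for grouping the files of one Web page via this naming convention, assuming an archive has already been unpacked into a local directory (the directory name is a placeholder):

```python
import os
from collections import defaultdict

def group_files_by_page(directory):
    """Group unpacked files by the numeric prefix shared by a page's
    HTML file, JSON metadata file, and CSV tables."""
    pages = defaultdict(list)
    for name in os.listdir(directory):
        prefix = name.split("_", 1)[0]  # e.g. '71657325' in '71657325_0_ZZZ.csv'
        pages[prefix].append(name)
    return pages

# Example: count the extracted tables per page.
for page, files in group_files_by_page("unpacked").items():  # placeholder dir
    tables = [f for f in files if f.endswith(".csv")]
    print(page, "->", len(tables), "tables")
```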

Additionally, we split the main corpus into two smaller corpora based on the top-level domain. The first corpus contains only Web tables that were extracted from English top-level domains, such as "com", "org", "net", "eu", and "uk". The second corpus contains the remaining tables, which were extracted from non-English top-level domains. Each of the corpora consists of 773,883 tar files, where each tar file contains a couple of thousand Web tables in CSV format. The metadata files and HTML files are not included in these corpora. The statistics for the corpus of English Web tables can be found in a separate document.

The corpora of Web tables are hosted on the Amazon Simple Storage Service (S3). Due to the large size of the corpora and the Amazon data transfer fees, we offer the files within a Requester Pays bucket. In a nutshell, you will need an AWS account to download the files using the s3cmd command line tool (version 1.5.0-alpha1 or later, as earlier versions do not properly support the additional header information; see link). Please note: the costs for downloading will be charged by Amazon to your AWS account. Based on the S3 pricing, the estimated download costs for the complete corpus are around 175 USD. When operating on the data within the same Amazon region (US Oregon), access is free of charge.
S3cmd is available as a standard package for almost all UNIX systems (e.g. on Ubuntu: apt-get install s3cmd). You can request all files at once using the command s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://WebTablesExtraction/c*

The following table contains the S3 paths for downloading the datasets.

| Data Set | Amazon S3 Path | Size | #Files |
|---|---|---|---|
| Complete Corpus | s3://WebTablesExtraction/c* | 1019 GB | 854,083 (.gz) |
| Web Tables from English TLDs | s3://SearchJoin-tables/englishTar/ | 29.7 GB | 773,883 (.tar) |
| Web Tables from non-English TLDs | s3://SearchJoin-tables/nonEnglishTar/ | 26.3 GB | 773,883 (.tar) |
You can download free data samples via the following links:

5. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons can be found here.

6. Credits

The extraction of the Web tables corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), by an Amazon Web Services in Education Grant award, and by the EU FP7 research project PlanetData.


7. Other Web Table Corpora

Besides the WDC Web Tables Corpus, there are several other publicly accessible Web table corpora:

  1. DBpedia as Tables: A tabular representation of the DBpedia knowledge base containing Wikipedia infobox data. The instances of each of the 530 classes in the knowledge base are provided as a separate table. Altogether, the tables cover 4 million entities.
  2. WikiTables: A corpus consisting of 15 million non-infobox tables extracted from Wikipedia.
  3. publicdata.eu: A data portal containing references to 46,000 government data sets, which are mostly tabular.
  4. data.gov: A data portal containing 85,000 data sets provided by the US government. The data sets are mostly tabular.
  5. WebDataCommons RDFa, Microdata and Microformat Data Set: An RDF data set consisting of 7 billion RDFa, Microdata, and Microformat triples which originate from 40 million different websites and have been extracted from the 2012 version of the Common Crawl. The data can easily be translated into tables (one table per class and website).
  6. Billion Triples Challenge 2012 Data Set: A corpus consisting of 1.4 billion Linked Data triples that were crawled from around 1,000 different Linked Data sources on the Web. The data can easily be translated into tables (one table per class and data source).

8. References

  1. [Cafarella2008] Michael J. Cafarella, Eugene Wu, Alon Halevy, Yang Zhang, Daisy Zhe Wang: WebTables: exploring the power of tables on the web. VLDB 2008.
  2. [Crestan2011] Eric Crestan and Patrick Pantel: Web-scale table census and classification. WSDM 2011.
  3. [Cafarella2009] Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova: Data integration for the relational web. Proc. VLDB Endow. 2009.
  4. [Venetis2010] Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu: Table Search Using Recovered Semantics. 2010.
  5. [Yakout2012] Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri: InfoGather: entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12), 2012.
  6. [Bhagavatula2013] Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey: Methods for exploring and mining tables on Wikipedia. In Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA '13), 2013.
  7. [Pimplikar2012] Rakesh Pimplikar and Sunita Sarawagi: Answering table queries on the web using column keywords. Proc. VLDB Endow. 5:10, 2012.
  8. [DasSarma2012] Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu: Finding related tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12), 2012.
  9. [Zhang2013] Xiaolu Zhang et al.: Mapping entity-attribute web tables to web-scale knowledge bases. In: Database Systems for Advanced Applications. Springer, 2013.
  10. [Wang2012] Jingjing Wang, Haixun Wang, Zhongyuan Wang, and Kenny Q. Zhu: Understanding tables on the web. In Proceedings of the 31st international conference on Conceptual Modeling (ER'12), 2012.
  11. [Wang2002] Yalin Wang and Jianying Hu: Detecting tables in HTML documents. In: Document Analysis Systems V. Springer Berlin Heidelberg, 2002.
  12. [Chen2013] Zhe Chen and Michael Cafarella: Automatic web spreadsheet data extraction. In Proceedings of the 3rd International Workshop on Semantic Search Over the Web. 2013.
  13. [Bizer2014] Christian Bizer: Search Joins with the Web. Invited Lecture at the International Conference on Database Theory (ICDT2014). 2014.