Web Data Commons - Web Tables (English Subset)

This page provides basic statistics about the subset of relational Web tables in the WDC table corpus that originate from top-level domains that likely provide English-language content. In addition, small tables that have fewer than 3 columns or fewer than 5 rows are filtered out. The subset consists of 35,784,055 tables out of the 147 million Web tables in the overall corpus.

1. Identifying English TLD Web Tables

The Web tables in this corpus are extracted from the following top-level domains: "com", "org", "net", "eu" and "uk". Additionally, all tables that have fewer than 3 columns or fewer than 5 rows are filtered out. Instructions for downloading the corpus are provided on the main page.
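The selection rule can be sketched as a small predicate. This is only an illustration, not the extraction code used for the corpus: the function names are hypothetical, the TLD is taken as the last dot-separated label of the host name, and the size thresholds (at least 3 columns and at least 5 data rows) are assumed from the minimum values reported in Section 2.2.

```python
from urllib.parse import urlparse

# TLDs listed in Section 1; the thresholds are assumed from the minimums in Section 2.2.
ENGLISH_TLDS = {"com", "org", "net", "eu", "uk"}
MIN_COLUMNS = 3
MIN_ROWS = 5


def top_level_domain(url: str) -> str:
    """Return the last dot-separated label of the URL's host name, lowercased."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def keep_table(url: str, n_columns: int, n_rows: int) -> bool:
    """Hypothetical filter: English TLD and large enough to be kept in the subset."""
    return (
        top_level_domain(url) in ENGLISH_TLDS
        and n_columns >= MIN_COLUMNS
        and n_rows >= MIN_ROWS
    )


if __name__ == "__main__":
    print(keep_table("http://example.co.uk/list.html", 3, 5))
    print(keep_table("http://example.de/list.html", 4, 10))
```

Note that a URL without a scheme yields no host name here, so such a table would be rejected by this sketch.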

2. Web Tables Statistics

2.1 TLDs Distribution

Figure 1 shows the distribution of extracted Web tables per English top-level domain.

Fig. 1 - Number of tables per English TLD

2.2 Number of Columns and Rows Distribution

The table below provides basic statistics for the tables' size in the complete corpus.

	min.	max.	average	median
columns	3	713	3.49	4
rows	5	35 640	19.52	6

2.2.1 Number of Columns Distribution

Figure 2 shows the distribution of the number of columns per table. The number of rows reported in Section 2.2 excludes the header row and thus refers to the data rows of the table.

Fig. 2 - Distribution of Number of Columns per Table

The complete distribution of the number of columns per table can be found here. The file contains a list of two tab-separated fields, #columns and #tables. For example, the first entry of the file, 3 13261248, means that there are 13261248 tables that have exactly 3 columns.
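A file in this two-field layout can be read with a few lines of Python. The sketch below assumes one entry per line with integer values in both fields; the function names are hypothetical, and only the first sample entry comes from this page, while the second is an invented placeholder. The rows file described in Section 2.2.2 uses the same layout.

```python
import io


def read_distribution(lines):
    """Parse lines of the form '<value>\\t<#tables>' into (value, count) pairs."""
    distribution = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value, count = line.split("\t")
        distribution.append((int(value), int(count)))
    return distribution


def total_tables(distribution):
    """Sum the #tables field over all entries."""
    return sum(count for _, count in distribution)


if __name__ == "__main__":
    # First entry as quoted on this page; the second entry is a made-up placeholder.
    sample = io.StringIO("3\t13261248\n4\t10\n")
    dist = read_distribution(sample)
    print(dist[0], total_tables(dist))
```

For the real file, the `io.StringIO` object would be replaced by an open file handle.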

2.2.2 Number of Rows Distribution

Figure 3 shows the distribution of number of data rows per table.
Data rows are all rows of the table that are positioned under the header row and contain at least one non-empty cell.

Fig. 3 - Distribution of Number of Rows per Table

The complete distribution of the number of rows per table can be found here. The file contains a list of two tab-separated fields, #rows and #tables. For example, the first entry of the file, 5 10534980, means that there are 10534980 tables that have exactly 5 data rows.

2.3 Headers Distribution

To get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. The heuristic assumes that the column headers are in the first row of the Web table whose count of non-empty cells is at least 80% of the number of cells in the row with the highest number of non-empty cells in the table. The heuristic fails on vertical tables as well as on tables that require more sophisticated header unfolding. We also did not take column name synonyms like 'population' and 'number of inhabitants' into account. Thus, the numbers presented below should be understood as lower bounds.
With the current approach, we were able to identify a total of 180,405,169 column headers, of which 7,456,837 are distinct.
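The header heuristic can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: a table is represented as a list of rows of strings, a cell counts as empty if it is None or whitespace only, and the 80% threshold is applied to the non-empty cell count of the fullest row. The function names are hypothetical.

```python
def non_empty_count(row):
    """Number of cells in the row that contain something other than whitespace."""
    return sum(1 for cell in row if cell is not None and str(cell).strip())


def find_header_row(table, threshold=0.8):
    """Return the index of the first row that is at least `threshold` as full
    as the fullest row of the table, or None if the table has no content."""
    if not table:
        return None
    counts = [non_empty_count(row) for row in table]
    fullest = max(counts)
    if fullest == 0:
        return None
    for index, count in enumerate(counts):
        if count >= threshold * fullest:
            return index
    return None


if __name__ == "__main__":
    table = [
        ["Prices 2014", "", "", ""],          # caption-like row, mostly empty
        ["name", "price", "colour", "size"],  # expected header row
        ["shirt", "10", "red", ""],
    ]
    print(find_header_row(table))
```

In this example the first row has only one non-empty cell and is skipped, so the second row is selected as the header row.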

Figure 4 shows the number of tables in the corpus that contain some popular column headers.

Fig. 4 - Popular Column Headers

The complete distribution of headers can be found here. The file contains a list of two tab-separated fields, header and #tables. For example, the first entry of the file, price 1993233, means that there are 1993233 tables that contain a column with the header price.

To get a better understanding of which topics are covered in the corpus, we performed a rough matching to the cross-domain knowledge base DBpedia, which is a structured data version of a subset of Wikipedia. We scanned the tables for properties used in DBpedia which are also used as table headers in our dataset. The complete list can be found here. The file contains a list of two tab-separated fields, DBpediaProperty and #tables. For example, the entry title 967059 means that there are 967059 tables that contain a column with the header title.
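A rough matching of this kind could look like the sketch below. It assumes that matching is a case-insensitive exact comparison between header strings and property names, and that each table is counted at most once per property; both are assumptions, as are the function name and the sample data.

```python
from collections import Counter


def count_property_headers(tables_headers, dbpedia_properties):
    """Count, per property name, the tables with at least one matching header."""
    properties = {p.strip().lower() for p in dbpedia_properties}
    counts = Counter()
    for headers in tables_headers:
        # Use a set so a table with a repeated header is counted only once.
        seen = {h.strip().lower() for h in headers}
        counts.update(seen & properties)
    return counts


if __name__ == "__main__":
    # Invented sample headers and property names, for illustration only.
    tables = [["Title", "Price"], ["title", "artist", "title"], ["name"]]
    print(count_property_headers(tables, ["title", "artist"]))
```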

2.4 Labels Distribution

Most applications working with Web tables assume that the tables are entity-attribute tables and that they contain a string column that provides the name of the described entity (label column). To get an initial insight into the entity coverage of the corpus, we determined the label column of the tables using a simple heuristic and counted value occurrences in the label column of all Web tables. Our heuristic assumes the left-most column that is not a number or a date and has almost unique values to be the label column. [Venetis2010] report an accuracy of 83% using a similar simple heuristic.
Before counting, all values are normalized and stop words are removed. For example, the music album name The Dark Side of the Moon is normalized to dark side moon. While counting the value occurrences, we do not take surface form synonyms into account (like 'New York' and 'New York City'). Thus, the reported numbers should be understood as lower bounds. In the corpus of Web tables, we were able to identify a total of 676,665,409 label column values, of which 145,010,515 are distinct.
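The label column heuristic and the normalization step can be sketched as follows. Several details are not specified on this page and are assumptions here: the stop-word list, the regular expressions used to recognize numbers and dates, the rule that a column counts as textual when more than half of its non-empty cells are neither, and the 0.95 uniqueness ratio standing in for "almost unique". All names are hypothetical.

```python
import re

# Assumed stop-word list; the list actually used for the corpus is not given.
STOP_WORDS = {"a", "an", "and", "in", "it", "of", "the"}
NUMBER_RE = re.compile(r"^[+-]?\d+([.,]\d+)*$")
DATE_RE = re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")


def normalize(value):
    """Lowercase, keep alphanumeric tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def is_number_or_date(value):
    v = value.strip()
    return bool(NUMBER_RE.match(v) or DATE_RE.match(v))


def find_label_column(columns, min_unique_ratio=0.95):
    """Return the index of the left-most mostly textual column whose
    normalized values are almost unique, or None if there is none."""
    for index, values in enumerate(columns):
        cells = [v for v in values if v.strip()]
        if not cells:
            continue
        textual = sum(1 for v in cells if not is_number_or_date(v))
        if textual <= len(cells) / 2:
            continue  # column is mostly numbers or dates
        distinct = len({normalize(v) for v in cells})
        if distinct / len(cells) >= min_unique_ratio:
            return index
    return None


if __name__ == "__main__":
    columns = [
        ["1", "2", "3", "4"],
        ["France", "Germany", "Japan", "Italy"],
        ["Europe", "Europe", "Asia", "Europe"],
    ]
    print(find_label_column(columns))
    print(normalize("The Dark Side of the Moon"))
```

The input is given column by column; a table stored row by row would need to be transposed first.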

Table 1 shows the value coverage for different topics.

Countries		Cities		Rivers		Movies		Camera Models		Music Albums		Footballers
Name	#Tables	Name	#Tables	Name	#Tables	Name	#Tables	Name	#Tables	Name	#Tables	Name	#Tables
france	52265	new york	29562	mississippi	59011	taxi	2417	canon eos 40d	124	thriller	1975	robin van persie	5880
germany	45686	luxembourg	17726	lena	4604	titanic	2351	canon eos 50d	121	aftermath	1492	cristiano ronaldo	1033
japan	41627	london	14427	don	4381	avatar	2337	canon eos 30d	117	twist shout	1116	gareth bale	964
united states	38607	berlin	13141	yangtze	2205	inception	1186	canon eos 20d	117	like prayer	1005	fernando torres	963
italy	37851	dublin	11734	mackenzie	1690	deception	837	nikon d90	105	true blue	872	frank lampard	875
mexico	34224	andorra	11258	oka	824	moulin rouge	798	nikon d60	93	yellow submarine	720	lionel messi	809
netherlands	30793	madrid	9962	vardar	544	black knight	772	nikon d80	93	like virgin	710	thierry henry	765
austria	29506	amsterdam	8332	tigris	535	fantastic four	718	nikon d5000	89	something new	609	ronaldo	763
belgium	29298	athens	7971	tisa	468	minority report	710	nikon d200	85	shine light	559	paolo maldini	555
denmark	26614	new york city	3728	volga	335	ice age	672	sony cybershot dsc w120	78	dark side moon	559	ronaldinho	453
greece	24852	budapest	3712	volta	317	unfaithful	670	nikon d100	49	please please me	490	david beckham	412
ireland	24666	brussels	3488	loire	315	glitter	438	canon eos d30	48	one day your life	357	sergio ramos	326
finland	24506	copenhagen	3244	danube	224	joy ride	389	canon eos 10d	47	abbey road	348	zinedine zidane	295
bulgaria	23147	helsinki	2485	indus	220	just married	311	canon eos d60	33	some girls	312	xabi alonso	276
hungary	22629	lisbon	2440	kama	219	from hell	304	sony cybershot dsc s3000	31	sticky fingers	255	roberto carlos	275
usa	20147	bern	1397	rhine	209	high crimes	168	sony cybershot dsc w510	31	let bleed	237	fabio cannavaro	187
malta	17745	bratislava	1337	ural	194	shallow hal	163	sony cybershot dsc w520	31	rubber soul	197	rivaldo	172
croatia	16466	kiev	1213	sava	194	monkeybone	129	sony cybershot dsc w570	31	blood dance floor	174	roberto baggio	149
egypt	15924	belgrade	1169	elbe	129	like mike	114	olympus e 30	16	exciter	160	marco van basten	129
cyprus	15722	bucharest	1048	seine	117	joe somebody	104	olympus e 500	14	black celebration	158	oliver kahn	109

Table 1 - Values Coverage

2.5 Column Data Types Distribution

We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using 5 pre-defined data types: string, numeric, date, boolean and list. Afterwards, the most frequent data type in the column was chosen as the final data type of the column.
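The two-step procedure (guess a type per value, then take the majority per column) can be sketched as below. The per-value rules are assumptions made for illustration: the page names the five types but not how each is recognized, so the regular expressions, the boolean vocabulary, and the list separators are all hypothetical.

```python
import re
from collections import Counter

# Assumed recognition rules; the rules used for the corpus are not specified.
NUMERIC_RE = re.compile(r"^[+-]?(\d+([.,]\d+)?|[.,]\d+)$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
BOOLEANS = {"true", "false", "yes", "no"}


def guess_value_type(value):
    """Guess one of: boolean, date, numeric, list, string."""
    v = value.strip()
    if v.lower() in BOOLEANS:
        return "boolean"
    if DATE_RE.match(v):
        return "date"
    if NUMERIC_RE.match(v):
        return "numeric"
    # Treat two or more separator-delimited items as a list.
    items = [part for part in re.split(r"[;,|]", v) if part.strip()]
    if len(items) >= 2:
        return "list"
    return "string"


def guess_column_type(values):
    """Majority vote over the guessed types of the non-empty cells."""
    votes = Counter(guess_value_type(v) for v in values if v.strip())
    if not votes:
        return "string"  # assumed fallback for empty columns
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(guess_column_type(["12", "7", "n/a", "3.5"]))
```

In the example, three of the four values are guessed as numeric, so the column is typed as numeric even though it contains one string value.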

Figure 5 shows the distribution of column data types.

Fig. 5 - Column Data Types Distribution

3. Credits

The extraction of the Web tables corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), an Amazon Web Services in Education Grant award and by the EU FP7 research project PlanetData.
