Statistics and Data Analysis
Petar Ristoski
Oliver Lehmberg
Heiko Paulheim
Christian Bizer



This page provides basic statistics about the subset of relational Web tables in the WDC table corpus that originate from top-level domains likely to provide English-language content. In addition, small tables that have fewer than 3 columns or fewer than 5 rows are filtered out. The subset consists of 35,784,055 tables out of the 147 million Web tables in the overall corpus.

1. Identifying English TLD Web Tables

The Web tables in this corpus are extracted from the following top-level domains: "com", "org", "net", "eu" and "uk". Additionally, all tables that have fewer than 3 columns or fewer than 5 rows are filtered out. Instructions for downloading the corpus are provided on the main page.
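
The selection rule can be sketched in a few lines of Python. This is only an illustration, not the extraction code used for the corpus: the function names are hypothetical, and the size thresholds (at least 3 columns and at least 5 data rows) are taken from the minimum values reported in Section 2.2.

```python
from urllib.parse import urlparse

# TLDs listed above as likely to provide English-language content.
ENGLISH_TLDS = {"com", "org", "net", "eu", "uk"}

# Assumed thresholds, matching the minimum sizes reported in Section 2.2.
MIN_COLUMNS = 3
MIN_ROWS = 5


def tld_of(url: str) -> str:
    """Return the last label of the host name, e.g. 'uk' for example.co.uk."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def keep_table(url: str, n_columns: int, n_rows: int) -> bool:
    """Hypothetical filter: English TLD and large enough table."""
    return (
        tld_of(url) in ENGLISH_TLDS
        and n_columns >= MIN_COLUMNS
        and n_rows >= MIN_ROWS
    )
```

For example, keep_table("http://example.co.uk/page", 3, 5) accepts the table, while a table from a "de" domain or a table with only 2 columns is rejected.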

2. Web Tables Statistics

2.1 TLDs Distribution

Figure 1 shows the distribution of extracted Web tables per English top-level domain.

Fig. 1 - Number of tables per English TLD

2.2 Number of Columns and Rows Distribution

The table below provides basic statistics for the tables' size in the complete corpus.

          min.   max.     average   median
columns   3      713      3.49      4
rows      5      35,640   19.52     6

2.2.1 Number of Columns Distribution

Figure 2 shows the distribution of the number of columns per table. Note that the number of rows in the statistics above excludes the header row and thus refers to the data rows of the table.

Fig. 2 - Distribution of Number of Columns per Table


The complete distribution of the number of columns per table can be found here. Each line of the file contains two tab-separated fields, #columns and #tables. For example, the first entry of the file, 3 13261248, means that 13,261,248 tables have exactly 3 columns.

2.2.2 Number of Rows Distribution

Figure 3 shows the distribution of number of data rows per table.
Data rows are all rows of the table that are positioned under the header row and contain at least one non-empty cell.

Fig. 3 - Distribution of Number of Rows per Table

The complete distribution of the number of rows per table can be found here. Each line of the file contains two tab-separated fields, #rows and #tables. For example, the first entry of the file, 5 10534980, means that 10,534,980 tables have exactly 5 data rows.
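
The column and row distribution files share the same two-field layout, so one small reader is enough for both. The following is a minimal sketch, assuming each non-empty line holds exactly two tab-separated integers; the function names are hypothetical and not part of any WDC tooling.

```python
from typing import Dict, Iterable


def parse_distribution(lines: Iterable[str]) -> Dict[int, int]:
    """Parse lines of the form '<size>\\t<#tables>' into {size: #tables}."""
    dist: Dict[int, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        size, n_tables = line.split("\t")
        dist[int(size)] = int(n_tables)
    return dist


def lower_median(dist: Dict[int, int]) -> int:
    """Smallest size such that at least half of all tables are that size or smaller."""
    total = sum(dist.values())
    running = 0
    for size in sorted(dist):
        running += dist[size]
        if running * 2 >= total:
            return size
    raise ValueError("empty distribution")
```

With an opened file, the call would look like parse_distribution(open(path, encoding="utf-8")), where path points to a downloaded distribution file.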

2.3 Headers Distribution

To get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. The heuristic assumes that the column headers are in the first row of the Web table whose number of non-empty cells is at least 80% of the number of cells of the row with the highest number of non-empty cells in the table. The heuristic fails on vertical tables as well as on tables that require more sophisticated header unfolding. We also did not take column name synonyms such as 'population' and 'number of inhabitants' into account. Thus, the numbers presented below should be understood as lower bounds.
With the current approach, we were able to identify a total of 180,405,169 column headers, of which 7,456,837 are distinct.
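
The header heuristic can be illustrated with a short Python sketch. It is not the original implementation. It assumes that a table is a list of rows, that a cell is empty when it is None or contains only whitespace, and that the 80% threshold is measured against the count of non-empty cells in the fullest row; the function name find_header_row is hypothetical.

```python
from typing import List, Optional


def is_empty(cell: Optional[str]) -> bool:
    """Assumption: a cell is empty if it is None or whitespace only."""
    return cell is None or not cell.strip()


def find_header_row(
    table: List[List[Optional[str]]], threshold: float = 0.8
) -> Optional[int]:
    """Return the index of the first sufficiently filled row, or None."""
    counts = [sum(1 for cell in row if not is_empty(cell)) for row in table]
    if not counts or max(counts) == 0:
        return None  # no rows, or only empty rows
    needed = threshold * max(counts)
    for index, count in enumerate(counts):
        if count >= needed:
            return index
    return None
```

In a table whose first row holds only a title in one cell, the title row is skipped and the next fully populated row is returned as the header row.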

Figure 4 shows the number of tables in the corpus that contain some popular column headers.

Fig. 4 - Popular Column Headers

The complete distribution of headers can be found here. Each line of the file contains two tab-separated fields, header and #tables. For example, the first entry of the file, price 1993233, means that 1,993,233 tables contain a column with the header price.

To get a better understanding of which topics are covered in the corpus, we performed a rough matching to the cross-domain knowledge base DBpedia, which is a structured data version of a subset of Wikipedia. We scanned the tables for properties used in DBpedia which are also used as table headers in our dataset. The complete list can be found here. Each line of the file contains two tab-separated fields, DBpediaProperty and #tables. For example, the entry title 967059 means that 967,059 tables contain a column with the header title.
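
One simple way to approximate such a rough matching is an exact, case-insensitive comparison between header strings and property names. The sketch below assumes exactly that; the actual matching procedure is not described on this page, and both the function name and the input property list are hypothetical.

```python
from typing import Dict, Iterable


def match_headers_to_properties(
    header_counts: Dict[str, int], property_names: Iterable[str]
) -> Dict[str, int]:
    """Keep only headers that equal a property name, ignoring case and outer spaces."""
    known = {name.strip().lower() for name in property_names}
    return {
        header: n_tables
        for header, n_tables in header_counts.items()
        if header.strip().lower() in known
    }
```

Given the header counts {"title": 967059, "price": 1993233} and an illustrative property list ["title", "birthPlace"], only the entry for title is kept.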

2.4 Labels Distribution

Most applications working with Web tables assume that the tables are entity-attribute tables and that they contain a string column that provides the name of the described entity (label column). To get an initial insight into the entity coverage of the corpus, we determined the label column of the tables using a simple heuristic and counted value occurrences in the label column of all Web tables. Our heuristic assumes the left-most column that is not a number or a date and has almost unique values to be the label column. [Venetis2010] report an accuracy of 83% using a similar simple heuristic.
Before counting, all values are normalized and stop words are removed. For example, the music album name The Dark Side of the Moon is normalized to dark side moon. While counting the value occurrences, we do not take surface form synonyms (such as 'New York' and 'New York City') into account. Thus, the reported numbers should be understood as lower bounds. In the corpus of Web tables, we were able to identify a total of 676,665,409 label column values, of which 145,010,515 are distinct.
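
The label column heuristic and the value normalization can be sketched as follows. Several details are assumptions made for illustration and are not specified on this page: the stop-word list, the supported date formats, the rule that a column counts as numeric or date when more than half of its values parse as such, and the 0.9 uniqueness threshold standing in for "almost unique". All function names are hypothetical.

```python
import re
from datetime import datetime
from typing import List, Optional

# Illustrative stop-word list; the list actually used is not documented here.
STOP_WORDS = {"a", "an", "and", "in", "it", "of", "the"}
# Illustrative date formats.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_label(value: str) -> str:
    """Lowercase, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def is_number(value: str) -> bool:
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False


def is_date(value: str) -> bool:
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def find_label_column(
    columns: List[List[str]], min_uniqueness: float = 0.9
) -> Optional[int]:
    """Return the index of the left-most mostly textual, almost unique column."""
    for index, values in enumerate(columns):
        filled = [v.strip() for v in values if v and v.strip()]
        if not filled:
            continue
        non_text = sum(1 for v in filled if is_number(v) or is_date(v))
        if non_text * 2 > len(filled):
            continue  # mostly numbers or dates
        if len(set(filled)) / len(filled) >= min_uniqueness:
            return index
    return None
```

Here normalize_label("The Dark Side of the Moon") yields "dark side moon", matching the example above, and a leading rank column of integers is skipped in favour of the first textual column with distinct values.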

Table 1 shows the value coverage for different topics.

Countries | #Tables | Cities | #Tables | Rivers | #Tables | Movies | #Tables | Camera Models | #Tables | Music Albums | #Tables | Footballers | #Tables
france | 52265 | new york | 29562 | mississippi | 59011 | taxi | 2417 | canon eos 40d | 124 | thriller | 1975 | robin van persie | 5880
germany | 45686 | luxembourg | 17726 | lena | 4604 | titanic | 2351 | canon eos 50d | 121 | aftermath | 1492 | cristiano ronaldo | 1033
japan | 41627 | london | 14427 | don | 4381 | avatar | 2337 | canon eos 30d | 117 | twist shout | 1116 | gareth bale | 964
united states | 38607 | berlin | 13141 | yangtze | 2205 | inception | 1186 | canon eos 20d | 117 | like prayer | 1005 | fernando torres | 963
italy | 37851 | dublin | 11734 | mackenzie | 1690 | deception | 837 | nikon d90 | 105 | true blue | 872 | frank lampard | 875
mexico | 34224 | andorra | 11258 | oka | 824 | moulin rouge | 798 | nikon d60 | 93 | yellow submarine | 720 | lionel messi | 809
netherlands | 30793 | madrid | 9962 | vardar | 544 | black knight | 772 | nikon d80 | 93 | like virgin | 710 | thierry henry | 765
austria | 29506 | amsterdam | 8332 | tigris | 535 | fantastic four | 718 | nikon d5000 | 89 | something new | 609 | ronaldo | 763
belgium | 29298 | athens | 7971 | tisa | 468 | minority report | 710 | nikon d200 | 85 | shine light | 559 | paolo maldini | 555
denmark | 26614 | new york city | 3728 | volga | 335 | ice age | 672 | sony cybershot dsc w120 | 78 | dark side moon | 559 | ronaldinho | 453
greece | 24852 | budapest | 3712 | volta | 317 | unfaithful | 670 | nikon d100 | 49 | please please me | 490 | david beckham | 412
ireland | 24666 | brussels | 3488 | loire | 315 | glitter | 438 | canon eos d30 | 48 | one day your life | 357 | sergio ramos | 326
finland | 24506 | copenhagen | 3244 | danube | 224 | joy ride | 389 | canon eos 10d | 47 | abbey road | 348 | zinedine zidane | 295
bulgaria | 23147 | helsinki | 2485 | indus | 220 | just married | 311 | canon eos d60 | 33 | some girls | 312 | xabi alonso | 276
hungary | 22629 | lisbon | 2440 | kama | 219 | from hell | 304 | sony cybershot dsc s3000 | 31 | sticky fingers | 255 | roberto carlos | 275
usa | 20147 | bern | 1397 | rhine | 209 | high crimes | 168 | sony cybershot dsc w510 | 31 | let bleed | 237 | fabio cannavaro | 187
malta | 17745 | bratislava | 1337 | ural | 194 | shallow hal | 163 | sony cybershot dsc w520 | 31 | rubber soul | 197 | rivaldo | 172
croatia | 16466 | kiev | 1213 | sava | 194 | monkeybone | 129 | sony cybershot dsc w570 | 31 | blood dance floor | 174 | roberto baggio | 149
egypt | 15924 | belgrade | 1169 | elbe | 129 | like mike | 114 | olympus e 30 | 16 | exciter | 160 | marco van basten | 129
cyprus | 15722 | bucharest | 1048 | seine | 117 | joe somebody | 104 | olympus e 500 | 14 | black celebration | 158 | oliver kahn | 109

Table 1 - Values Coverage

2.5 Column Data Types Distribution

We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using 5 pre-defined data types: string, numeric, date, boolean and list. Afterwards, the most frequent data type in the column was chosen as the final data type of the column.
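
The two-step procedure (guess a type per value, then take the majority per column) can be sketched as below. The per-value detection rules are assumptions chosen for illustration, since the page does not specify them: the boolean tokens, the date formats, and the list separators are all hypothetical, as are the function names.

```python
from collections import Counter
from datetime import datetime
from typing import List

# Illustrative detection rules; the actual rules are not documented here.
BOOLEAN_TOKENS = {"true", "false", "yes", "no"}
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")
LIST_SEPARATORS = (";", "|")


def guess_value_type(value: str) -> str:
    """Guess one of: boolean, numeric, date, list, string."""
    v = value.strip()
    if v.lower() in BOOLEAN_TOKENS:
        return "boolean"
    try:
        float(v.replace(",", ""))  # tolerate thousands separators
        return "numeric"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(v, fmt)
            return "date"
        except ValueError:
            continue
    for sep in LIST_SEPARATORS:
        parts = [p for p in v.split(sep) if p.strip()]
        if len(parts) >= 2:
            return "list"
    return "string"


def guess_column_type(values: List[str]) -> str:
    """Majority vote over the non-empty values; defaults to 'string'."""
    types = Counter(guess_value_type(v) for v in values if v and v.strip())
    if not types:
        return "string"  # assumption: empty columns are treated as strings
    return types.most_common(1)[0][0]
```

A column such as ["12", "7.5", "n/a", "1,000"] is typed as numeric because three of its four values parse as numbers, even though one value is a string.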

Figure 5 shows the distribution of column data types.

Fig. 5 - Column Data Types Distribution

3. Credits

The extraction of the Web tables corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), an Amazon Web Services in Education Grant award and by the EU FP7 research project PlanetData.
