This page provides basic statistics about the subset of relational Web tables in the WDC table corpus that originate from top-level domains likely to provide English-language content. In addition, small tables that have fewer than 3 columns or fewer than 5 rows are filtered out. The subset consists of 35,784,055 tables out of the 147 million Web tables in the overall corpus.
1. Identifying English TLD Web Tables
The Web tables in this corpus are extracted from the following top-level domains: "com", "org", "net", "eu" and "uk". Additionally, all tables that have fewer than 3 columns or fewer than 5 rows are filtered out. Instructions for downloading the corpus are provided on the main page.
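The selection criteria can be sketched as a simple predicate. This is a minimal illustration, not the extraction code used for the corpus: the function names, the representation of a table as a URL plus a list of rows, and the choice to take the last label of the host name as the TLD are all assumptions made for this example.

```python
from urllib.parse import urlparse

# TLDs and size thresholds as listed in the text above.
ENGLISH_TLDS = {"com", "org", "net", "eu", "uk"}
MIN_COLUMNS = 3
MIN_ROWS = 5


def top_level_domain(url):
    """Return the last label of the URL's host name (assumed TLD definition)."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def keep_table(url, rows):
    """Hypothetical filter: English TLD, at least 3 columns and at least 5 rows."""
    if top_level_domain(url) not in ENGLISH_TLDS:
        return False
    # Use the widest row as the column count of the table.
    n_columns = max((len(row) for row in rows), default=0)
    return n_columns >= MIN_COLUMNS and len(rows) >= MIN_ROWS
```

For example, `keep_table("http://example.co.uk/x", rows)` accepts a table with five three-cell rows, while the same table served from a "de" host is rejected.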
2. Web Tables Statistics
2.1 TLDs Distribution
Figure 1 shows the distribution of extracted Web tables per English top-level domain.
Fig. 1 - Number of tables per English TLD
2.2 Number of Columns and Rows Distribution
The table below provides basic statistics for the tables' size in the complete corpus.
| | min. | max. | average | median |
|---|---|---|---|---|
| columns | 3 | 713 | 3.49 | 4 |
| rows | 5 | 35,640 | 19.52 | 6 |
2.2.1 Number of Columns Distribution
Figure 2 shows the distribution of the number of columns per table. Note that the row counts reported in the table above exclude the header row and thus refer to the data rows of the table.
Fig. 2 - Distribution of Number of Columns per Table
The complete distribution of number of columns per table can be found here.
The file contains a list of two tab-separated fields, #columns and #tables. E.g. the first entry of the file, "3 13261248", means that there are 13261248 tables that have exactly 3 columns.
2.2.2 Number of Rows Distribution
Figure 3 shows the distribution of number of data rows per table. Data rows are all rows of the table that are positioned under the header row and contain at least one non-empty cell.
Fig. 3 - Distribution of Number of Rows per Table
The complete distribution of number of rows per table can be found here.
The file contains a list of two tab-separated fields, #rows and #tables. E.g. the first entry of the file, "5 10534980", means that there are 10534980 tables that have exactly 5 data rows.
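The distribution files for columns and rows share the same two-field layout, so summary statistics such as those in Section 2.2 can be recomputed from them. The sketch below assumes one "value, tab, count" pair per line; the function names are hypothetical, the sample data is invented for illustration, and the lower median is used when the total count is even.

```python
import io


def read_distribution(lines):
    """Parse 'value<TAB>#tables' lines into a list of (value, count) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        value, count = line.split("\t")
        pairs.append((int(value), int(count)))
    return pairs


def summarize(pairs):
    """Return (min, max, mean, median) of the values, weighted by table count."""
    pairs = sorted(pairs)
    total = sum(count for _, count in pairs)
    mean = sum(value * count for value, count in pairs) / total
    # Lower median: first value at which the cumulative count reaches half the total.
    seen = 0
    median = pairs[-1][0]
    for value, count in pairs:
        seen += count
        if 2 * seen >= total:
            median = value
            break
    return pairs[0][0], pairs[-1][0], mean, median


# Invented toy data, not taken from the corpus files.
sample = io.StringIO("5\t3\n6\t2\n10\t1\n")
print(summarize(read_distribution(sample)))
```

Any iterable of lines works as input, so an opened distribution file can be passed to `read_distribution` in place of the in-memory sample.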
2.3 Headers Distribution
To get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. The heuristic assumes that the column headers are in the first row whose number of non-empty cells is at least 80% of that of the row with the highest number of non-empty cells in the table. The heuristic fails on vertical tables as well as on tables that require more sophisticated header unfolding. We also did not take column name synonyms like 'population' and 'number of inhabitants' into account. Thus, the numbers presented below should be understood as lower bounds.
With the current approach we were able to identify a total of 180,405,169 column headers, of which 7,456,837 are distinct.
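The header heuristic can be illustrated with a short sketch. It follows one reading of the description above (first row with at least 80% as many non-empty cells as the fullest row); the function names, the treatment of whitespace-only cells as empty, and the representation of a table as a list of rows are assumptions, not the code used to produce the statistics.

```python
def non_empty_count(row):
    """Count the cells that contain something other than whitespace."""
    return sum(1 for cell in row if cell is not None and str(cell).strip())


def find_header_row(rows, threshold=0.8):
    """Return the index of the first sufficiently filled row, or None.

    A row qualifies if it has at least `threshold` times as many
    non-empty cells as the fullest row of the table.
    """
    if not rows:
        return None
    counts = [non_empty_count(row) for row in rows]
    fullest = max(counts)
    if fullest == 0:
        return None
    for index, count in enumerate(counts):
        if count >= threshold * fullest:
            return index
    return None
```

A table whose first row holds only a caption in a single cell is handled as intended: the caption row is skipped and the first well-filled row is returned as the header row.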
Fig. 4 - Popular Column Headers
The complete distribution of headers can be found here.
The file contains a list of two tab-separated fields, header and #tables. E.g. the first entry of the file, "price 1993233", means that there are 1993233 tables that contain a column with the header "price".
To get a better understanding of which topics are covered in the corpus, we performed a rough matching to the cross-domain knowledge base DBpedia, which is a structured data version of a subset of Wikipedia. We scanned the tables for properties used in DBpedia that are also used as table headers in our dataset.
The complete list can be found here.
The file contains a list of two tab-separated fields, DBpediaProperty and #tables. E.g. the entry "title 967059" means that there are 967059 tables that contain a column with the header "title".
2.4 Labels Distribution
Most applications working with Web tables assume that the tables are entity-attribute tables and that they contain a string column that provides the name of the described entity (label column).
To get an initial insight into the entity coverage of the corpus, we determined the label column of the tables using a simple heuristic and counted value occurrences in the label column of all Web tables. Our heuristic takes the left-most column that is not a number or a date and has almost unique values to be the label column. [Venetis2010] report an accuracy of 83% using a similar simple heuristic.
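A label column heuristic of this kind might look as follows. This is only a sketch under stated assumptions: the source does not say how numbers and dates are recognized or what "almost unique" means, so the date formats, the 0.95 uniqueness threshold, the majority rule for deciding that a column is a string column, and all function names are hypothetical.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")  # assumed, not from the source


def is_number(value):
    """True if the value parses as a float (commas treated as thousands separators)."""
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False


def is_date(value):
    """True if the value matches one of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False


def find_label_column(columns, min_uniqueness=0.95):
    """Return the index of the left-most almost-unique string column, or None."""
    for index, values in enumerate(columns):
        values = [v.strip() for v in values if v and v.strip()]
        if not values:
            continue
        strings = [v for v in values if not is_number(v) and not is_date(v)]
        # Skip columns in which at most half of the values are plain strings.
        if len(strings) * 2 <= len(values):
            continue
        # "Almost unique": share of distinct values reaches the assumed threshold.
        if len(set(values)) / len(values) >= min_uniqueness:
            return index
    return None
```

Here `columns` is a list of columns, each a list of cell strings without the header cell.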
Before counting, all values are normalized and stop words are removed. E.g. the music album name "The Dark Side of the Moon" is normalized to "dark side moon". While counting the value occurrences, we do not take surface form synonyms into account (like 'New York' and 'New York City'). Thus, the reported numbers should be understood as lower bounds.
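The normalization and counting step can be sketched as below. The exact normalization rules and stop-word list are not given in the text, so the small stop-word set, the replacement of punctuation by spaces, and the function name are assumptions chosen to reproduce the album example.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "in", "it", "of", "the"}  # assumed list


def normalize_label(value):
    """Lower-case, replace punctuation by spaces, and drop stop words."""
    tokens = re.sub(r"[\W_]+", " ", value.lower()).split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)


# Invented label values for illustration.
labels = ["The Dark Side of the Moon", "Dark Side Of The Moon", "Abbey Road"]
counts = Counter(normalize_label(label) for label in labels)
```

With this input, both spellings of the album collapse into the single key "dark side moon", while surface form synonyms such as 'New York' and 'New York City' would still be counted separately.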
In the corpus of Web tables we were able to identify a total of 676,665,409 label column values, of which 145,010,515 are distinct.
Table 1 shows the coverage of values from different topics.
Country | #Tables | City | #Tables | River | #Tables | Movie | #Tables | Camera Model | #Tables | Music Album | #Tables | Footballer | #Tables |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
france | 52265 | new york | 29562 | mississippi | 59011 | taxi | 2417 | canon eos 40d | 124 | thriller | 1975 | robin van persie | 5880 |
germany | 45686 | luxembourg | 17726 | lena | 4604 | titanic | 2351 | canon eos 50d | 121 | aftermath | 1492 | cristiano ronaldo | 1033 |
japan | 41627 | london | 14427 | don | 4381 | avatar | 2337 | canon eos 30d | 117 | twist shout | 1116 | gareth bale | 964 |
united states | 38607 | berlin | 13141 | yangtze | 2205 | inception | 1186 | canon eos 20d | 117 | like prayer | 1005 | fernando torres | 963 |
italy | 37851 | dublin | 11734 | mackenzie | 1690 | deception | 837 | nikon d90 | 105 | true blue | 872 | frank lampard | 875 |
mexico | 34224 | andorra | 11258 | oka | 824 | moulin rouge | 798 | nikon d60 | 93 | yellow submarine | 720 | lionel messi | 809 |
netherlands | 30793 | madrid | 9962 | vardar | 544 | black knight | 772 | nikon d80 | 93 | like virgin | 710 | thierry henry | 765 |
austria | 29506 | amsterdam | 8332 | tigris | 535 | fantastic four | 718 | nikon d5000 | 89 | something new | 609 | ronaldo | 763 |
belgium | 29298 | athens | 7971 | tisa | 468 | minority report | 710 | nikon d200 | 85 | shine light | 559 | paolo maldini | 555 |
denmark | 26614 | new york city | 3728 | volga | 335 | ice age | 672 | sony cybershot dsc w120 | 78 | dark side moon | 559 | ronaldinho | 453 |
greece | 24852 | budapest | 3712 | volta | 317 | unfaithful | 670 | nikon d100 | 49 | please please me | 490 | david beckham | 412 |
ireland | 24666 | brussels | 3488 | loire | 315 | glitter | 438 | canon eos d30 | 48 | one day your life | 357 | sergio ramos | 326 |
finland | 24506 | copenhagen | 3244 | danube | 224 | joy ride | 389 | canon eos 10d | 47 | abbey road | 348 | zinedine zidane | 295 |
bulgaria | 23147 | helsinki | 2485 | indus | 220 | just married | 311 | canon eos d60 | 33 | some girls | 312 | xabi alonso | 276 |
hungary | 22629 | lisbon | 2440 | kama | 219 | from hell | 304 | sony cybershot dsc s3000 | 31 | sticky fingers | 255 | roberto carlos | 275 |
usa | 20147 | bern | 1397 | rhine | 209 | high crimes | 168 | sony cybershot dsc w510 | 31 | let bleed | 237 | fabio cannavaro | 187 |
malta | 17745 | bratislava | 1337 | ural | 194 | shallow hal | 163 | sony cybershot dsc w520 | 31 | rubber soul | 197 | rivaldo | 172 |
croatia | 16466 | kiev | 1213 | sava | 194 | monkeybone | 129 | sony cybershot dsc w570 | 31 | blood dance floor | 174 | roberto baggio | 149 |
egypt | 15924 | belgrade | 1169 | elbe | 129 | like mike | 114 | olympus e 30 | 16 | exciter | 160 | marco van basten | 129 |
cyprus | 15722 | bucharest | 1048 | seine | 117 | joe somebody | 104 | olympus e 500 | 14 | black celebration | 158 | oliver kahn | 109 |
Table 1 - Values Coverage
2.5 Column Data Types Distribution
We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using 5 pre-defined data types: string, numeric, date, boolean and list. Afterwards, the most frequent data type in the column was chosen as the final data type of the column.
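The two-step procedure (type each value, then take the majority) can be sketched as follows. Only the five type names and the majority vote come from the text; the per-value detection rules, the date formats, the boolean vocabulary, the list separators, and the fallback to "string" for empty columns are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")  # assumed
BOOLEANS = {"true", "false", "yes", "no"}  # assumed


def guess_value_type(value):
    """Guess one of: boolean, numeric, date, list, string."""
    text = value.strip()
    if text.lower() in BOOLEANS:
        return "boolean"
    try:
        float(text.replace(",", ""))  # commas treated as thousands separators
        return "numeric"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            pass
    # Assumed list syntax: items separated by ';' or '|'.
    if ";" in text or "|" in text:
        return "list"
    return "string"


def guess_column_type(values):
    """Return the most frequent value type among the non-empty cells."""
    votes = Counter(guess_value_type(v) for v in values if v.strip())
    return votes.most_common(1)[0][0] if votes else "string"
```

Because the column type is a majority vote, a numeric column with an occasional placeholder such as "n/a" is still typed as numeric.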
Figure 5 shows the distribution of column data types.
Fig. 5 - Column Data Types Distribution
3. Credits
The extraction of the Web tables corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), an Amazon Web Services in Education Grant award and by the EU FP7 research project PlanetData.


