Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus

In addition to the basic statistics about the August 2012 Web Data Commons corpus, this document presents further statistics and analysis. First, we analyze the popularity of the websites contained in the 2012 release of the Common Crawl as well as their distribution by top-level domain. Afterwards, we analyze the main topical areas of the published data as well as the co-occurrence of popular vocabulary terms within these topical domains.
The data to verify and reproduce the results as well as to perform further investigations is available for download.

1. Popularity of the Websites within the Common Crawl


In order to get an impression of the coverage of the Common Crawl, we matched the Alexa top 1 million sites list against the pay-level-domains (PLDs) covered by the Common Crawl. In addition, we calculated the percentage of the popular websites that contain structured data. The table below shows that frequently visited websites are more likely to contain structured data than less popular ones. For instance, about 47% of the Top 10,000 websites covered by the crawl embed structured data.

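| Alexa Top # PLDs | Alexa PLDs in CC (#) | Alexa PLDs in CC (% of Alexa list) | PLDs from WDC Aug 2012 included in Alexa (#) | Overall (%) | RDFa (%) | Microdata (%) | Microformats (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 99 | 99.00 | 74 | 74.75 | 34.34 | 55.56 | 68.69 |
| 1,000 | 963 | 96.30 | 603 | 62.62 | 40.08 | 31.67 | 46.11 |
| 10,000 | 9,294 | 92.94 | 4,400 | 47.34 | 30.47 | 15.55 | 29.75 |
| 100,000 | 85,058 | 85.01 | 27,167 | 31.94 | 16.46 | 7.20 | 20.07 |
| 1,000,000 | 734,882 | 73.49 | 151,076 | 20.56 | 7.55 | 3.04 | 14.18 |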
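The matching described above amounts to a set intersection per rank cutoff. The following is a minimal sketch under stated assumptions: the Alexa list is available as a rank-ordered list of PLDs, and the crawled PLDs and the PLDs with structured data are available as sets. The function name `coverage_by_rank` and the toy data are hypothetical and are not part of the Web Data Commons tooling.

```python
def coverage_by_rank(alexa_ranked, cc_plds, structured_plds, cutoffs):
    """For each cutoff N, report how many of the top-N Alexa PLDs appear in
    the crawl and which share of those embed structured data."""
    rows = []
    for n in cutoffs:
        top_n = set(alexa_ranked[:n])
        in_cc = top_n & cc_plds                # Alexa PLDs covered by the crawl
        with_data = in_cc & structured_plds    # ... that also carry structured data
        rows.append({
            "top": n,
            "in_cc": len(in_cc),
            "in_cc_pct": round(100 * len(in_cc) / len(top_n), 2),
            "with_data": len(with_data),
            "with_data_pct": round(100 * len(with_data) / len(in_cc), 2) if in_cc else 0.0,
        })
    return rows


# Toy data for illustration only (not taken from the corpus).
alexa = ["a.com", "b.org", "c.de", "d.net"]
crawl = {"a.com", "b.org", "d.net", "x.info"}
structured = {"a.com", "d.net"}
rows = coverage_by_rank(alexa, crawl, structured, cutoffs=[2, 4])
for row in rows:
    print(row)
```

Note that the structured-data percentage is computed relative to the Alexa PLDs found in the crawl, not relative to the full Alexa list, which matches the table: 4,400 of 9,294 is 47.34%.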

2. Distribution by Top-Level-Domain


The following table shows the distribution of the number of websites (pay-level-domains, PLDs) for the top 20 top-level domains (TLDs) within the Common Crawl (CC). The table also contains the absolute and relative number of PLDs serving structured data as well as using specific formats. The table shows that RDFa, Microdata, and Microformats are used within all TLDs. The deployment on generic TLDs, such as 'com', 'net' or 'info', seems to be slightly stronger than the deployment on country-specific TLDs.

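| Rank | TLD | PLDs in CC (#) | PLDs publishing structured data (#) | PLDs publishing structured data (%) | RDFa (#) | RDFa (%) | Microdata (#) | Microdata (%) | Microformats (#) | Microformats (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | com | 19,950,689 | 1,317,757 | 6.61 | 292,186 | 1.46 | 75,955 | 0.38 | 1,043,272 | 5.23 |
| 2 | de | 2,810,040 | 79,366 | 2.82 | 15,769 | 0.56 | 4,760 | 0.17 | 64,358 | 2.29 |
| 3 | net | 2,203,474 | 145,547 | 6.61 | 27,041 | 1.23 | 7,958 | 0.36 | 120,339 | 5.46 |
| 4 | org | 2,064,960 | 152,977 | 7.41 | 30,231 | 1.46 | 5,854 | 0.28 | 125,490 | 6.08 |
| 5 | co.uk | 1,448,245 | 64,043 | 4.42 | 14,456 | 1.00 | 4,658 | 0.32 | 48,963 | 3.38 |
| 6 | nl | 951,484 | 28,820 | 3.03 | 8,413 | 0.88 | 2,718 | 0.29 | 20,352 | 2.14 |
| 7 | ru | 699,275 | 27,496 | 3.93 | 4,144 | 0.59 | 2,202 | 0.31 | 22,439 | 3.21 |
| 8 | info | 663,451 | 46,633 | 7.03 | 5,590 | 0.84 | 2,225 | 0.34 | 40,711 | 6.14 |
| 9 | it | 620,726 | 20,068 | 3.23 | 6,478 | 1.04 | 1,647 | 0.27 | 13,905 | 2.24 |
| 10 | com.br | 501,720 | 20,418 | 4.07 | 5,335 | 1.06 | 1,288 | 0.26 | 15,534 | 3.10 |
| 11 | com.au | 428,164 | 21,400 | 5.00 | 6,474 | 1.51 | 2,157 | 0.50 | 14,194 | 3.32 |
| 12 | fr | 425,204 | 29,794 | 7.01 | 6,441 | 1.51 | 3,397 | 0.80 | 22,073 | 5.19 |
| 13 | ch | 390,336 | 8,659 | 2.22 | 2,245 | 0.58 | 486 | 0.12 | 6,437 | 1.65 |
| 14 | pl | 382,670 | 15,524 | 4.06 | 3,951 | 1.03 | 1,414 | 0.37 | 11,309 | 2.96 |
| 15 | cz | 368,429 | 11,271 | 3.06 | 4,970 | 1.35 | 693 | 0.19 | 6,361 | 1.73 |
| 16 | ca | 319,055 | 20,938 | 6.56 | 5,247 | 1.64 | 1,632 | 0.51 | 15,343 | 4.81 |
| 17 | jp | 288,267 | 14,248 | 4.94 | 6,206 | 2.15 | 415 | 0.14 | 8,403 | 2.92 |
| 18 | se | 286,740 | 16,649 | 5.81 | 3,614 | 1.26 | 1,112 | 0.39 | 13,082 | 4.56 |
| 19 | eu | 259,105 | 11,429 | 4.41 | 2,960 | 1.14 | 726 | 0.28 | 8,484 | 3.27 |
| 20 | dk | 247,693 | 10,766 | 4.35 | 2,638 | 1.07 | 1,180 | 0.48 | 7,663 | 3.09 |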
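One way to check the comparison between generic and country-specific TLDs is to pool the PLD counts of each group and compare the resulting deployment rates. The sketch below copies the first two count columns from the table above. Which TLDs count as "generic" is our own assumption ('com', 'net', 'org', 'info'); all other rows, including 'eu', are treated as the second group. The helper name `deployment_rate` is hypothetical.

```python
# (TLD, PLDs in CC, PLDs publishing structured data), copied from the table above.
TLD_ROWS = [
    ("com", 19_950_689, 1_317_757), ("de", 2_810_040, 79_366),
    ("net", 2_203_474, 145_547), ("org", 2_064_960, 152_977),
    ("co.uk", 1_448_245, 64_043), ("nl", 951_484, 28_820),
    ("ru", 699_275, 27_496), ("info", 663_451, 46_633),
    ("it", 620_726, 20_068), ("com.br", 501_720, 20_418),
    ("com.au", 428_164, 21_400), ("fr", 425_204, 29_794),
    ("ch", 390_336, 8_659), ("pl", 382_670, 15_524),
    ("cz", 368_429, 11_271), ("ca", 319_055, 20_938),
    ("jp", 288_267, 14_248), ("se", 286_740, 16_649),
    ("eu", 259_105, 11_429), ("dk", 247_693, 10_766),
]

# Assumption: this grouping is ours and is not defined in the source.
GENERIC_TLDS = {"com", "net", "org", "info"}


def deployment_rate(rows):
    """Pooled share (in percent) of PLDs that publish structured data."""
    total = sum(plds for _, plds, _ in rows)
    structured = sum(with_data for _, _, with_data in rows)
    return 100 * structured / total


generic_rate = deployment_rate([r for r in TLD_ROWS if r[0] in GENERIC_TLDS])
other_rate = deployment_rate([r for r in TLD_ROWS if r[0] not in GENERIC_TLDS])
print(f"generic: {generic_rate:.2f}%  other: {other_rate:.2f}%")
```

Pooling weights each TLD by its size, so 'com' dominates the generic group; averaging the per-TLD percentages instead would weight all TLDs equally and may give a different picture.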

3. Main Topical Areas of the Data

Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together.
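The shares quoted above follow directly from the rounded counts. A short sketch of the arithmetic, using only the figures given in the preceding paragraph (the variable names are ours):

```python
# Rounded counts as stated in the text.
pages_with_data, pages_total = 369_000_000, 3_000_000_000
plds_with_data, plds_total = 2_290_000, 40_500_000
rdfa, microdata, microformats = 519_000, 140_000, 1_700_000

page_share = round(100 * pages_with_data / pages_total, 1)       # percent of pages
pld_share = round(100 * plds_with_data / plds_total, 2)          # percent of PLDs
microformat_ratio = round(microformats / (rdfa + microdata), 2)  # Microformats vs. RDFa + Microdata
print(page_share, pld_share, microformat_ratio)
```

With these rounded inputs the ratio comes out at about 2.58, which is consistent with the "approximately 2.5 times" stated above.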

The statistics about the Top Classes by domain count show that the most popular topical areas of the published data (in terms of number of websites providing data about a topic) are:

Additional topical domains with smaller but significant adoption include job postings (represented using Microdata) and recipes (represented using a Microformat).

The topics of the data, as well as the formats and vocabularies used to represent the data, seem to be largely determined by the major consumers the data is targeted at: Google, Facebook, Yahoo!, and Bing. For instance, the examined RDFa data is dominated by the Open Graph Protocol promoted by Facebook, while the examined Microdata is dominated by the vocabularies promoted by Google, Yahoo!, and Bing via schema.org.

In the following, we analyze the main topical areas in more detail. The analyses are based on the basic statistics of the August 2012 Web Data Commons corpus and the Website-Class-Property Matrixes.

3.1. Product-related Data


RDFa, Microdata, and Microformats are all used for the representation of product-related data.

Products in RDFa


We identified three RDFa classes, og:"product", dv:Product, and gr:Offering, each of which is used on at least 500 different websites for describing products. og:"product" is the most popular class, being used by more than 19,000 websites.

Products in Microdata


Reviewing all Microdata classes that are used on more than 100 different websites, we identified four classes, schema:Product, schema:Offer, datavoc:Product, and datavoc:Offer, that are frequently used to describe products or offers.
The following table shows the co-occurrences of these classes with other product-related classes on the same website. For instance, 4,308 websites provide product data together with aggregate ratings for these products.


| Class co-occurrence | # PLDs |
| --- | --- |
| {schema:Product} | 16,612 |
| {schema:Product, schema:Offer} | 7,194 |
| {schema:Product, schema:AggregateRating} | 4,308 |
| {schema:Product, schema:Offer, schema:AggregateRating} | 3,226 |
| {schema:Product, datavoc:Product} | 2,810 |
| {schema:Product, schema:Offer, datavoc:Product, datavoc:Offer-aggregate} | 2,701 |
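Co-occurrence counts of this kind can be derived from a mapping of websites to the classes they use. The sketch below assumes that a combination is counted whenever a website uses all listed classes, regardless of which other classes it uses; this reading is our assumption, although it fits the table, where each larger combination has a count no higher than its subsets. The function name, the input layout, and the `.example` domains are hypothetical.

```python
def count_cooccurrence(classes_by_pld, combination):
    """Count PLDs whose set of classes contains every class in `combination`."""
    wanted = frozenset(combination)
    return sum(1 for classes in classes_by_pld.values() if wanted <= classes)


# Toy input for illustration only (not taken from the corpus).
classes_by_pld = {
    "shop-a.example": {"schema:Product", "schema:Offer", "schema:AggregateRating"},
    "shop-b.example": {"schema:Product", "schema:Offer"},
    "shop-c.example": {"schema:Product"},
    "blog-d.example": {"schema:BlogPosting"},
}

product_only = count_cooccurrence(classes_by_pld, {"schema:Product"})
product_offer = count_cooccurrence(classes_by_pld, {"schema:Product", "schema:Offer"})
product_rating = count_cooccurrence(
    classes_by_pld, {"schema:Product", "schema:AggregateRating"}
)
print(product_only, product_offer, product_rating)
```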

In addition to the class co-occurrences, we analyzed which properties are frequently used to describe schema:Products. The table below shows that schema:Product/name, schema:Product/description, schema:Product/image, and schema:Product/offers are the most frequently used properties.

| Property | PLDs total (#) | PLDs total (%) |
| --- | --- | --- |
| schema:Product/name | 14,342 | 86.34 |
| schema:Product/description | 10,297 | 61.99 |
| schema:Product/image | 8,093 | 48.72 |
| schema:Product/offers | 7,545 | 45.42 |
| schema:Offer/price | 6,894 | 41.50 |
| schema:AggregateRating | 4,308 | 25.93 |
| schema:AggregateRating/ratingValue | 3,990 | 24.02 |
| schema:PostalAddress/streetAddress | 3,723 | 22.41 |

Additionally, we analyzed the types of products that are offered by e-shops using Microdata markup.


Products in Microformats


The hListing microformat is used by 4,030 websites to represent product data. Around 80% of these websites also offer information about the price. Around 70% use the properties itemUrl and itemPhoto. Overall we found almost 3,000 websites which offer detailed hListing information (listing, lister, item, price, itemUrl and itemPhoto) within our data set.

3.2. Blog and CMS Metadata

Blog and CMS Data in RDFa

Frequently used classes within the RDFa corpus that are related to blogs and CMS are og:"article", foaf:Document, sioc:Item, sioc:Post, sioct:BlogPost, and sioc:Comment. The following table shows the co-occurrence of those classes.

| Class co-occurrence | # PLDs |
| --- | --- |
| {foaf:document} | 49,252 |
| {foaf:document, foaf:image} | 36,860 |
| {foaf:document, sioc:item} | 33,132 |
| {foaf:image, sioc:item} | 28,799 |
| {foaf:document, foaf:image, sioc:item} | 28,799 |
| {foaf:document, foaf:image, sioc:item, sioc:useraccount} | 15,180 |
| {foaf:document, foaf:image, sioc:item, sioc:useraccount, skos:concept} | 7,981 |

A possible explanation for the high co-occurrence between sioc:Item and foaf:Document is the Drupal 7 CMS. Drupal 7 is a widely used web content management framework which supports RDFa natively and, by default, marks every page as both a sioc:Item and a foaf:Document (see Drupal Blog Post).

Blog and CMS Data in Microdata


The class schema:BlogPosting is used by 25,235 websites, making it the single most used Microdata class in our data set.

3.3. Open Graph Protocol (OGP)


The most frequently used RDFa classes are dominated by the Open Graph Protocol (OGP). For instance, og:"article" is used by 183,046 websites. OGP is developed and promoted by Facebook in order to ease the integration of external content into the social networking platform. In contrast to RDFa, OGP allows the usage of literals instead of URIs to identify classes. For this reason, we add the 'og:' prefix in the tables and mark the literals with quotes.
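To illustrate the literal-based class identification, the sketch below reads the `og:type` value from a page's `meta` elements and formats it in the og:"..." notation used in this document. It uses only Python's standard `html.parser` module and is not the extraction pipeline behind the corpus; the class name `OgTypeParser`, the helper `ogp_label`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class OgTypeParser(HTMLParser):
    """Collect the content of every <meta property="og:type" ...> element."""

    def __init__(self):
        super().__init__()
        self.og_types = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:type" and attributes.get("content"):
            self.og_types.append(attributes["content"])


def ogp_label(literal):
    """Format an OGP type literal in the notation used here, e.g. og:"article"."""
    return f'og:"{literal}"'


# Sample markup for illustration only.
html = (
    '<html><head>'
    '<meta property="og:type" content="article" />'
    '<meta property="og:title" content="Example" />'
    '</head><body></body></html>'
)
parser = OgTypeParser()
parser.feed(html)
labels = [ogp_label(t) for t in parser.og_types]
print(labels)
```

The class is identified by the plain string in the `content` attribute rather than by a URI, which is why the quoted notation is needed to tell OGP types apart from URI-based classes.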

Deployment of Facebook Classes


We found all 9 pre-defined Facebook classes in use within our corpus. The table below shows the number of pay-level-domains on which each Facebook class was found. The first 3 classes are widely used by different websites. Besides those classes, the Facebook website defines 6 other classes, which are not yet widely adopted.


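| Rank | Class | # PLDs |
| --- | --- | --- |
| 1 | og:"article" | 183,046 |
| 2 | og:"blog" | 58,971 |
| 3 | og:"website" | 56,573 |
| 4 | og:"book" | 742 |
| 5 | og:"profile" | 394 |
| 6 | og:"video.movie" | 50 |
| 7 | og:"video.other" | 47 |
| 8 | og:"video.episode" | 20 |
| 9 | og:"video.tv_show" | 18 |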

Facebook Class Co-occurrence


To determine if Facebook classes are used in combination with non-Facebook classes, we analyzed the co-occurrence of the three most frequent OGP classes together with other classes. The table below shows that only a very small fraction of the websites use Facebook classes together with other classes from non-Facebook vocabularies (FOAF being used together with OGP by 2,452 out of 183,046 websites).


| Rank | Class co-occurrence | # PLDs |
| --- | --- | --- |
| 1 | {og:"article"} | 183,046 |
| 2 | {og:"article", og:"blog"} | 34,978 |
| 3 | {og:"article", og:"website"} | 20,706 |
| 4 | {og:"article", foaf:document} | 2,452 |
| 5 | {og:"article", og:"blog", og:"website"} | 2,443 |
| 6 | {og:"article", foaf:image} | 2,126 |

3.4. Job Postings


We identified 552 websites using schema:JobPosting. The co-occurrence analysis between this class and its properties shows that around 50% of the PLDs using schema:JobPosting also use properties to describe the title, description, and location of the job. This means our corpus includes around 250 websites which provide comprehensive job descriptions and could be good starting points for focused deeper crawling.
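Selecting such starting points can be sketched as a filter over the properties used per website. The example assumes a mapping from PLD to the set of properties it uses, written in the class/property notation of this document. The property names follow schema.org (`title`, `description`, `jobLocation`), but treating exactly these three as "comprehensive" is our assumption, and the function name and `.example` domains are hypothetical.

```python
# Assumption: a job description counts as comprehensive if all three are present.
REQUIRED_PROPERTIES = frozenset({
    "schema:JobPosting/title",
    "schema:JobPosting/description",
    "schema:JobPosting/jobLocation",
})


def comprehensive_job_sites(properties_by_pld, required=REQUIRED_PROPERTIES):
    """Return the PLDs that use every required property, sorted by name."""
    return sorted(
        pld for pld, props in properties_by_pld.items() if required <= props
    )


# Toy input for illustration only (not taken from the corpus).
properties_by_pld = {
    "jobs-a.example": {
        "schema:JobPosting/title",
        "schema:JobPosting/description",
        "schema:JobPosting/jobLocation",
        "schema:JobPosting/datePosted",
    },
    "jobs-b.example": {"schema:JobPosting/title"},
    "jobs-c.example": {
        "schema:JobPosting/title",
        "schema:JobPosting/description",
        "schema:JobPosting/jobLocation",
    },
}

seeds = comprehensive_job_sites(properties_by_pld)
print(seeds)
```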

3.5. Recipes


We identified 3,278 websites offering structured data about recipes in general. Over 80% list ingredients for their recipes, and 20% of those sites offer additional information such as durations and nutrition information. In addition to the hRecipe classes, around 40% of the websites use hCard to include information about the author. This means our corpus includes over 600 pages which contain comprehensive information about recipes and could be good starting points for crawling.