Extracting the Hyperlink Graph from the Common Web Crawl
Robert Meusel
Oliver Lehmberg
Christian Bizer
Sebastiano Vigna


This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.

We hope that the graph will be useful for researchers who develop search algorithms, spam detection methods, and graph analysis techniques, as well as for web science research in general.

Contents

1. Levels of Aggregation
2. Data Formats and Download
3. Extraction Process and Source Code
4. Topology of the Hyperlink Graph
5. Related Datasets
6. License
7. Feedback
8. Credits
9. References

1. Levels of Aggregation

We provide the hyperlink graph on four different levels of aggregation: as a page graph, as a subdomain/host graph, as a 1st-level-subdomain graph, and as a pay-level-domain (PLD) graph.

The table below gives an overview of the size of the different graphs:

Graph                        #Nodes           #Arcs
Page Graph                   3,563 million    128,736 million
Subdomain/Host Graph         101 million      2,043 million
1st Level Subdomain Graph    95 million       1,937 million
PLD Graph                    43 million       623 million
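
To make the aggregation levels concrete, the sketch below shows how a single page URL could be mapped to a host, a 1st-level subdomain, and a pay-level domain. It is only an illustration under assumptions, not the code used to build the graphs: Guava's InternetDomainName is used as one way to determine the PLD via the public suffix list, and the rule that a 1st-level subdomain is the PLD plus at most one additional label is our reading of the aggregation level above.

    import java.net.URI;
    import java.util.Arrays;
    import com.google.common.net.InternetDomainName;

    public class AggregationLevels {
        public static void main(String[] args) {
            // Hypothetical page URL used only for illustration.
            String page = "http://news.sports.example.com/2012/article.html";

            // Subdomain/host level: the full host name of the URL.
            String host = URI.create(page).getHost();

            // PLD level: the pay-level domain, determined via the public suffix list.
            String pld = InternetDomainName.from(host).topPrivateDomain().toString();

            // 1st-level subdomain (assumed rule): the PLD plus at most one more label.
            String[] labels = host.split("\\.");
            int keep = Math.min(labels.length, pld.split("\\.").length + 1);
            String firstLevelSubdomain = String.join(".",
                    Arrays.copyOfRange(labels, labels.length - keep, labels.length));

            System.out.println("host:                " + host);
            System.out.println("1st-level subdomain: " + firstLevelSubdomain);
            System.out.println("pay-level domain:    " + pld);
        }
    }

For the hypothetical URL above, this yields news.sports.example.com, sports.example.com, and example.com, respectively.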

2. Data Formats and Download

We provide the graphs for free download in several formats. All graphs are provided in an index/arc data format. In addition, we provide the page graph in the format used by the WebGraph library and the PLD graph in the format used by Pajek. The files of the page graph are hosted on Amazon S3. The aggregated graphs are provided for download via a server in Mannheim, Germany.

2.1 Index/Arc Format

The Index/Arc format represents each graph using two files. Within the index file each line represents one node. The first column states the node name, the second column states the node index. Within the arc file each line represents a directed edge between two nodes, where the first column is the origin node and the second the target node. The files are sorted by index and use tabs as a delimiter. The following example files contain a graph with 106 nodes and 141 arcs.
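
For a quick impression of the layout, a tiny three-node graph could be represented as follows (the node names are hypothetical and the columns are tab-separated in the actual files):

    Index file:
    example.com/page-a    0
    example.com/page-b    1
    example.org/          2

    Arc file:
    0    1
    0    2
    2    0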

The following table contains the links for downloading the graphs.

Data Set                     Index File            Arc File
Page Graph                   see below (45 GB)     see below (331 GB)
Subdomain/Host Graph         download (832 MB)     download (9.2 GB)
1st Level Subdomain Graph    download (757 MB)     download (8.7 GB)
PLD Graph                    download (297 MB)     download (2.8 GB)
In case you experience any problems downloading the files, please find additional information here.

Downloading the page graph: Due to their size, the arc and index files of the page graph are split into smaller files of around 500 MB each. These files can be downloaded using

    wget -i http://webdatacommons.org/hyperlinkgraph/data/index.list.txt

for the index files and

    wget -i http://webdatacommons.org/hyperlinkgraph/data/arc.list.txt

for the arc files, respectively.

2.2 WebGraph Framework Format

We also provide the page graph in the format expected by the WebGraph Framework developed by Sebastiano Vigna. The graph is represented using three files: .graph, .offsets, .properties. All three are necessary to load the network into the library.

The WebGraph Framework can be downloaded from Maven Central; using it, these files can be loaded with the following line of code: BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger()).
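
Building on this, the following self-contained sketch loads the memory-mapped graph and prints the out-links of a single node. The base name "hyperlinkgraph" is a placeholder for the local path of the downloaded .graph/.offsets/.properties files; BVGraph.loadMapped, numNodes, and successors are part of the framework's public API.

    import it.unimi.dsi.logging.ProgressLogger;
    import it.unimi.dsi.webgraph.BVGraph;
    import it.unimi.dsi.webgraph.LazyIntIterator;

    public class LoadWdcPageGraph {
        public static void main(String[] args) throws Exception {
            // Placeholder base name; point it at the downloaded .graph/.offsets/.properties files.
            String baseName = args.length > 0 ? args[0] : "hyperlinkgraph";

            // Memory-map the graph instead of loading it completely into the heap.
            BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());
            System.out.println("nodes: " + graph.numNodes());

            // Print the successors (out-links) of node 0.
            LazyIntIterator successors = graph.successors(0);
            for (int target; (target = successors.nextInt()) != -1; ) {
                System.out.println("0 -> " + target);
            }
        }
    }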

2.3 Pajek NET Format

We also offer the PLD aggregation of the page graph in the Pajek NET format, which is understood by various graph analysis tools such as Pajek and Gephi. The format combines the index and the arc list into a single file (example file, 106 nodes, 141 arcs). The PLD graph can be downloaded as the Pajek version of the PLD Graph (2.3 GB) and, after unzipping, can directly be imported into Pajek. To process the graph in acceptable time, we recommend running Pajek with at least 32 GB of RAM.
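
For orientation, a tiny graph in the NET format would look roughly as follows (hypothetical domain names; the actual file contains the 43 million PLD nodes and 623 million arcs):

    *Vertices 3
    1 "example.com"
    2 "example.org"
    3 "example.net"
    *Arcs
    1 2
    1 3
    3 1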

2.4 Ranking Files

Besides the pure graph files, we have also calculated the harmonic centrality, indegree centrality, Katz's index, and PageRank for each host in the host graph. An interactive version of these rankings is available at wwwranking.webdatacommons.org. The underlying ranking data can also be downloaded using the files listed below. Each line within the files consists of the host and the value of the corresponding measure.
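
As a small usage sketch, the following program scans one of the ranking files and reports the host with the highest value. The file name is a placeholder and the tab delimiter is an assumption; adjust both to the actual download.

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class TopRankedHost {
        public static void main(String[] args) throws Exception {
            // Hypothetical file name; pass the real ranking file as the first argument.
            String file = args.length > 0 ? args[0] : "hostgraph-pagerank.tsv";
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String line;
                String bestHost = null;
                double bestValue = Double.NEGATIVE_INFINITY;
                // Assumes plain "host<TAB>value" lines without a header.
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t");
                    double value = Double.parseDouble(parts[1]);
                    if (value > bestValue) { bestValue = value; bestHost = parts[0]; }
                }
                System.out.println("highest-ranked host: " + bestHost + " (" + bestValue + ")");
            }
        }
    }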

3. Extraction Process and Source Code

The WDC Hyperlink Graph was extracted from the latest version of the Common Crawl, which was gathered in the first half of 2012. From this corpus, we extracted all HTML pages (mime-type: text/html) and every hyperlink pointing to another crawled HTML page (link types: a and link). For each redirect, we include an additional node in the graph, which links to the redirect target.

Since the Common Crawl corpus is provided via the AWS Simple Storage Service (S3), it made sense to perform the extraction in the Amazon cloud (EC2). The main criterion here is the cost of achieving a certain task. Instead of using the ubiquitous Hadoop framework, we found that using the Simple Queue Service (SQS) for our extraction process increased efficiency. SQS provides a message queue implementation, which we use to coordinate the extraction nodes. The Common Crawl data set is readily partitioned into compressed files of around 100 MB, each containing several thousand web pages. Besides these content files, metadata files are also provided. For each page, the metadata files include the URL, redirects, mime-type, hyperlinks, and link types in JSON format. As these files contain all information needed to extract the hyperlink graph of the crawled web pages, we used an adapted version of the framework that we had already used to extract RDFa, Microformats, and Microdata from the crawled pages to parse the URLs, redirects, links, and link types from the metadata files. We used 100 machines on Amazon EC2 to process the metadata files. In a second step, we created an index file for each aggregation level (PLD, subdomain/host, 1st-level subdomain) and indexed the graphs based on these mappings using Apache Pig running on a 40-node Amazon Elastic MapReduce (EMR) cluster.
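
The per-page part of this extraction essentially reads each JSON metadata record and emits one arc per hyperlink of type a or link. The sketch below illustrates that step only; it is not the actual WDC framework code, and the JSON field names used here (url, links, type, href) are assumptions made for illustration.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class MetadataLinkExtractor {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Emits one "source<TAB>target" arc per a/link hyperlink found in a metadata record.
        public static void extractArcs(String jsonLine, Appendable out) throws Exception {
            JsonNode record = MAPPER.readTree(jsonLine);
            String source = record.path("url").asText();                  // assumed field name
            for (JsonNode link : record.path("links")) {                  // assumed field name
                String type = link.path("type").asText();                 // assumed field name
                if ("a".equals(type) || "link".equals(type)) {
                    out.append(source).append('\t')
                       .append(link.path("href").asText()).append('\n');  // assumed field name
                }
            }
        }
    }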

The source code to extract the WDC Hyperlink Graph from the Common Crawl corpus can be checked out from our Subversion repository. To use the code, you will need to create your own configuration and fill in your AWS authentication information and bucket names. Compilation is performed using Maven; changing into the source root directory and typing mvn install should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. Besides the raw extraction framework, the project also includes various algorithms to format and manipulate the entire graph, such as shrinking it to a specific aggregation level or indexing the graph to compress it.

4. Topology of the Hyperlink Graph

We provide basic statistics about the topology of the graphs in a separate document.

5. Related Datasets

The Laboratory for Web Algorithms provides various hyperlink graphs for public download in the format understood by the WebGraph Framework. In comparison to these graphs, the WDC Hyperlink Graph is more recent and larger.
The Stanford Large Network Dataset Collection also contains several smaller hyperlink graphs (all below 1 million nodes).

Besides the Common Crawl corpus that was used to extract the WDC Hyperlink Graph, there are several other public web corpora that could be used to extract hyperlink graphs.

6. License

The extracted data is provided according to the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus.

The Web Data Commons extraction framework can be used under the terms of the Apache Software License.

7. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons can be found here.

8. Credits

Lots of thanks to

The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by an Amazon Web Services in Education Grant award. We thank our sponsors a lot for supporting Web Data Commons.


9. References