Extracting the Hyperlink Graph from the Common Web Crawl
Robert Meusel
Oliver Lehmberg
Christian Bizer
Sebastiano Vigna


This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.

We hope that the graph will be useful for researchers who develop search algorithms, spam detection methods, and graph analysis techniques, as well as for web science research in general.

Contents

1. Levels of Aggregation
2. Data Formats and Download
3. Extraction Process and Source Code
4. Topology of the Hyperlink Graph
5. Related Datasets
6. License
7. Feedback
8. Credits
9. References

1. Levels of Aggregation

We provide the hyperlink graph on four different levels of aggregation: as a page graph, as a subdomain/host graph, as a 1st-level-subdomain graph, and as a pay-level-domain (PLD) graph.

The table below gives an overview of the size of the different graphs:

Graph                        #Nodes           #Arcs
Page Graph                   3,563 million    128,736 million
Subdomain/Host Graph         101 million      2,043 million
1st Level Subdomain Graph    95 million       1,937 million
PLD Graph                    43 million       623 million
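
To make the aggregation levels concrete, the sketch below shows how a single page URL could be mapped to a host, a 1st-level subdomain, and a pay-level domain. It is only an illustration under assumptions, not the code used to build the graphs: Guava's InternetDomainName is used as one way to determine the PLD via the public suffix list, and the rule that a 1st-level subdomain is the PLD plus at most one additional label is our reading of the aggregation level above.

    import java.net.URI;
    import java.util.Arrays;
    import com.google.common.net.InternetDomainName;

    public class AggregationLevels {
        public static void main(String[] args) {
            // Hypothetical page URL used only for illustration.
            String page = "http://news.sports.example.com/2012/article.html";

            // Subdomain/host level: the full host name of the URL.
            String host = URI.create(page).getHost();

            // PLD level: the pay-level domain, determined via the public suffix list.
            String pld = InternetDomainName.from(host).topPrivateDomain().toString();

            // 1st-level subdomain (assumed rule): the PLD plus at most one more label.
            String[] labels = host.split("\\.");
            int keep = Math.min(labels.length, pld.split("\\.").length + 1);
            String firstLevelSubdomain = String.join(".",
                    Arrays.copyOfRange(labels, labels.length - keep, labels.length));

            System.out.println("host:                " + host);
            System.out.println("1st-level subdomain: " + firstLevelSubdomain);
            System.out.println("pay-level domain:    " + pld);
        }
    }

For the hypothetical URL above, this yields news.sports.example.com, sports.example.com, and example.com, respectively.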

2. Data Formats and Download

We provide the graphs for free download in several formats. All graphs are provided in an index/arc data format. In addition, we provide the page graph in the format used by the WebGraph library and the PLD graph in the format used by Pajek. The files of the page graph are hosted on Amazon S3. The aggregated graphs are provided for download via a server in Mannheim, Germany.

2.1 Index/Arc Format

The Index/Arc format represents each graph using two files. Within the index file each line represents one node. The first column states the node name, the second column states the node index. Within the arc file each line represents a directed edge between two nodes, where the first column is the origin node and the second the target node. The files are sorted by index and use tabs as a delimiter. The following example files contain a graph with 106 nodes and 141 arcs.
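
For a quick impression of the layout, a tiny three-node graph could be represented as follows (the node names are hypothetical and the columns are tab-separated in the actual files):

    Index file:
    example.com/page-a    0
    example.com/page-b    1
    example.org/          2

    Arc file:
    0    1
    0    2
    2    0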

The following table contains the links for downloading the graphs.

Data Set                     Index File            Arc File
Page Graph                   see below (45 GB)     see below (331 GB)
Subdomain/Host Graph         download (832 MB)     download (9.2 GB)
1st Level Subdomain Graph    download (757 MB)     download (8.7 GB)
PLD Graph                    download (297 MB)     download (2.8 GB)
In case you experience any problems downloading the files, please find additional information here.

Downloading the page graph: Due to their size, the arc and index files of the page graph are split into smaller files of around 500 MB each. These files can be downloaded using

    wget -i http://webdatacommons.org/hyperlinkgraph/data/index.list.txt

for the index files and

    wget -i http://webdatacommons.org/hyperlinkgraph/data/arc.list.txt

for the arc files, respectively.

2.2 WebGraph Framework Format

We also provide the page graph in the format expected by the WebGraph Framework developed by Sebastiano Vigna. The graph is represented using three files: .graph, .offsets, .properties. All three are necessary to load the network into the library.

The WebGraph Framework can be downloaded from Maven Central; using it, these files can be loaded with the following line of code: BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger()).
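
Building on this, the following self-contained sketch loads the memory-mapped graph and prints the out-links of a single node. The base name "hyperlinkgraph" is a placeholder for the local path of the downloaded .graph/.offsets/.properties files; BVGraph.loadMapped, numNodes, and successors are part of the framework's public API.

    import it.unimi.dsi.logging.ProgressLogger;
    import it.unimi.dsi.webgraph.BVGraph;
    import it.unimi.dsi.webgraph.LazyIntIterator;

    public class LoadWdcPageGraph {
        public static void main(String[] args) throws Exception {
            // Placeholder base name; point it at the downloaded .graph/.offsets/.properties files.
            String baseName = args.length > 0 ? args[0] : "hyperlinkgraph";

            // Memory-map the graph instead of loading it completely into the heap.
            BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());
            System.out.println("nodes: " + graph.numNodes());

            // Print the successors (out-links) of node 0.
            LazyIntIterator successors = graph.successors(0);
            for (int target; (target = successors.nextInt()) != -1; ) {
                System.out.println("0 -> " + target);
            }
        }
    }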

2.3 Pajek NET Format

We also offer the PLD aggregation of the page graph in the Pajek NET format, which is understood by various graph analysis tools such as Pajek and Gephi. The format combines the index and the arc list into a single file (example file, 106 nodes, 141 arcs). The PLD graph can be downloaded as the Pajek version of the PLD Graph (2.3 GB) and, after unzipping, can directly be imported into Pajek. To process the graph in acceptable time, we recommend running Pajek with at least 32 GB of RAM.
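
For orientation, a tiny graph in the NET format would look roughly as follows (hypothetical domain names; the actual file contains the 43 million PLD nodes and 623 million arcs):

    *Vertices 3
    1 "example.com"
    2 "example.org"
    3 "example.net"
    *Arcs
    1 2
    1 3
    3 1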

2.4 Ranking Files

Besides the pure graph files, we have also calculated the harmonic centrality, indegree centrality, Katz's index, and PageRank for each host in the host graph. An interactive version of these rankings is available at wwwranking.webdatacommons.org. The underlying ranking data can also be downloaded using the files listed below. Each line within the files consists of the host and the value of the corresponding measure.
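
As a small usage sketch, the following program scans one of the ranking files and reports the host with the highest value. The file name is a placeholder and the tab delimiter is an assumption; adjust both to the actual download.

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class TopRankedHost {
        public static void main(String[] args) throws Exception {
            // Hypothetical file name; pass the real ranking file as the first argument.
            String file = args.length > 0 ? args[0] : "hostgraph-pagerank.tsv";
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String line;
                String bestHost = null;
                double bestValue = Double.NEGATIVE_INFINITY;
                // Assumes plain "host<TAB>value" lines without a header.
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t");
                    double value = Double.parseDouble(parts[1]);
                    if (value > bestValue) { bestValue = value; bestHost = parts[0]; }
                }
                System.out.println("highest-ranked host: " + bestHost + " (" + bestValue + ")");
            }
        }
    }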

3. Extraction Process and Source Code

The WDC Hyperlink Graph was extracted from the latest version of the Common Crawl, which was gathered in the first half of 2012. From this corpus, we extracted all HTML pages (mime-type: text/html) and every hyperlink pointing to another crawled HTML page (link types: a and link). For each redirect, we include an additional node in the graph, which links to the redirect target.

Since the Common Crawl corpus is provided via the AWS Simple Storage Service (S3), it made sense to perform the extraction in the Amazon cloud (EC2). The main criterion here is the cost of achieving a certain task. Instead of using the ubiquitous Hadoop framework, we found that using the Simple Queue Service (SQS) for our extraction process increased efficiency. SQS provides a message queue implementation, which we use to coordinate the extraction nodes. The Common Crawl data set is readily partitioned into compressed files of around 100 MB, each containing several thousand web pages. Besides these content files, metadata files are also provided. For each page, the metadata files include the URL, redirects, mime-type, hyperlinks, and link types in JSON format. As these files contain all information needed to extract the hyperlink graph of the crawled web pages, we used an adapted version of the framework that we had already used to extract RDFa, Microformats, and Microdata from the crawled pages to parse the URLs, redirects, links, and link types from the metadata files. We used 100 machines on Amazon EC2 to process the metadata files. In a second step, we created an index file for each aggregation level (PLD, subdomain/host, 1st-level subdomain) and indexed the graphs based on these mappings using Apache Pig running on a 40-node Amazon Elastic MapReduce (EMR) cluster.
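
The per-page part of this extraction essentially reads each JSON metadata record and emits one arc per hyperlink of type a or link. The sketch below illustrates that step only; it is not the actual WDC framework code, and the JSON field names used here (url, links, type, href) are assumptions made for illustration.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class MetadataLinkExtractor {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Emits one "source<TAB>target" arc per a/link hyperlink found in a metadata record.
        public static void extractArcs(String jsonLine, Appendable out) throws Exception {
            JsonNode record = MAPPER.readTree(jsonLine);
            String source = record.path("url").asText();                  // assumed field name
            for (JsonNode link : record.path("links")) {                  // assumed field name
                String type = link.path("type").asText();                 // assumed field name
                if ("a".equals(type) || "link".equals(type)) {
                    out.append(source).append('\t')
                       .append(link.path("href").asText()).append('\n');  // assumed field name
                }
            }
        }
    }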

The source code to extract the WDC Hyperlink Graph from the Common Crawl corpus can be checked out from our Subversion repository. To use the code, you will need to create your own configuration and fill in your AWS authentication information and bucket names. Compilation is performed using Maven; changing into the source root directory and typing mvn install should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. Besides the raw extraction framework, the project also includes various algorithms to format and manipulate the entire graph, such as shrinking it to a specific aggregation level or indexing the graph to compress it.

4. Topology of the Hyperlink Graph

We provide basic statistics about the topology of the graphs in a separate document.

5. Related Datasets

The Laboratory for Web Algorithms provides various hyperlink graphs for public download in the format understood by the WebGraph Framework. In comparison to these graphs, the WDC Hyperlink Graph is more recent and larger.
The Stanford Large Network Dataset Collection also contains several smaller hyperlink graphs (all below 1 million nodes).

Besides the Common Crawl corpus that was used to extract the WDC Hyperlink Graph, there are several other public web corpora that could be used to extract hyperlink graphs.

6. License

The extracted data is provided according to the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus.

The Web Data Commons extraction framework can be used under the terms of the Apache Software License.

7. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons can be found here.

8. Credits

Lots of thanks to

The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by an Amazon Web Services in Education Grant award. We thank our sponsors a lot for supporting Web Data Commons.


9. References