Corpex — Corpora Explorer


Corpex lets you swiftly browse through all the words of Wikipedia. Select your language and start typing; the system then shows you two statistics in four graphs. From left to right, these are: 1) the ten most frequent words that start with the typed sequence of letters (as a bar chart and a pie chart), and 2) the most frequent letters following the already typed sequence of letters (again, as a bar chart and a pie chart). The three dots (...) mean "other word", the dollar sign ($) means "end of the word".
In the second row, the ten words that most frequently follow the input word are visualized (as a bar chart and a pie chart). Here, the three dots (...) mean "other word", the dollar sign ($) means "end of sentence".
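
To make the first-row statistics concrete, the following Python sketch computes them for a toy word list. It is only an illustration under stated assumptions, not the Corpex implementation: the function name prefix_stats and the toy counts are hypothetical, and the "..." bucket for other words is left out for brevity.

```python
from collections import Counter


def prefix_stats(word_counts, prefix, top=10):
    """Return (most frequent words with this prefix, next-letter counts).

    word_counts maps each word to its frequency. "$" marks the end of
    the word, following the convention described above.
    """
    words = Counter()
    next_letters = Counter()
    for word, count in word_counts.items():
        if not word.startswith(prefix):
            continue
        words[word] += count
        # The letter right after the prefix, or "$" if the word ends there.
        follower = word[len(prefix)] if len(word) > len(prefix) else "$"
        next_letters[follower] += count
    return words.most_common(top), next_letters.most_common()


# Hypothetical toy data, not taken from Wikipedia.
toy_counts = {"star": 5, "start": 3, "sta": 1, "stone": 4, "wars": 2}
top_words, letters = prefix_stats(toy_counts, "sta")
print(top_words)
print(letters)
```

The same counting idea extends to the second row by counting the word that follows the input word instead of the letter that follows the prefix.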

Corpex is also available as a RESTful web service. Simply call http://km.aifb.kit.edu/sites/corpex/corpex.php?lang=XX&q=Y with XX being the Wikipedia language code (see below) and Y being the starting letter sequence. You will get back a JSON result with the same data that you see on the page.
The bigram statistics are available through http://km.aifb.kit.edu/sites/corpex/bigrams.php?lang=XX&q=Y with Y being two "+"-separated words representing the bigram in question, e.g. "star+wars".
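
As a rough sketch of how a client might call the two endpoints, the Python snippet below builds the request URLs and parses the JSON response. The helper names (prefix_url, bigram_url, fetch_json) are hypothetical, the layout of the returned JSON is not documented here and is therefore not assumed, and the service may not be reachable from every network.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://km.aifb.kit.edu/sites/corpex"


def prefix_url(lang, prefix):
    """URL for the statistics of words starting with the given letters."""
    return f"{BASE}/corpex.php?" + urlencode({"lang": lang, "q": prefix})


def bigram_url(lang, first, second):
    """URL for the bigram statistics of two words."""
    # urlencode turns the space into "+", which yields e.g. q=star+wars.
    return f"{BASE}/bigrams.php?" + urlencode({"lang": lang, "q": f"{first} {second}"})


def fetch_json(url, timeout=10):
    """Download a URL and decode the body as JSON (structure not assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(prefix_url("en", "sta"))
    print(bigram_url("en", "star", "wars"))
    # Uncomment to query the live service:
    # print(fetch_json(prefix_url("en", "sta")))
```

Building the query with urlencode rather than string concatenation keeps non-ASCII prefixes, which are common in several of the supported languages, correctly percent-encoded.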

Corpex is still very much under development. The extracted data is currently very noisy, and we are working on better extraction and filtering approaches. The source code is open source, and all the data is freely available. Feedback, and especially suggestions for cooperation, is welcome.

Corpex is currently available in the following languages: German (de), English (en), Spanish (es), French (fr), Hungarian (hu), Romanian (ro), Albanian (sq), Bulgarian (bg), Czech (cs), Italian (it), Swedish (sv), Serbian (sr), Croatian (hr), Serbo-Croatian (sh), Bosnian (bs), and Simple English (simple). It is also available for the Brown Corpus (brown). Further languages are being prepared.

Corpex is being developed within the EU FP7 project RENDER, which aims at understanding diversity on the Web.