Corpex lets you swiftly browse through all the
words of Wikipedia. Select your language, and as you start typing, the
system shows you two statistics in four graphs. From left to
right, these are: 1) the ten most frequent words that start with the typed sequence of
letters (as a bar chart and a pie chart), and 2) the most frequent letters
following the already typed sequence of letters (again, as a bar chart and a
pie chart). The three dots (...) mean "other word", and the dollar sign ($)
means "end of the word".
In the second row, the ten words that most frequently follow the
input word are visualized (as a bar chart and a pie chart). Here, the three
dots (...) mean "other word", and the dollar sign ($) means "end of
sentence".
Corpex is also available as a RESTful web service:
simply call
http://km.aifb.kit.edu/sites/corpex/corpex.php?lang=XX&q=Y
with XX being the Wikipedia language code (see below)
and Y being the starting letter sequence. You will get
back a JSON result with the same data that you see on the page.
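As a rough illustration, the call described above could be made from Python as follows. This is a minimal sketch, not part of Corpex itself: the function names build_prefix_url and fetch_prefix_stats and the timeout value are assumptions chosen for this example, and the layout of the returned JSON is not specified here, so the result is handed back as decoded without further interpretation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the description above.
CORPEX_URL = "http://km.aifb.kit.edu/sites/corpex/corpex.php"


def build_prefix_url(lang, prefix):
    """Return the corpex.php URL for a language code and a letter sequence.

    Hypothetical helper: urlencode percent-encodes non-ASCII letters as UTF-8.
    """
    return CORPEX_URL + "?" + urlencode({"lang": lang, "q": prefix})


def fetch_prefix_stats(lang, prefix, timeout=10):
    """Call the service and decode the JSON body.

    Hypothetical helper: the structure of the result is not documented here,
    so the decoded object is returned unchanged.
    """
    with urlopen(build_prefix_url(lang, prefix), timeout=timeout) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Only builds and prints the URL; no network access is needed for this.
    print(build_prefix_url("en", "sta"))
```

Calling fetch_prefix_stats("en", "sta") would then request the statistics for English words starting with "sta", provided the service is reachable.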
The bigram statistics are available through
http://km.aifb.kit.edu/sites/corpex/bigrams.php?lang=XX&q=Y
with Y being two "+"-separated words
representing the bigram in question, e.g. "star+wars".
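The bigram URL can be assembled in the same spirit. The sketch below only builds the request URL; the helper name build_bigram_url is an assumption made for this example. It keeps the "+" between the two words literal, as in the "star+wars" example, and percent-encodes each word separately.

```python
from urllib.parse import quote

# Endpoint taken from the description above.
BIGRAMS_URL = "http://km.aifb.kit.edu/sites/corpex/bigrams.php"


def build_bigram_url(lang, first, second):
    """Return the bigrams.php URL for a two-word bigram.

    Hypothetical helper: each word is percent-encoded on its own, then the
    two words are joined with a literal "+", e.g. "star+wars".
    """
    for word in (first, second):
        # A bigram consists of exactly two single words.
        if not word or any(ch.isspace() for ch in word):
            raise ValueError("each part of the bigram must be one non-empty word")
    query = "+".join(quote(word, safe="") for word in (first, second))
    return f"{BIGRAMS_URL}?lang={quote(lang, safe='')}&q={query}"


if __name__ == "__main__":
    print(build_bigram_url("en", "star", "wars"))
```

The resulting URL can be fetched with any HTTP client.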
Corpex is still very much under development. The extracted data is still very noisy, and we are working on better extraction and filtering approaches. The source code is fully open source, and all the data is freely available. Feedback, and especially suggestions for cooperation, are welcome.
Corpex is currently available in the following languages: German (de), English (en), Spanish (es), French (fr), Hungarian (hu), Romanian (ro), Albanian (sq), Bulgarian (bg), Czech (cs), Italian (it), Swedish (sv), Serbian (sr), Croatian (hr), Serbo-Croatian (sh), Bosnian (bs), and Simple English (simple). It is also available for the Brown Corpus (brown). Further languages are being prepared.
Corpex is being developed within the EU FP7 project RENDER, which aims at understanding diversity on the Web.