X-LiSA: Cross-lingual Semantic Annotation


Overview

In recent years, large repositories of structured knowledge, such as Wikipedia and Linked Open Data (LOD) sources including DBpedia, Freebase and YAGO, have become valuable resources for natural language processing, especially for the automatic aggregation of knowledge from textual data. One essential component that leverages such knowledge bases (KBs) is semantic annotation: the linking of words or phrases in text documents to elements of the KBs. At the same time, if speakers of different languages are to have access to the same information, there is a pressing need for systems that help overcome language barriers by facilitating multilingual and cross-lingual access to information originally produced for a different culture and language. This poses new challenges for semantic annotation tools, which are typically language dependent and link documents in one language to a KB grounded in the same language. Ultimately, the goal is to construct cross-lingual semantic annotation tools that can link words or phrases in unstructured text in one language to resources in structured KBs in any other language, or to language-independent representations.

Definition


On one side, we have a knowledge base KB containing a set of entities, each of which has a description in language L, together with the relations between these entities. On the other side, we have a document in language L' containing a set of name mentions. Cross-lingual semantic annotation is the task of linking (annotating) the name mentions in documents in language L' to their referent entities in the KB in language L.
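
To make the shape of the task concrete, the following is a toy sketch of the inputs and outputs in the definition above. It is not how X-LiSA works internally: it assumes a hand-made KB in which every entity carries labels in several languages, and it links mentions by exact, case-insensitive label match. The names Entity, TOY_KB and link_mentions are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """A KB entity with one label per language (toy assumption)."""
    uri: str
    labels: dict  # language code -> label, e.g. {"en": "Germany", "de": "Deutschland"}


# Hand-made two-entity KB, for illustration only.
TOY_KB = [
    Entity("http://dbpedia.org/resource/Germany",
           {"en": "Germany", "de": "Deutschland"}),
    Entity("http://dbpedia.org/resource/Karlsruhe",
           {"en": "Karlsruhe", "de": "Karlsruhe"}),
]


def link_mentions(mentions, mention_lang, kb, kb_lang):
    """Link mentions in language L' (mention_lang) to entities described
    in language L (kb_lang). Unlinkable mentions map to None."""
    # Index the KB by its labels in the mention language.
    index = {
        entity.labels[mention_lang].lower(): entity
        for entity in kb
        if mention_lang in entity.labels
    }
    links = {}
    for mention in mentions:
        entity = index.get(mention.lower())
        if entity is None:
            links[mention] = None
        else:
            # Return the entity URI and its label in the KB language.
            links[mention] = (entity.uri, entity.labels.get(kb_lang))
    return links


if __name__ == "__main__":
    # German mentions (L' = "de") linked to English KB resources (L = "en").
    print(link_mentions(["Deutschland", "Atlantis"], "de", TOY_KB, "en"))
```

A real system additionally has to detect the mentions in the text and disambiguate between several candidate entities; the lookup table here only stands in for that step.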


Service

Service Address:
- http://km.aifb.kit.edu/services/text-annotation/
Input Parameters:
- source: the URL of a web page or raw text
- model: the NLP model used for mention detection (i.e., "NER" for named entities only and "NGRAM" for both named and nominal entities)
- lang1: the language of input source information (i.e., "en" for English, "de" for German, "zh" for Chinese, "es" for Spanish, "ca" for Catalan and "sl" for Slovenian)
- lang2: the language of output knowledge base resources (i.e., "en", "de", "zh", "es", "ca" and "sl")
- kb: the knowledge base used for annotation (e.g., "dbpedia" or "wikipedia")
Output:
- XML output containing augmented text with links to a list of relevant resources in the knowledge base specified in the kb parameter
- The web page of the input URL with inserted annotations based on the resources in the knowledge base specified in the kb parameter

Example:

- Annotate the AIFB portal in German with both named and nominal entities based on English DBpedia
- Annotate a “CNN” news page in English with only named entities based on Chinese Wikipedia 
- Annotate raw text in Chinese with named entities based on Slovenian Wikipedia
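
As a rough sketch, the parameters listed under Service can be assembled into a request URL as shown here. This assumes the service accepts them as URL query parameters on the service address, which this page does not state explicitly. The helper build_request_url and the page URL used for the AIFB portal are illustrative assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode

# Service address as given on this page.
SERVICE_URL = "http://km.aifb.kit.edu/services/text-annotation/"

# Parameter values listed on this page.
MODELS = {"NER", "NGRAM"}
LANGUAGES = {"en", "de", "zh", "es", "ca", "sl"}


def build_request_url(source, model, lang1, lang2, kb):
    """Build a request URL for the annotation service.

    Assumption: parameters are passed as a URL query string.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if lang1 not in LANGUAGES or lang2 not in LANGUAGES:
        raise ValueError(f"unsupported language pair: {lang1!r}, {lang2!r}")
    query = urlencode({
        "source": source,  # URL of a web page or raw text
        "model": model,
        "lang1": lang1,    # language of the input
        "lang2": lang2,    # language of the KB resources
        "kb": kb,
    })
    return f"{SERVICE_URL}?{query}"


if __name__ == "__main__":
    # First example: German page, named and nominal entities, English DBpedia.
    # The portal URL below is an assumed placeholder.
    print(build_request_url("http://www.aifb.kit.edu/", "NGRAM", "de", "en", "dbpedia"))
```

The other two examples differ only in their arguments: model "NER" with lang1 "en", lang2 "zh" and kb "wikipedia" for the English news page, and raw Chinese text as source with lang1 "zh", lang2 "sl" and kb "wikipedia" for the last one.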

Use Case

We annotated sample data extracted from online news feeds and social media using our service. Because the annotated data is modeled in RDF, complex questions about the data can be answered using SPARQL queries (see some example queries).

Publication





(c) 2015-2016 Lei Zhang, Institute AIFB, KIT