Coreference Resolution (CoRe) describes the task of determining which phrases in a text or spoken dialog refer to the same real-world entity and has been an active research field in natural language processing since the 1960s. Over the course of these more than 50 years the focus has gradually shifted: heuristic methods marked the beginnings, but especially since the 1990s more and more researchers have turned to machine learning to tackle the problem of Coreference Resolution. Although these approaches benefit from the technological advances of modern CPU and GPU computing, they run into the same problem as all machine learning methods: they rely heavily on the quality and quantity of available training data.
When project SUCRE was initiated in 2008, the lack of training data was identified as one source of the stagnating progress in this area of research. Creating high-quality annotated datasets, however, is a highly time-consuming task. Automated methods, on the other hand, can annotate large amounts of data much faster, but unfortunately also yield much lower quality. One aim of the project was to combine the quality of a human annotator with the speed of an automated system. The result of this endeavour was the annotation tool COALDA.
COALDA is an annotation tool that uses a semi-supervised approach. Instead of the text-centered approach of other corpus annotation tools such as MMAX2, COALDA focuses on visualizing the feature space. As it is impossible to visualize a high-dimensional feature space directly, an indirect way has to be found: in COALDA, a Self-Organizing Map (SOM) is trained on the feature vectors and this SOM is visualized. The software allows the user to:
- interactively explore the map
- label feature vectors with coreference information
- recalculate the map with different settings
The SOM is a good tool to reveal clusters of coreferent and non-coreferent mentions in a text document. These clusters can then be annotated at once with the corresponding coreference information. Additionally, a confidence value can be assigned, which allows for a fast way of annotating large sets of data. Besides this main functionality, the tool also allows the user to inspect the existing feature space in more detail or to develop new features.
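The SOM training loop behind such a visualization can be sketched as follows. This is a minimal illustration in Python/NumPy, not COALDA's actual implementation; the grid size, learning rate, and neighborhood radius are arbitrary choices:

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=20, lr0=0.5, radius0=2.0, seed=0):
    """Train a small Self-Organizing Map on a set of feature vectors.

    Each grid cell holds a weight vector. For every input, the
    best-matching unit (BMU) is found and the BMU and its grid
    neighbors are pulled toward the input. Nearby cells thus end up
    representing similar regions of the feature space, which is what
    makes the map useful for spotting clusters of similar mentions.
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))
    # grid coordinates of every cell, shape (rows, cols, 2)
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                # decaying learning rate
        radius = radius0 * (1 - frac) + 0.5  # shrinking neighborhood
        for x in rng.permutation(data):
            # BMU: grid cell whose weight vector is closest to the input
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), (rows, cols))
            # Gaussian neighborhood around the BMU on the grid
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights += lr * influence[..., None] * (x - weights)
    return weights

def best_matching_unit(weights, x):
    """Map a feature vector onto its grid cell."""
    dists = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)
```

Feature vectors that land in the same or adjacent cells form the clusters described above and can then be labeled in a single step.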
Besides gaining more data, another group of researchers within the project was concerned with exploring the already existing data. For this purpose they created a modular framework named SUCRE. Following the apparent trend, they incorporated not only local features, which are defined on a pair of mentions, but also global features that operate on clusters of mentions. This way the framework can work hand in hand with the annotation tool. Furthermore, they participated in several Coreference Resolution challenges, among them the SemEval-2010 Task 1 and the CoNLL-2011 Shared Task, where its performance was tested on multiple languages. SUCRE showed consistently good performance and even obtained the best results in the closed annotation tracks for English, German, and Italian.
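The distinction between local and global features can be illustrated with a small sketch. The mention representation and the feature names here are hypothetical examples, not SUCRE's actual feature set:

```python
from dataclasses import dataclass

@dataclass
class Mention:
    # Hypothetical, simplified mention representation for illustration.
    text: str
    head: str      # syntactic head word
    number: str    # "sg" or "pl"
    gender: str    # "m", "f", "n", or "unknown"

def pair_features(m1: Mention, m2: Mention) -> dict:
    """Local features: defined on a single pair of mentions."""
    return {
        "head_match": m1.head.lower() == m2.head.lower(),
        "number_agree": m1.number == m2.number,
        "gender_agree": "unknown" in (m1.gender, m2.gender)
                        or m1.gender == m2.gender,
    }

def cluster_features(cluster: list, m: Mention) -> dict:
    """Global features: defined between a candidate mention and a whole
    cluster of mentions already judged coreferent."""
    return {
        "any_head_match": any(c.head.lower() == m.head.lower()
                              for c in cluster),
        "all_number_agree": all(c.number == m.number for c in cluster),
    }
```

Global features of this kind let a system consult everything known about a partially built coreference chain, rather than judging each mention pair in isolation.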
Big Data and Cross-Document Coreference Resolution
In recent years one topic has gained increasing interest: Big Data. The term describes datasets that are too big, too complex, and/or too rapidly changing to process manually or with traditional methods of data analytics. Accordingly, Big Data is often described as a three-dimensional problem, where the dimensions are also known as the three V's of Big Data: Volume, Variety & Velocity. Whilst Volume and Velocity pose problems mostly related to the underlying hardware and the runtime complexity of the algorithms used, Variety poses problems of a completely different kind. A dataset is not necessarily limited to a single source; it can contain incomplete data and even comprise different types of data such as auditory, visual, or textual information. In short, it contains a lot of unstructured or semi-structured content that needs a great deal of preprocessing before it becomes accessible and analyzable. As the amount of data created by humanity keeps growing, faster even than humanity's ability to store it, sophisticated methods to filter out and categorize the important bits of information are needed if one wishes to harvest Big Data efficiently.
Coreference Resolution is an important preprocessing step and forms the basis for a broad variety of other machine learning techniques. High precision and accuracy are therefore needed, as errors are carried forward through the processing chain. The latest findings and trends in the area of (Co)Reference Resolution strongly indicate that the field is shifting its attention towards Big Data, focusing on algorithms that work on a more global scale. Cross-Document Coreference Resolution has become a field of particular interest, as it enables the analysis of data across different documents as well as of previously processed data, as opposed to the analysis of data within a single document, which was the former focus of research.
For a human to properly understand the content of one document and relate it to another, context is an important component. Traditional Coreference Resolution algorithms usually lack additional knowledge and depend solely on the information that can be extracted from the documents themselves. This, however, is not sufficient to reliably relate references from different documents to one another.
In order to overcome this problem, we decided to harness the power of knowledge bases such as WikiData, DBpedia, or YAGO, which contain facts about millions of entities as well as the relations among them [8, 9, 10]. We are currently developing a new method that incorporates this background knowledge into the coreference resolution task to find a more accurate and reliable way of making huge collections of data accessible to both humans and machines.
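The basic idea can be illustrated with a deliberately small, offline sketch: mentions from different documents are linked when they resolve to the same knowledge-base entity. The hand-written alias table below stands in for real lookups against WikiData, DBpedia, or YAGO, and the whole sketch is a didactic illustration, not the method under development:

```python
from typing import Optional

# Toy alias table mapping surface forms to WikiData-style entity ids.
# In practice this knowledge would come from querying a knowledge base.
ALIASES = {
    "barack obama": "Q76",
    "obama": "Q76",
    "the 44th u.s. president": "Q76",
    "angela merkel": "Q567",
    "merkel": "Q567",
}

def link_mention(mention: str) -> Optional[str]:
    """Resolve a surface mention to a knowledge-base entity id, if known."""
    return ALIASES.get(mention.lower().strip())

def cross_document_chains(docs: dict) -> dict:
    """Group mentions from all documents by the entity they resolve to,
    yielding coreference chains that span document boundaries."""
    chains: dict = {}
    for doc_id, mentions in docs.items():
        for m in mentions:
            entity = link_mention(m)
            if entity is not None:
                chains.setdefault(entity, []).append((doc_id, m))
    return chains
```

With background knowledge, "Barack Obama" in one document and "the 44th U.S. President" in another end up in the same chain, which purely document-internal string or agreement features could not establish.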