Abstract:
Defining proper semantic types for resources in Knowledge Graphs is one of the key steps in building high-quality data. Often, this information is either missing or incorrect, so it is crucial to define means to infer it. Several approaches have been proposed, including reasoning, statistical analysis, and the use of the textual information related to the resources. In this work we explore how textual information can be applied to existing semantic datasets to predict the types of resources, relying exclusively on the textual features of their descriptions. We apply our approach to DBpedia entries, combining different standard NLP techniques and exploiting the complementary information available to extract relevant features for different classifiers. Our results show that this approach generates types with high precision and recall, above the state of the art, both on the DBpedia dataset (94%) and on the LDH gold standard dataset (80%). We also discuss the utility of NLP4Types, the web tool we have created for this analysis, which has been released as an online application to collect feedback from end users, with the aim of enhancing the Knowledge Graph.