Inferring Resource Types in Knowledge Graphs using NLP analysis and human-in-the-loop validation: The DBpedia Case

Tracking #: 2722-3936

Mariano Rico
Idafen Santana-Pérez

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Tool/System Report

Abstract:
Defining proper semantic types for resources in Knowledge Graphs is one of the key steps on building high quality data. Often, this information is either missing or incorrect. Thus it is crucial to define means to infer this information. Several approaches have been proposed, including reasoning, statistical analysis, and the usage of the textual information related to the resources. In this work we explore how textual information can be applied to existing semantic datasets for predicting the types for resources, relying exclusively on the textual features of their descriptions. We apply our approach to DBpedia entries, combining different standard NLP techniques and exploiting complementary information available to extract relevant features for different classifiers. Our results show that this approach is able to generate types with high precision and recall, above the state of the art, evaluated both on the DBpedia dataset (94%) as well as on the LDH gold standard dataset (80%). We also discuss the utility of the web tool we have created for this analysis, NLP4Types, which has been released as an online application to collect feedback from final users aimed at enhancing the Knowledge graph.
Full PDF Version: 


Solicited Reviews:
Review #1
Anonymous submitted on 26/Mar/2021
Major Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

This manuscript describes a system to automatically type a noun phrase in a given text and to collect feedback from users.
The system is well developed; however, some explanations in the manuscript need to be clarified.

First, in Sections 6 and 7, the authors claim that the proposed method outperforms the existing one.
However, there seems to be a difference in the gold-standard sets used for the evaluation, because the authors explain that 1825 resources were used out of the total of 2092, which might have been used to evaluate the existing method.
Could you explain this point in more detail?

Second, the authors explain that the proposed method uses DBpedia, which contains type errors, and that this is a reason why their method mistyped some test data.
However, more detailed analysis and discussion of the failures and drawbacks are needed for the reader to understand the proposed system.

Third, in Section 7, the authors claim that the proposed system achieved about six points more than the existing one.
While they mention precision and recall, judging from Table 3 this number refers to the hF-measure.
These statements are therefore confusing, and it would be better to compare the methods using the hF-measure.
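For context, the hF-measure mentioned here is typically the hierarchical F-measure computed over the ancestor-augmented label sets of the predicted and gold types. The sketch below is a generic illustration of that metric, not the paper's exact implementation; the type names are assumptions for the example.

```python
def h_metrics(pred_ancestors, true_ancestors):
    """Hierarchical precision, recall, and F1 over ancestor-augmented type sets."""
    inter = pred_ancestors & true_ancestors
    hp = len(inter) / len(pred_ancestors)
    hr = len(inter) / len(true_ancestors)
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf

# Hypothetical example: gold type Writer, predicted type Artist.
# Both share the ancestors Person and Agent, so the prediction still
# gets partial credit under the hierarchical metric.
pred = {"Artist", "Person", "Agent"}
gold = {"Writer", "Person", "Agent"}
hp, hr, hf = h_metrics(pred, gold)  # each equals 2/3
```

A flat (exact-match) metric would score this prediction as simply wrong, which is why comparisons at different points of the hierarchy should all be reported in the same hF-measure.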

Impact. While the authors claim that their system outperforms the existing one, their explanation is confusing. Therefore, the impact cannot be assessed for now; I do not consider it well demonstrated in the manuscript, and I cannot conclude that the system has potential impact.

Finally, there seem to be some typos, as follows.

p4 right column line 34-35.
The system can be decomposed is different modules,
might be
The system can be decomposed into different modules,

p5 left column line 46.
On of the goals of our tool,
might be
One of the goals of our tool,

p5 right column line 21.
our system is being to abstract
might be
our system is being too abstract

line 22
or to concrete
might be
or too concrete

Review #2
Anonymous submitted on 30/Mar/2021
Major Revision
Review Comment:

The paper describes NLP4Types, an online application that exploits textual information to predict types for resources. The tool also allows collecting feedback from users on the suggested types, with the aim of enhancing and completing Knowledge Graphs. NLP4Types is evaluated on DBpedia and the LDH gold standard, and it outperforms state-of-the-art tools, achieving a precision of around 94% on DBpedia.

The paper is generally well-written (with some English errors that are attached to this review) and easy to follow. As this paper is submitted as a tool/system report, the paper is more focused on the system itself and its contribution.

Despite being a nice and interesting approach and tool, I have a few concerns as follows:

The impact of this tool is determined by its usability. In practice, I found the tool not very usable, with problems such as the description area and the debugging window being very small and unreadable, a very long list of suggested types, no way to suggest a new type, and no information about how the suggested types are used (see my detailed comments below). No information is provided on how cases are evaluated when a user suggests a wrong type, as this would decrease the quality of the data. Many improvements still need to be implemented.

What do you mean by DBpedia chapters?

I agree with the authors that very often resources have a wrong or a very general type. In this regard, did the authors do anything to include their inferred corrected types in DBpedia? How did they evaluate the correctness of the types given as feedback by the users?

As described in the methodology section, some of the steps are optional. How does the prediction module decide whether the system should use the NER component or not? When reading the paper, I thought the user chooses whether to apply some steps, but I was not able to find this in the web app.

The feedback module is in charge of providing all the options to the user. Using the web app, I could only see one suggested type.

User interface and user-friendliness:
Firstly, the web app interface is quite simple. The description text area is very small, as is the debugging area. There is a lot of unused space in the app, so what is the point of having such a small field for adding text? One question here: is there a limit (min and max) on the number of words/characters that can be included in the text description? The debugging window makes it really difficult to read and follow what is happening. Apart from its small size, it is very difficult to understand its content. I saw that this window will disappear once the demo tool is finished, so maybe that is why it does not contain useful information for the final user, but isn't it better to let the user know what is going on in the process? Of course, this would require a debugging window with information that is readable for the user of the tool.

Secondly, the application suggests only one type (of course, the one with the highest confidence), and the user should give feedback only regarding that type; if it is the wrong type, then he/she should select from a list of types. Why did the authors choose this approach? Wouldn't it be better to give a set of possible types that the user could, for example, rank or score individually? Often, users are not aware of all the possible types that might be good candidates, and so they may annotate with the wrong type.
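The alternative suggested here, returning the top-k candidates rather than only the single best type, is a small change over whatever confidence scores the classifier already produces. A minimal sketch, where the score dictionary and type names are hypothetical:

```python
def top_k_types(scores, k=3):
    """Return the k highest-confidence type candidates instead of only the top one."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical classifier confidences for one resource description.
scores = {"Person": 0.61, "FictionalCharacter": 0.58, "Artist": 0.12, "Place": 0.02}

top_k_types(scores)
# [("Person", 0.61), ("FictionalCharacter", 0.58), ("Artist", 0.12)]
```

Showing a close runner-up such as the second candidate above is exactly the situation where a single-type interface hides a plausible alternative from the user.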

Thirdly, the feedback module. The list of suggested types is very long and not intuitive at all. From a user perspective, it would be simpler to have a shorter list with the most generic types, from which the user could explore or dig into the more specialized concepts of the selected one. The list is also misleading: for example, if I know the type needed to annotate the text, e.g., FictionalCharacter, and I search for Character, it does not point to FictionalCharacter at all, as it only searches for types starting with Character. I would therefore suggest the authors work on this part, as it is currently difficult to give the right feedback for the annotated types. Do the types in this list somehow represent the hierarchy of types in DBpedia?
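The search behaviour described above is consistent with a prefix match over type names; a substring (or token-level) match would surface FictionalCharacter for the query "Character". A small illustration, with an invented type list standing in for the real DBpedia ontology:

```python
# Illustrative subset of type labels (not the full DBpedia class list).
types = ["Character", "FictionalCharacter", "ComicsCharacter", "Person"]

def prefix_match(query, type_list):
    """Match only labels that start with the query (the behaviour observed in the app)."""
    return [t for t in type_list if t.startswith(query)]

def substring_match(query, type_list):
    """Match labels containing the query anywhere, case-insensitively."""
    return [t for t in type_list if query.lower() in t.lower()]

prefix_match("Character", types)     # ["Character"] only
substring_match("Character", types)  # also finds FictionalCharacter and ComicsCharacter
```

Switching to substring matching (or matching on camel-case tokens) would let users who know part of a type name reach the intended class.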

Fourthly, there is no way to suggest a type outside of the list, except for None. If one wants to suggest a new type, how can he/she do so?

Fifthly, where are the text and the types that a user annotates stored? Are they evaluated by anyone? How are they used afterwards? The title of the paper refers to humans in the loop, but nothing is said in the paper about the evaluation of human feedback.

Sixthly, the usability of the tool. Are there any APIs? If users want to use NLP4Types in their own application, how should they do so?

I tried two different examples in the app, both of them texts from tales. The first was a text from Little Red Riding Hood and the second was related to Cinderella. The type suggested for Little Red Riding Hood was , while for Cinderella the suggested type was . Both of these types are wrong, especially the second one. I suggested the correct type and then tried again with the same text, and the result was the same type that the app had suggested before. Are these user-suggested types used? And if yes, how and when?

"That is, our system learns from DBpedia data and produces types mimicking it, including the potential errors on the type assignment, whereas the manually annotated ones might diverge, thus lowering the accuracy of the predictions." Shouldn't it be the other way around? That is, shouldn't NLP4Types mimic the gold standard instead of DBpedia, since DBpedia, as the paper itself claims, might include mistyped entities?

Moreover, I was also expecting an analysis of the confusion matrices. Which types are the hardest to infer? Why? Nothing is said throughout the paper about ways to improve the tool's performance. Did the authors consider testing their system on Wikidata? Why did they not include it in the paper? I think very interesting things would come up.

English grammar:
Thus it is crucial to define means to infer this information -> Thus, it is crucial to define means to infer this information
many DBpedia datasets -> many DBpedia versions
mappings are manually create -> mappings are manually created
We have also to take into account that -> We also have to take into account that
The reminder of this paper is structured as follows. -> wrong references to the sections
in which binary classifiers for each type -> the verb is missing
The related gold standards provides -> The related gold standards provide
The gold standard provided are used -> The gold standard/s provided is/are used
twofold -> two fold
to to collect feedback -> to collect feedback
state of the art -> state-of-the-art
show and interesting behaviour -> show an interesting behavior
is being to abstract (i.e. inferring types that are too generic) or to concrete -> is being too abstract (i.e. inferring types that are too generic) or too concrete
Table 1 show how -> Table 1 shows how
whereas the test set used for the gold standard evaluation is manually generated and curated one -> whereas the test set used for the gold standard evaluation is manually generated and curated

Review #3
Anonymous submitted on 27/Apr/2021
Major Revision
Review Comment:

The paper describes an approach to type DBpedia entities that are missing types. As the authors describe in the introduction, "about 16%" of resources in English DBpedia are missing types. In general, completeness and quality are well-known issues for large automatically created KGs. In contrast to some earlier methods, this paper strives to classify entities using unstructured information, i.e., text.

The paper is well motivated. It would look even more convincing if
1) the authors could provide a reference or a more detailed explanation of their estimate of 16% untyped entities. Moreover, in the following paragraphs the authors mention that the real number is assumed to be even higher; it is not clear to the reader why such a conclusion is made;
2) the authors would add a few references on the topic, for example this paper: or this paper:

Impact. The tool developed in the paper is published in a GitHub repository: and deployed at the webpage: The publicly available metrics for the code repository do not reveal particular uptake at the moment: 1 star, 2 watchers, no forks. The paper does not mention any usage statistics for the demo deployment. No further evidence of impact is mentioned in the paper.
Potentially, if the tool is adopted by DBpedia or Wikidata, the tool could impact the completeness of data in the respective KG. However, the mistakes produced by the tool would decrease the quality of data in the KG, therefore, such an adoption is not an immediate next step and would require a detailed strategy.

Unfortunately, the repository describes neither how to run the experiments nor how to deploy the server. The license is MIT.

Overall, the paper is easy to follow. However, it contains many typos and incorrect expressions (for example, page 2 line 7: "when there is not structured information available", not -> no; page 1, line 40: "Notice that mappings are manually create", create -> created; abstract: "how textual information can be applied to existing semantic datasets"; etc.). Moreover, at times the notation is not introduced (for example, page 2, line 5: what are "features"?; page 2, line 8: what is "context information"?; Fig. 5: variables are not introduced; Fig. 2: how does the communication with the "Feedback Module" happen - not via APIs?).

My main concern is raised by the methods chosen by the authors. On page 4, lines 22-24, the authors claim "Using of SVMs for text classification has been proven to be efficient, performing at the state-of-the-art level" and back up this judgement with a reference (in the form of a footnote!) to a book dating to 2008. Also, the related work section is missing an overview of modern methods for the NER task, in particular the fine-grained NER field. Looking at the state of the art for related tasks, for example NER on OntoNotes, Entity Typing on Open Entity, and NER on CoNLL 2003, as of April 2021 we do not see any SVM-based methods. Moreover, all the top-3 methods are based on transformer-based deep networks; see the references at the provided links. Indeed, I did not find any work that tackles DBpedia typing using those deep models. Yet, transformers have advanced the state of the art in many NLP tasks, first of all NER, on most known benchmarks, and I do not see any objective reason why they would not perform well on DBpedia typing, the task considered in this paper. Therefore, I think that the paper CANNOT be accepted without conducting an experiment on the same data with transformer-based models.

Further remarks:

page 1, lines 25-26 in the abstract: "combining different standard NLP techniques and exploiting complementary information available to extract relevant features for different classifiers." does not really give any insight into the actual approach. It would be better if the reader could get some impression of the method from the abstract alone.

The paper should be proof-read to remove numerous typos and misprints.

The references contain 47 papers; however, I do not find papers beyond [18] cited anywhere in the current work.

Both a lemmatizer and a stemmer are used. Why did you choose to use both tools?
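For readers unfamiliar with the distinction behind this question: a stemmer truncates suffixes by rule and can produce non-words, while a lemmatizer maps a word to its dictionary form and can handle irregular forms. The toy implementations below are purely illustrative (not the authors' pipeline, and the tiny lemma dictionary is invented for the example):

```python
def crude_stem(word):
    """A crude rule-based suffix stripper in the spirit of a stemmer."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Minimal invented lemma dictionary; a real lemmatizer uses a lexicon like WordNet.
LEMMA_DICT = {"better": "good", "ran": "run", "studies": "study"}

def lemmatize(word):
    """Look up the dictionary form, falling back to the word itself."""
    return LEMMA_DICT.get(word, word)

crude_stem("studies")  # "stud"  (a truncated stem, not a real word)
lemmatize("studies")   # "study" (a valid dictionary form)
lemmatize("better")    # "good"  (an irregular form no suffix rule can recover)
```

Because the two tools normalize words differently, using both can yield complementary features, which may be the authors' rationale; the paper should state it explicitly.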