Review Comment:
The paper describes NLP4Types, an online application that explores textual information to predict types for resources. The tool also allows collecting feedback from users on the suggested types, with the aim of enhancing and completing Knowledge Graphs. NLP4types is evaluated on DBpedia and the LDH gold standard, and it outperforms state-of-the-art tools, achieving a precision of around 94% on DBpedia.
The paper is generally well-written (with some English errors, listed at the end of this review) and easy to follow. As this paper is submitted as a tool/system report, it focuses more on the system itself and its contribution.
Although the approach and the tool are nice and interesting, I have a few concerns, as follows:
The impact of this tool depends on its usability. In practice, I found the tool not very usable: the description area and the debugging window are very small and hard to read, the list of suggested types is very long, there is no way to suggest a new type, and no information is given about how the suggested types are used (see my detailed comments below). Moreover, nothing is said about how feedback is evaluated when a user suggests a wrong type, which would decrease the quality of the data. Many improvements still need to be implemented.
What do you mean by DBpedia chapters?
I agree with the authors that very often resources have a wrong or a very general type. In this regard, did the authors do anything to include their inferred corrected types in DBpedia? How did they evaluate the correctness of the types given as feedback by the users?
As described in the methodology section, some of the steps are optional. How does the prediction module decide whether the system should use the NER component or not? When reading the paper, I thought the user chooses whether or not to apply some of the steps, but I was not able to find this option in the web app.
The feedback module is in charge of providing all the options to the user. Using the web app, however, I could only see one suggested type.
User interface and user-friendliness:
Firstly, the web app interface is quite simple. The description text area is very small, and the same holds for the debugging area. There is a lot of unused space in the app, so what is the point of having such a small field for entering text? One question here: is there a limit (minimum and maximum) on the number of words/characters that can be included in the text description? The debugging window is really difficult to read, which makes it hard to follow what is happening. Apart from its small size, its content is very difficult to understand. I saw that this window will disappear once the demo tool is finished, so maybe that is why it does not contain information useful to the final user, but wouldn't it be better to let the user know what is going on in the process? This would, of course, require a debugging window whose information is more readable for the user of the tool.
Secondly, the application suggests only one type, presumably the one with the highest confidence, and the user should give feedback only on that type; if it is the wrong type, he/she should then select another one from a list of types. Why did the authors choose this design? Wouldn't it be better to present a set of possible types and have the user, for example, rank them or give a score to each of them? Often, users are not aware of all the possible types that might be good candidates, and so they can annotate the text with the wrong type.
Thirdly, the feedback module. The list of suggested types is very long and not intuitive at all. From a user perspective, it would be simpler to have a shorter list with the most generic types, from which the user could then explore or dig into the more specialized concepts under the selected one. The list is also misleading: for example, if I know the type needed to annotate the text, e.g., FictionalCharacter, and I search for Character, it does not point to FictionalCharacter at all, as it only matches types starting with Character. I would therefore suggest that the authors work on this part, as it is a bit confusing to give the right feedback for the annotated types. Do the types in this list somehow reflect the hierarchy of types in DBpedia?
Fourthly, there is no way to suggest a type outside of the list, except for selecting None. If one wants to suggest a new type, how can he/she do so?
Fifthly, where are the texts and the types annotated by a user stored? Are they evaluated by anyone? How are they used afterwards? The title of the paper refers to humans in the loop, but nothing is said in the paper about the evaluation of human feedback.
Sixthly, the usability of the tool in other applications. Are there any APIs? If users want to use NLP4types in their own application, how should they do so?
I tried two different examples in the app, both of them texts from tales. The first one was a text from Little Red Riding Hood and the second one was related to Cinderella. The type suggested for Little Red Riding Hood was , while for Cinderella the suggested type was . Both of these types are wrong, the second one even more so. I suggested the correct type and then tried again with the same text, and the suggested result was the same type that the app had given before. Are these suggested types used? And if yes, how and when?
"That is, our system learns from DBpedia data and produces types mimicking it, including the potential errors on the type assignment, whereas the manually annotated ones might diverge, thus lowering the accuracy of the predictions." Shouldn't it be the other way around? So the NLP4types should be mimicking the GS instead of DBpedia as it might include, as the claim of the paper, mistyped entities?
Moreover, I was also expecting an analysis of the confusion matrix: which types are the hardest to infer, and why? Nothing is said throughout the paper about ways to improve the tool's performance. Did the authors consider testing their system on Wikidata? Why did they not include it in the paper? I think very interesting findings would come up.
English grammar:
Thus it is crucial to define means to infer this information -> Thus, it is crucial to define means to infer this information
many DBpedia datasets -> many DBpedia versions
mappings are manually create -> mappings are manually created
We have also to take into account that -> We also have to take into account that
The reminder of this paper is structured as follows. -> The remainder of this paper is structured as follows (also, the references to the sections are wrong)
in which binary classifiers for each type -> the verb is missing
The related gold standards provides -> The related gold standards provide
The gold standard provided are used -> The gold standard/s provided is/are used
twofold -> two fold
to to collect feedback -> to collect feedback
state of the art -> state-of-the-art
show and interesting behaviour -> show an interesting behavior
is being to abstract (i.e. inferring types that are too generic) or to concrete -> is being too abstract (i.e. inferring types that are too generic) or too concrete
Table 1 show how -> Table 1 shows how
whereas the test set used for the gold standard evaluation is manually generated and curated one -> whereas the test set used for the gold standard evaluation is manually generated and curated