Review Comment:
# Article summary
This article presents an approach to multi-view ontology development, applied to the particular scenario of diet and health. The authors focus their research on the process of producing an ontology with multiple non-experts. To do so, the article describes a methodology in which the authors 1) selected the most disputable subtopics ("based on advice of a senior clinical diet expert"); 2) built a unified ontology from ontological triples defined by several information specialists, who searched for knowledge on the different subtopics at professional academic and governmental medical Web sites such as PubMed; and 3) asked the crowd to classify the obtained ontological statements as "true" statements, "viewpoint" statements or "erroneous" statements. The authors measured the quality of executing the third step with crowd workers by comparing the crowd responses with a gold standard generated by a different group of information specialists, who classified the statements based on the available literature. The crowd classification was executed in two variations: asking crowd workers for their own opinion ("subjective assessment", as the authors call it), and asking crowd workers for others' opinion ("objective assessment"). The authors found that the latter yielded results of higher accuracy.
# Originality
With the rise of crowdsourcing and collaborative technologies, managing multiple views in ontology engineering is gaining momentum. The authors present the methodology as their major contribution; however, the methodology itself does not provide a very novel solution compared to the state of the art. What the authors do is use microtask crowdsourcing to label the data produced by information specialists, in order to have ontological statements classified into "truth", "viewpoint" and "error". A positive aspect is that the authors try to compare subjective and objective formulations of their microtasks. As a case study, a multi-view ontology of diet and health might not be common, as highlighted by the authors.
# Significance of results
Aggregating the knowledge of multiple humans, who may differ in the way they process and interpret information, is a very relevant topic for the area of ontology engineering. Diverse perspectives in knowledge engineering can be useful for developing a richer model, as well as for identifying flaws in the represented knowledge.
While the topic and the technologies used in the article are relevant and interesting for the HC&C community, I identify several limitations in the definition and evaluation of the methodology, which is presented as the major contribution of the submission (see details below). A positive aspect is that the authors analyzed different evaluation measures and aggregation methods. However, for the methodology to be properly evaluated, I think it should have been analyzed across different domains and in more detail.
## Managing multiple viewpoints
* Given that the focus of this work is to handle multiple views, the way in which the process starts (i.e. identifying the most disputable subtopics with the advice of a single domain expert) is quite restrictive. The authors admit that domain experts can have a narrow, single viewpoint; therefore, it could happen that the expert identified the set of disputable subtopics in a very peculiar way, far from the literature and the community of domain experts. Even the granularity of the viewpoint classification may vary considerably from one person to another. Would it not be more natural to start the process with multiple humans, even if they were not domain experts, and automatically analyze the emergence of controversial subtopics? That is, to start with the information specialists directly, introducing only a little guidance on the target subtopics.
* The article suggests that the step in which several ontologies are combined and normalized (Sections 3.1.3, 3.1.4) was done manually by the authors. If the methodology is intended to be executed by other researchers or users who are not ontology engineers, who would be in charge of such a step? Would it be feasible to automate this step? And if so, how? This looks like one of the most critical and challenging parts of aggregating knowledge from different sources, and it is carried out manually by the authors.
* In order to evaluate the methodology, I would suggest that the authors run their experiments in multiple knowledge domains, not only in diet and health. This would enable a deeper analysis and could show the methodology to be generalizable. There might be domains in which the disagreement occurs at different levels, and there might be domains in which the crowdsourcing experiment requires further quality assurance measures.
* The authors highlight three research questions. However, the evaluation focused on the crowdsourcing step (i.e. the third research question). How do the authors consider the first and second questions to be answered? Did they consider and compare alternatives for building a multi-view ontology or the gold standard? These were challenges of the work, which were addressed in a particular way, but I wonder whether one can claim that they were "tested research questions". The first question mentions a "comprehensive multi-view ontology": how is comprehensiveness measured there? How can the authors determine whether the ontology built from the information specialists' contributions is of high or low quality? Is this measured by the number of errors and viewpoints identified by the crowd classification?
* The authors say that "semantic contradictions were intentionally left in the ontology to reflect diverse viewpoints on the domain"; however, after the crowd classification, statements are said to be split into "true statements", "viewpoints" and "errors", and these semantic contradictions fall into the category of viewpoints. It is not clear whether these statements are specifically annotated as "truth", "viewpoint" and "error" within the RDF triples; one possible encoding is sketched below. How is this aggregated knowledge meant to be used or exploited, for example in a decision support system? What would be the goal of identifying viewpoints?
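To make this question concrete, the following is a minimal sketch of one way the crowd-derived classification could be attached to a triple, using RDF reification via rdflib. The IRIs, property names and score are hypothetical and not the authors' actual vocabulary; I could not tell from the text whether anything along these lines is done.

```python
from rdflib import Graph, Namespace, Literal, BNode
from rdflib.namespace import RDF

# Hypothetical namespace and terms; the authors' real IRIs were not available to me.
EX = Namespace("http://example.org/diet#")

g = Graph()
g.bind("ex", EX)

# An ontological statement of the kind produced by the information specialists.
stmt = (EX.LowCarbDiet, EX.reducesRiskOf, EX.Type2Diabetes)
g.add(stmt)

# One possible encoding: reify the statement and attach the aggregated crowd label
# ("truth" / "viewpoint" / "error") together with an agreement score.
node = BNode()
g.add((node, RDF.type, RDF.Statement))
g.add((node, RDF.subject, stmt[0]))
g.add((node, RDF.predicate, stmt[1]))
g.add((node, RDF.object, stmt[2]))
g.add((node, EX.crowdClassification, Literal("viewpoint")))
g.add((node, EX.agreementScore, Literal(0.62)))

print(g.serialize(format="turtle"))
```

Whether reification, named graphs or simple annotation properties are used matters less than stating explicitly how downstream applications are expected to consume the "viewpoint" label.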
## Experiments
* A positive aspect of the evaluation is that the authors compared several aggregation methods.
* One of the findings of this work is that when workers are asked to give their own opinion, they select more radical answers (i.e. something is true or false), while when they are asked to say what they think others would answer, they select more "grey-scale" answers. Would it not be fairer to compare the same set of possible answers in a first-person ("I think...") and third-person ("experts think...") form? Did the authors look at each worker's set of answers, to see whether individual workers were repeatedly selecting strong or grey-scale answers?
* Another finding is that crowd workers seem to agree more when they say what they think experts think. Do the authors have an interpretation for this result?
* "The viewpoints were the hardest cases to be found", but they are actually the most interesting ones, because finding contradictions, or being aware of them, is the most difficult part of the process. I wonder whether this would be analogous in another domain. Usually, people tend to believe in all kinds of diets, and therefore crowd workers might be biased towards accepting statements as true in most cases.
* Why did the authors exclude the definitions of concepts (as stated in Section 3.3)? Indeed, they could compare their results with and without definitions.
* Were crowd workers allowed to search on the Web for knowledge?
* Did the authors consider asking crowd workers to provide an example of a contradiction when they select that something is not always true?
* Why did the authors not include a fourth case, in which the subjective version is also shorter, as in Experiment 1?
* The authors read from Table 2 that SVM and MLP are the methods which elicited the best results; however, as highlighted in the table, a majority vote of 75% (popular among crowdsourcing requesters) also provides highly accurate results when workers are asked about experts' opinion. What analysis do the authors draw from these results? Are there situations in which a particular kind of experiment (subjective/objective) and a particular aggregation method are more recommended? (A sketch of the kind of comparison I have in mind is given after this list.)
* Did the authors check whether formulating the questions so that they indicate which kind of experts, and in which context, would consider a statement to be true changes anything in the results?
* Did the authors compare the crowd workers' collective annotation with the labels provided by the information specialists while generating the gold standard? Was there disagreement to the same degree and on the same statements?
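To make the point about aggregation methods concrete, below is a minimal sketch of the kind of head-to-head comparison I would find informative. The data layout, the random stand-in data and the assumption that a 75% majority vote abstains when no answer reaches the threshold are all mine, purely for illustration; the point is simply that accuracy and coverage of each aggregation method should be reported against the same gold standard.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical layout: per statement, the fraction of workers choosing each of the
# three answers (truth, viewpoint, error), plus a gold-standard label.
rng = np.random.default_rng(0)
answer_fractions = rng.dirichlet(np.ones(3), size=200)  # stand-in for real judgments
gold = answer_fractions.argmax(axis=1)                  # stand-in for the gold standard

# Aggregation A: 75% majority vote (abstains when no answer reaches the threshold).
majority = np.where(answer_fractions.max(axis=1) >= 0.75,
                    answer_fractions.argmax(axis=1), -1)
covered = majority != -1
mv_accuracy = (majority[covered] == gold[covered]).mean()

# Aggregation B: a learned aggregator (SVM) trained on the same per-statement features.
svm_accuracy = cross_val_score(SVC(), answer_fractions, gold, cv=5).mean()

print(f"75% majority vote: accuracy {mv_accuracy:.2f} on {covered.mean():.0%} of statements")
print(f"SVM aggregation:   accuracy {svm_accuracy:.2f} on all statements")
```

Reporting both numbers per experiment type (subjective vs. objective) would make it much easier to recommend a setup to future users of the methodology.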
# Quality of Writing
The article is written in correct English; it is readable and there are only a few typos:
* Section 3.1.2: incomplete sentence "The participants were given a list of upper level concepts and properties (such as)".
* Section 3.2.1: delete the extra "e" in "To this end, at first, a questionnaire was e devised for crowdsourcing workers".
However, I think that the text lacks specificity in some parts (see more details below).
The text could also be improved by restructuring some parts, especially to make it more aligned with the standard elements of scientific publications (suggestions listed below). The illustrations are quite clear; however, as a minor suggestion, I would encourage the authors to change the colours of Figures 5, 6 and 7, because when the article is printed in black and white the bars are not easily distinguishable. Moreover, labelling the X and Y axes in the figures would also make them more readable.
## Suggestions for improving clarity
* In the introduction, the authors justify the use of crowdsourcing by arguing that "in addition to high price and low availability, experts tend to build narrow single-viewpoint ontologies which capture their individual opinions, beliefs and work experience, but which might be unacceptable for other experts" and that "non-expert subjects and particularly information specialists specifically guided can be more objective and thus accurate than domain experts". I agree that crowdsourcing provides an easier way to get humans involved, and hence that the decision to work with non-experts is understandable. However, the reader may wonder about the facts behind the second statement and the way such objectivity is measured.
* The concepts "truth", "viewpoint" and "error" need to be precisely and formally defined from the beginning. The description provided in the gold standard section should be extracted and highlighted in earlier sections, in order to give the reader a better understanding of this classification.
* In my opinion, the authors should explain more prominently from the beginning the purpose of each of the steps (e.g. what the task of the crowd workers is in the process). The reader should clearly understand what the role of, for example, the crowd workers is. The use of crowdsourcing is justified in the introduction and the research questions suddenly mention "classification", but the reader might only understand the point of every step after reading the details of the experimental setup. In my opinion, this should be improved, because the reader might want to know whether the methodology is suitable for her requirements or not.
* In my opinion the text should describe more clearly what "experts" and "non-experts" are. The authors could distinguish between domain and task experts (e.g. information specialists).
* Some of the explanations in Section 3 (methods) are imprecise. For example, when describing the crowdsourcing experiment (Section 3.2.1), the authors could provide more details about the way the microtasks are generated and configured. What exactly do the qualification tests contain? How were the questions selected when designing the qualification tests? How could they be created when applying the methodology to other knowledge domains? Moreover, the authors say they received "30,000 individual worker judgments": how many per microtask? What was the average number of microtasks completed per worker?
* The description about the experimental setup should be separated from the methods. There should be a distinction between methods and experimental evaluation.
* A more detailed description of the data used (i.e. the statements) would also be convenient.
* All the elements used in the evaluation should be clearly defined, and there should be a better distinction between the evaluation measure (accuracy) and the aggregation method. Both should be described more precisely (e.g. more details on the use of methods such as SVM should be given).
* There are descriptions in the experiments sections that should be rearranged. For example, the details on the way qualification tests are defined (with true and false statements) should not be included in "results and discussion" but in the experimental setup instead.
* It is not clear why the authors decide to evaluate the "performance of two types of classification, 3-class and 2-class classification" within the crowdsourcing experiment, since they repeatedly make the distinction between the three kinds of statements throughout the paper.
* Do the information specialists generating the gold standard see the sources supporting the statements produced in steps 3.1.3 – 3.1.4? How many humans should usually be involved in generating such a gold standard? Should they satisfy any requirements?
* It would be very convenient to introduce a section with a more detailed description and analysis of cases (i.e. ontological statements) on which crowd workers agree and disagree.
# Open access to research resources
* Unfortunately, the RDF ontology was not found on the specified project Web site. Two Excel data files are downloadable, but I could not find the ontology that is mentioned on the Web site as "The ontology is based on a few hundreds scientific articles and currently comprises over 750 assertions (RDF-style triples)".
* I would encourage the authors to provide access to the microtask templates, or at least access to a CrowdFlower sample job, including a qualification test. Having access to the set of statements which are directly given to the crowd (as a data dump) would also be convenient.
# Other comments and suggestions
* The authors have cited one of the early works of Aroyo et al. on CrowdTruth. However, there are currently several publications on this framework, which the authors could probably find very interesting.
** L. Aroyo, C. Welty. The Three Sides of CrowdTruth. Journal of Human Computation, 2014.
** O. Inel, et al. CrowdTruth: Machine-Human Computation Framework for Harnessing Disagreement in Gathering Annotated Data. In: International Semantic Web Conference (ISWC), 2014.
* The work of Tudorache et al. on collaborative ontology engineering might also be relevant for this work:
** Tudorache, T., Nyulas, C., Noy, N. F., and Musen, M. A. (2013). WebProtégé: A collaborative ontology editor and knowledge acquisition tool for the web. Semantic Web 4(1), 89–99.
** Tudorache, T., Nyulas, C., Noy, N. F., Redmond, T., and Musen, M. A. (2011). iCAT: A Collaborative Authoring Tool for ICD-11. In: OCAS 2011, 10th International Semantic Web Conference (ISWC), 2011.
* The authors explain that they introduce two new aspects (compared to their previous work): using relationships other than "is-a-kind-of", and asking about more than just true/false statements. However, this is also aligned with current state-of-the-art approaches.
* The authors explain that the humans involved in gold standard generation had discussions to reach consensus. Did the authors consider enabling such discussions between crowd workers? Current platforms do not offer such features, but there are means to enable this outside the platforms (e.g. in a separate Web site). Work on argumentation techniques could be relevant.
* Did the authors consider having a qualification test that not only filters out robots and cheaters, but also humans who do not have a minimum level of domain knowledge?
* Given that microtask crowdsourcing involves money, it would be interesting to research optimization strategies, i.e. an algorithm that automatically processes the statements and cited sources produced by the information specialists before the crowd annotation, so as to reduce the amount of data the crowd has to process (a rough sketch of what I mean is given below). I would encourage the authors to investigate this direction, since something like this could considerably improve the contribution of this work. Currently, the most valuable aspect of the methodology is the use of crowdsourcing for annotating statements as "true", "viewpoint" and "error". Even if the evaluation uses several standard methods, the method is quite aligned with state-of-the-art approaches that use crowdsourcing for labeling data. Extending the approach with an automatic method to assess the nature of statements (true / viewpoint / error) could be a valuable contribution to this work.
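The following sketch only illustrates the kind of triage I have in mind; the feature names, thresholds and labels are entirely hypothetical and are not something the authors currently compute.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    text: str
    source_agreement: float   # hypothetical: fraction of cited sources supporting the statement
    model_confidence: float   # hypothetical: confidence of an automatic truth/viewpoint/error classifier

def triage(statements, agree_hi=0.9, agree_lo=0.2, conf_min=0.8):
    """Auto-label the clear-cut statements and send only the ambiguous ones to the crowd."""
    auto_labelled, to_crowd = [], []
    for s in statements:
        if s.model_confidence >= conf_min and s.source_agreement >= agree_hi:
            auto_labelled.append((s, "truth"))   # sources agree: likely true
        elif s.model_confidence >= conf_min and s.source_agreement <= agree_lo:
            auto_labelled.append((s, "error"))   # sources contradict it: likely erroneous
        else:
            to_crowd.append(s)                   # ambiguous, possibly a viewpoint: needs humans
    return auto_labelled, to_crowd
```

Even a crude triage of this kind would concentrate the paid crowd effort on the ambiguous statements, which, as discussed above, are precisely the interesting viewpoint cases.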
# Overall Recommendation
Given the aforementioned limitations of this work, and considering that this is a submission to a journal, my recommendation is to reject the submission. However, I would encourage the authors to continue and improve their work, as combining multiple and conflicting views in ontology engineering is a very interesting research direction. Approaches that handle this situation might also be very useful for the adoption of Semantic Web technologies.