Multi-viewpoint ontology construction and classification by non-experts and crowdsourcing: the case of diet effect on health

Tracking #: 1056-2267

Authors: 
Maayan Zhitomirsky-Geffet
Eden S. Erez
Judit Bar-Ilan

Responsible editor: 
Guest Editors Human Computation and Crowdsourcing

Submission type: 
Full Paper
Abstract: 
Domain experts are skilled at building a narrow ontology that reflects their sub-field of expertise, grounded in their work experience and personal beliefs. We call this type of ontology a single-viewpoint ontology. There can be a variety of such single-viewpoint ontologies for a given domain, representing a wide spectrum of sub-fields and expert opinions. To form a complete formal vocabulary for the domain, however, they need to be linked and unified into a multi-viewpoint model, with viewpoint statements marked and distinguished from objectively true statements. We propose and test a methodology for multi-viewpoint ontology construction by non-expert users and crowdsourcing. The proposed methodology was evaluated in a large-scale crowdsourcing experiment with about 750 ontological statements. Typically, in crowdsourcing experiments the workers are asked for their personal opinions on the given subject. In our case, however, their ability to objectively assess others' opinions is examined as well. Our results show substantially higher classification accuracy for the objective assessment approach than for the experiment based on personal opinions.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 29/Jun/2015
Suggestion:
Reject
Review Comment:

# Article summary

This article presents an approach to multi-viewpoint ontology development applied to the particular scenario of diet and health. The authors focus their research on the process of producing an ontology with multiple non-experts. To do so, the article describes a methodology in which the authors 1) selected the most disputable subtopics (“based on advice of a senior clinical diet expert”); 2) built a unified ontology from ontological triples defined by several information specialists, who searched for knowledge on the different subtopics at professional academic and governmental medical Web sites such as PubMed; and 3) asked the crowd to classify the obtained ontological statements as “true”, “viewpoint” or “erroneous” statements. The authors measured the quality of executing the third step with crowd workers by comparing the crowd responses with a gold standard generated by a different group of information specialists, who classified the statements based on the available literature. The crowd classification was executed in two variations: asking crowd workers for their own opinion (“subjective assessment”, as the authors call it) and asking crowd workers for others' opinion (“objective assessment”). The authors found that the latter yields results of higher accuracy.

# Originality

With the rise of crowdsourcing and collaborative technologies, managing multiple views in ontology engineering is gaining momentum. The authors present the methodology as their major contribution; however, the methodology itself does not provide a very novel solution compared to the state of the art. What the authors do is use microtask crowdsourcing to label the data produced by information specialists, in order to have ontological statements classified into “truth”, “viewpoint” and “error”. A positive aspect is that the authors try to compare subjective and objective formulations of their microtasks. As a case study, a multi-viewpoint diet-and-health ontology might not be common, as highlighted by the authors.

# Significance of results
The topic of aggregating the knowledge of multiple humans, who may differ in the way they process and interpret information, is very relevant to the area of ontology engineering. Diverse perspectives in knowledge engineering can be useful for developing a richer model, as well as for identifying flaws in the represented knowledge.

While the topic and the technologies used in the article are relevant and interesting for the HC&C community, I identify several limitations in the definition and evaluation of the methodology, which is presented as the major contribution of the submission (see details below). A positive aspect is that the authors analyzed different evaluation measures and aggregation methods. However, for a methodology to be properly evaluated, I think it should have been analyzed across different domains and in more detail.

## Managing multiple viewpoints
* Given that the focus of this work is to handle multiple views, the way in which the process starts (i.e. identifying the most disputable sub-topics with the advice of a single domain expert) is quite restrictive. The authors admit that domain experts can have a single, narrow viewpoint; therefore, it could happen that the expert identified the set of disputable subtopics in a very peculiar way, far from the literature and the community of domain experts. Even the granularity of the viewpoint classification may vary considerably from one person to another. Would it not be more natural to start the process with multiple humans, even if they were not domain experts, and automatically analyze the emergence of controversial subtopics? That is, to start with the information specialists directly, introducing only a little guidance on the target subtopics.
* The article suggests that the step in which several ontologies are combined and normalized (sections 3.1.3, 3.1.4) was done manually by the authors. If the methodology is intended to be executed by other researchers or users who are not ontology engineers, who would be in charge of this step? Would it be feasible to automate it? And if so, how? This looks like one of the most critical and challenging parts of aggregating knowledge from different sources, and it is carried out manually by the authors.
* In order to evaluate the methodology, I would suggest that the authors run their experiments in multiple knowledge domains, not only in diet and health. This would enable a deeper analysis and could prove the methodology to be generalizable. There might be domains in which the disagreement is at different levels, and there might be domains in which the crowdsourcing experiment requires further quality assurance measures.
* The authors highlight three research questions. However, the evaluation focuses on the crowdsourcing step (i.e. the third research question). How would the authors consider the first and second questions to be answered? Did they consider and compare alternatives for building a multi-viewpoint ontology or the gold standard? These were challenges of the work, which were addressed in a particular way, but I wonder if one can claim that they were “tested research questions”. The first question mentions a “comprehensive multi-view ontology”: how is comprehensiveness measured there? How can the authors determine whether the ontology built by the information specialists is of high or low quality? Is this measured by the number of errors and viewpoints identified by the crowd classification?
* The authors say that “semantic contradictions were intentionally left in the ontology to reflect diverse viewpoints on the domain”; however, after the crowd classification, statements are said to be split into “true statements”, “viewpoints” and “errors”, and these semantic contradictions fall into the category of viewpoints. It is not clear whether these statements are specifically annotated as “truth”, “viewpoint” and “error” within the RDF triples (a concrete sketch of what I mean follows this list). How is this aggregated knowledge meant to be used or exploited, for example in a decision support system? What would be the goal of identifying viewpoints?
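
To make this question concrete: below is a minimal sketch, in Python with rdflib, of one way such a classification could be attached to a triple via standard RDF reification. The namespace, statement IRI and annotation property are entirely hypothetical; whether the authors use reification, named graphs or another mechanism is precisely what the paper leaves unclear.

```python
# Minimal sketch (not the authors' actual model) of attaching a crowd label
# to an RDF triple via standard reification. All IRIs below are hypothetical.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/diet-ontology#")  # hypothetical namespace
g = Graph()

# The original ontological statement, e.g. "a low-carb diet reduces weight".
s, p, o = EX.LowCarbDiet, EX.reduces, EX.Weight
g.add((s, p, o))

# Reify the statement and tag it with its crowd classification
# ("viewpoint" here; could equally be "true" or "error").
stmt = URIRef("http://example.org/statements/42")  # hypothetical statement IRI
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, s))
g.add((stmt, RDF.predicate, p))
g.add((stmt, RDF.object, o))
g.add((stmt, EX.classification, Literal("viewpoint")))  # hypothetical property

print(g.serialize(format="turtle"))
```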

## Experiments
* A positive aspect of the evaluation is that the authors compared several aggregation methods.
* One of the findings of this work is that when workers are asked to give their own opinion, they select more radical answers (i.e. something is true or false), while when they are asked to say what they think others would answer, they select more “grey-scale” answers. Would it not be fairer to compare the same possible answers in a first-person (I think...) and third-person (experts think...) form? Did the authors look at each worker's set of answers, to see if they were repeatedly selecting strong or grey-scale answers?
* Another finding is that crowd workers seem to agree more when they say what they think experts think. Do the authors have an interpretation for this result?
* “The viewpoints were the hardest cases to be found”, but they are actually the most interesting ones, because finding contradictions, or being aware of them, is the most difficult part of the process. I wonder if this would be analogous in another domain. Usually, people tend to believe in all kinds of diets, and therefore crowd workers might be biased towards admitting that things are always true in most of the cases.
* Why did the authors exclude the definitions of concepts (as stated in 3.3)? Indeed, they could compare their results with and without definitions.
* Were crowd workers allowed to search on the Web for knowledge?
* Did the authors consider asking crowd workers to provide an example of a contradiction when they select that something is not always true?
* Why did the authors not include a fourth case, in which the subjective version is also shorter, like in Experiment 1?
* The authors read from Table 2 that SVM and MLP are the methods that elicited the best results; however, as highlighted in the table, a 75% majority vote (popular among crowdsourcing requesters) also provides highly accurate results when workers are asked about experts' opinion (see the sketch after this list). What is the analysis of such results? Are there situations in which a particular kind of experiment (subjective / objective) and a particular aggregation method are more recommendable?
* Did the authors check whether formulating the questions to indicate which kind of experts, and in which context, would say that something is true or not changes anything in the results?
* Did the authors compare the crowd workers’ collective annotation to the labels provided by the information specialists while generating the gold standard? Was there disagreement to the same degree and on the same statements?
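
To illustrate the comparison I have in mind between the simple baseline and the learned aggregators, here is a minimal sketch of a 75% majority-vote rule. The label values and the handling of statements without a qualified majority are my own assumptions, not the authors'.

```python
# Sketch of a 75% majority-vote aggregation over the workers' labels for one
# statement; returns None when no label reaches the threshold (my assumption).
from collections import Counter

def majority_vote(labels, threshold=0.75):
    """Return the label chosen by at least `threshold` of the workers, otherwise None."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label if count / len(labels) >= threshold else None

print(majority_vote(["true", "true", "true", "viewpoint"]))        # -> "true" (3 of 4 = 75%)
print(majority_vote(["true", "viewpoint", "viewpoint", "error"]))  # -> None (no 75% majority)
```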

# Quality of Writing

The article is written in correct English, it is readable and there are only a few typos:
* section 3.1.2 incomplete sentence “The participants were given a list of upper level concepts and properties (such as)”
* section 3.2.1: delete the stray “e” in “To this end, at first, a questionnaire was e devised for crowdsourcing workers”

However, I think that the text lacks specificity in some parts (see more details below).
The text could also be improved by restructuring some parts, especially to make it more aligned with the standard elements of scientific publications (suggestions listed below). Illustrations are quite clear; however, as a minor suggestion, I would encourage the authors to change the colours of figures 5, 6 and 7, because when the article is printed in black and white the bars are not easily distinguishable. Moreover, including X-axis and Y-axis labels in the figures would make them more readable.

## Suggestions for improving clarity

* In the introduction, the authors justify the use of crowdsourcing by arguing that “in addition to high price and low availability, experts tend to build narrow single-viewpoint ontologies which capture their individual opinions, beliefs and work experience, but which might be unacceptable for other experts” and that “non-expert subjects and particularly information specialists specifically guided can be more objective and thus accurate than domain experts”. I agree that crowdsourcing provides an easier way to get humans involved, and therefore the authors decide to work with non-experts. However, the reader may wonder about the facts behind the second statement and the way such objectivity is measured.
* The concepts of “truth”, “viewpoint” and “error” need to be precisely and formally defined from the beginning. The description provided in the gold standard section should be extracted and highlighted in earlier sections, in order to give the reader a better understanding of this classification.
* In my opinion, the authors should explain more prominently from the beginning the purpose of each of the steps (e.g. what the task of crowd workers in the process is). The reader should clearly understand what the role of, e.g., crowd workers is. The use of crowdsourcing is justified in the introduction and the research questions suddenly mention “classification”, but the reader might only understand the point of every step after reading the details of the experimental setup. In my opinion, this should be improved, because the reader might want to know whether the methodology is suitable for her requirements.
* In my opinion, the text should describe more clearly what “experts” and “non-experts” are. The authors could distinguish between domain experts and task experts (e.g. information specialists).
* Some of the explanations in Section 3 (methods) are imprecise. For example, when describing the crowdsourcing experiment (section 3.2.1), the authors could provide more details about the way the microtasks are generated and configured. What exactly do the qualification tests contain? How were the questions selected when designing the qualification tests? How could they be created when applying the methodology to other knowledge domains? Moreover, the authors say they received “30,000 individual worker judgments”: how many per microtask? What was the average number of microtasks completed per worker?
* The description of the experimental setup should be separated from the methods. There should be a distinction between methods and experimental evaluation.
* A more detailed description of the data used (i.e. the statements) would also be convenient.
* All the elements used in the evaluation should be clearly defined, and there should be a better distinction between the evaluation measure (accuracy) and the aggregation method. Both should be described more precisely (e.g. more details on the use of methods like SVM should be given).
* There are descriptions in the experiments sections that should be rearranged. For example, the details on the way qualification tests are defined (with true and false statements) should not be included in “results and discussion” but in the experimental setup instead.
* It is not clear why the authors decide to evaluate the “performance of two types of classification, 3-class and 2-class classification” within the crowdsourcing experiment, since they repeatedly make the distinction between the three kinds of statements throughout the paper.
* Do the information specialists generating the gold standard see the sources supporting the statements produced in steps 3.1.3–3.1.4? How many humans should usually be involved in such gold standard generation? Should they satisfy any requirements?
* It would be very convenient to introduce a section with a more detailed description and analysis of the cases (i.e. ontological statements) on which crowd workers agree and disagree.

# Open access to research resources

* Unfortunately, the RDF ontology was not found on the specified project Web site. Two Excel data files are downloadable, but I could not find the ontology that is mentioned on the Web site as “The ontology is based on a few hundreds scientific articles and currently comprises over 750 assertions (RDF-style triples)”.
* I would encourage the authors to provide access to the microtask templates, or at least to a CrowdFlower sample job, including a qualification test. Having access to the set of statements that are directly given to the crowd (as a data dump) would also be convenient.

# Other comments and suggestions

* The authors have cited one of the early works of Aroyo et al. on CrowdTruth. However, there are currently several publications on this framework, which the authors could probably find very interesting.
** L. Aroyo and C. Welty. The Three Sides of CrowdTruth. Human Computation, 2014.
** O. Inel et al. CrowdTruth: Machine-Human Computation Framework for Harnessing Disagreement in Gathering Annotated Data. In: International Semantic Web Conference (ISWC), 2014.
* The work of Tudorache et al. on collaborative ontology engineering might also be relevant for this work:
** T. Tudorache, C. Nyulas, N. F. Noy, and M. A. Musen. WebProtégé: A Collaborative Ontology Editor and Knowledge Acquisition Tool for the Web. Semantic Web, 4(1):89–99, 2013.
** T. Tudorache, C. Nyulas, N. F. Noy, T. Redmond, and M. A. Musen. iCAT: A Collaborative Authoring Tool for ICD-11. In: OCAS 2011 Workshop at the 10th International Semantic Web Conference (ISWC), 2011.
* The authors explain that they introduce two new aspects (compared to their previous work): using not only “is-a-kind-of” relationships and not asking only about true/false statements. However, this is also aligned with current state-of-the-art approaches.
* The authors explain that the humans involved in gold standard generation had discussions to reach consensus. Did the authors consider enabling such discussions between crowd workers? Current platforms do not offer such features, but there are means to enable this outside the platforms (e.g. in a separate Web site). Work on argumentation techniques could be relevant.
* Did the authors consider having a qualification test that not only leaves out robots and cheaters, but also humans who do not have a minimum level of domain knowledge?
* Given that microtask crowdsourcing involves money, it would be interesting to research optimization strategies, i.e., an algorithm that could automatically process the statements and cited sources produced by the information specialists before the crowd annotation, and thereby reduce the amount of data to be processed by the crowd (a toy illustration of this direction follows this list). I would encourage the authors to investigate this direction, since something like this could considerably improve the contribution of this work. Currently, the most valuable aspect of the methodology is the use of crowdsourcing for annotating statements as “true”, “viewpoint” and “error”. Even if the evaluation uses several standard methods, the method is quite aligned with state-of-the-art approaches using crowdsourcing for labeling data. Extending the approach with an automatic method to assess the nature of statements (true / viewpoint / error) could be a valuable contribution to this work.
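
As a purely illustrative sketch of this direction (not something the authors propose), one could train a cheap text classifier on the statements already labelled in the gold standard and route to the crowd only those statements about which the classifier is uncertain. All data, thresholds and model choices below are toy assumptions.

```python
# Toy pre-filtering sketch: only statements the classifier is unsure about
# would be sent to the crowd, reducing the number of paid judgements.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hand-made toy data: a few statements already labelled in the gold standard...
labelled_texts = ["vitamin C cures the common cold",
                  "water is composed of hydrogen and oxygen"]
labelled_classes = ["viewpoint", "true"]
# ...and new statements produced by the information specialists.
unlabelled_texts = ["a low-carb diet reduces weight",
                    "proteins are made of amino acids"]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(labelled_texts), labelled_classes)

# Route to the crowd only the statements with low classifier confidence.
probs = clf.predict_proba(vec.transform(unlabelled_texts))
to_crowd = [t for t, p in zip(unlabelled_texts, probs) if p.max() < 0.8]
print(to_crowd)
```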

# Overall Recommendation

Given the aforementioned limitations of this work and considering that this is a submission to a journal, my recommendation is to reject the submission. However, I would encourage the authors to continue and improve their work, as the topic of combining multiple and conflicting views in ontology engineering is a very interesting field. Having approaches that handle this situation might also be very useful for the adoption of Semantic Web technologies.

Review #2
By Philippe Cudre-Mauroux submitted on 09/Jul/2015
Suggestion:
Minor Revision
Review Comment:

In this paper, the authors tackle the problem of multi-viewpoint ontology construction by leveraging non-experts and crowdsourcing. The idea is to build ontologies that capture a variety of viewpoints on specific facts (e.g., by annotating triples with three labels: absolute truth, viewpoint, error). The authors build a multi-viewpoint ontology by combining several single-viewpoint ontologies focusing on different facets of dietary research, ask the crowd to check or give their opinion on the resulting assertions, and finally aggregate the crowd answers leveraging several state-of-the-art methods. They empirically show that crowd workers can accurately classify statements in a multi-viewpoint ontology and distinguish between true, viewpoint and erroneous statements for a given professional domain with over 90% accuracy.

Overall, I feel like this is an exciting article presenting a down-to-earth, pragmatic but scientifically sound methodology focusing on how crowd workers could help build multi-viewpoint ontologies. The topic tackled is of great importance, as a lot of attention has recently been given to multi-viewpoint ontologies or statements (e.g., in the biomedical domain), and to their potential evaluation using non-experts and/or crowd workers.

The paper is generally speaking clear and very well written. The introduction gives a solid motivation for the work, but is in my opinion too vague in terms of the methodology introduced; it would be much clearer to prominently introduce the 3-phase method already in the introduction, in contrast to the current text, which only hints at the method.

Section 3 introduces the methodology in greater detail. The individual technical steps used to come up with the multi-view ontology (e.g., ontology merging, micro-task crowdsourcing, aggregation using majority vote, Bayesian approaches or SVM) are not novel per se; however, this piece of work is to the best of my knowledge the first one to combine them in an end-to-end methodology and to empirically evaluate it through a large-scale experiment.

The methodology introduced for the approach is overall sound. However, I have some doubts about the design of Experiment 3, and more specifically about possible answers 4 and 5, which sound relatively close in terms of their formulation / semantics (I believe that many crowd workers would consider them semantically similar). The authors should discuss the rationale for including both statements as well as the potential bias introduced by workers clicking on one or the other indifferently.

One dimension that is missing in the discussion is the monetary aspect. The authors should comment on the total budget of the experiment, and on the potential tradeoff between monetary budget and result quality for the different scenarios they consider.

Also, I feel like an experiment/discussion on qualification tests is missing. The results of the crowdsourcing task are obviously directly related to the worker selection process, as the experiment focuses on pretty technical questions and on scientific literature mining. In that context, discussing various qualification tests would be highly relevant (e.g., discussing the quality of the results obtained with no qualification test, or with tests of varying complexity).

Review #3
By Miriam Fernandez submitted on 14/Jul/2015
Suggestion:
Major Revision
Review Comment:

The following paper presents a methodology for multi-viewpoint ontology construction by non-experts.

The idea of the paper, that of reflecting multiple points of view within an ontology, is very interesting and may have multiple applications, especially within controversial domains. However, I have several concerns about the presented methodology and its evaluation.

Regarding research question (1), whether and how a group of non-expert subjects can collaboratively produce a comprehensive multi-viewpoint ontology, it seems that the proposed methodology requires a significant number of additional steps (apart from the ones performed by non-expert subjects), and the methodology to perform these steps (as far as I can see in the paper) has not been formalised.
- The selection of several viewpoints (subtopics), which constitute the basis of the ontology, is based on the knowledge of a domain expert. However, the methodology behind selecting some subtopics and not others for a given domain is not specified. Additionally, it is argued in the paper that domain experts may have a bias towards their own opinion. In this particular case, they may also have a bias in selecting which viewpoints or subtopics are more relevant / more controversial.
- Once the viewpoints have been selected, a set of concepts and relations covering those viewpoints needs to be pre-selected. However, it is not clear from the current narrative who selects those initial concepts and relations, what the rationale behind their selection is, and what resource(s) are used to extract them.
- When performing the individual ontology construction (3.1.2), it is also unclear what methodology is proposed to select the resources (relevant literature) used for the task, i.e., who decides which resources are trustworthy or not, and based on what?
- When performing the single-viewpoint ontology creation (as far as I understand, this is performed by the non-expert users working on the same viewpoint), the paper mentions that:
o “ontological errors and inconsistencies are fixed” -> it is unclear how non-expert users, without prior knowledge of ontology engineering, can identify certain types of ontological errors and inconsistencies.
o Statements that are not related to the given subtopic are removed / synonyms are unified (this may lead to disagreements among the participants over whether a statement belongs to the domain or whether two synonyms should really be the same concept). It is not specified how disagreements are resolved in such cases.
o Statements that semantically contradicted each other were deleted. I can’t really understand the rationale behind this part of the methodology. As far as I understand, the whole purpose is to allow multiple points of view within the ontology. However, it seems that you only allow different points of view across the selected subtopics but not within a subtopic. Is there any reason why?

Regarding the generation of the gold standard (section 3.2.2):
- I don’t understand how it is possible that the information specialists detected statements coming from non-academic sources (errors) if, in the methodology, the non-expert users are restricted to using academic resources when creating their statements (“Wikipedia, blogs, social networks and other non-academic sites were *not allowed* to be used”).

Regarding the performed evaluation:
- It is specified in the text that only “true / viewpoint / error” annotations were provided by the specialists when creating the gold standard. Note that this is a 3-point scale, while experiments 2 and 3 use a 5-point scale (when assessing correctness for experiments 2 and 3, answers need to be grouped). Note that the grouping from the 5-point scale to the 3-point scale (or the 2-point scale) may be arguable, and may therefore compromise the obtained results. It is mentioned that answers 1 and 2 correspond to true statements, 3 and 4 to viewpoints, and 5 to errors. However, for experiment 2, answer 2 (“all experts agree that this statement is correct in some cases”) can be understood as a viewpoint by the workers (since it is not true for all cases).
- Also, the phrasing of the sentences for experiments 2 and 3 is different, and the conclusion that workers are better at assessing others’ opinions than their own may be incorrect. Note that while experiment 2 uses “some cases”, experiment 3 uses “most cases”. This changes the meaning of the sentence, and may therefore introduce a bias in the comparison between experiments 2 and 3.
- It is unclear to me how statements are represented in terms of features for SVM and MLP (see the sketch after this list). It is mentioned in section 4 that statements were divided into tasks of 40 statements (with one of 26), a total of 19 tasks if I am not mistaken. Did all 40 workers perform all tasks, i.e., assess each statement (and is therefore each worker’s vote considered as an individual feature)? If not, please note that (i) some statements may have been assessed by more workers than others, leading to feature vectors of different sizes, and (ii) the judgement of a different person should be considered as a different feature.
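
To make the feature-representation concern concrete, the sketch below shows one worker-independent encoding, in which each statement is represented by the distribution of its workers' answers over the 5-point scale. The encoding, the toy data and the classifier settings are my own assumptions, not necessarily what the authors did; if instead individual workers' votes are used as features, the issues (i) and (ii) above apply.

```python
# Sketch of a worker-independent feature encoding: each statement becomes the
# fraction of workers choosing each of the 5 possible answers (toy data only).
import numpy as np
from sklearn.svm import SVC

ANSWERS = [1, 2, 3, 4, 5]  # the 5-point answer scale used in experiments 2 and 3

def answer_distribution(votes):
    """Fraction of workers who chose each of the five answers for one statement."""
    votes = np.asarray(votes)
    return np.array([(votes == a).mean() for a in ANSWERS])

# Two toy statements with hypothetical worker votes and gold-standard labels.
X = np.vstack([answer_distribution([1, 1, 2, 1, 2]),   # votes clustered at the "true" end
               answer_distribution([3, 4, 4, 3, 5])])  # votes clustered at the "viewpoint" end
y = np.array(["true", "viewpoint"])

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([answer_distribution([2, 1, 1, 2, 1])]))  # -> ["true"]
```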

Regarding the provided ontology, I haven’t been able to find the RDF version of it, only the Excel files (gold data and the model’s classifications). Also, no explanation is provided of what each of the columns in those files means.

In summary, while I think that the idea of the presented paper is very interesting, important issues regarding the proposed methodology and its evaluation need to be resolved. I would also have loved to see how the proposed ontology is finally modelled and how controversial statements (different viewpoints) are reflected in it (but right now it seems the ontology is not accessible under the provided link).

Other issues:

Section 3.2.1: was e devised -> was devised
Section 3.2.1: determied -> determined
Section 4: Forty statements were excluded from the test set (you mean 30; the other 10 were manually created, right?)