Empirical Methodology for Crowdsourcing Ground Truth

Tracking #: 1569-2781

Anca Dumitrache
Oana Inel
Benjamin Timmermans
Carlos Ortiz
Robert-Jan Sips
Lora Aroyo

Responsible editor: 
Guest Editors Human Computation and Crowdsourcing

Submission type: 
Full Paper
The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, ambiguity in the data, as well as a multitude of perspectives of the information examples are continuously present. In this paper we present an empirically derived methodology for efficiently gathering of ground truth data in a number of diverse use cases that cover a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics, capturing inter-annotator disagreement. In this paper, we show that measuring disagreement is essential for acquiring a high quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics with majority vote, over a set of diverse crowdsourcing tasks: medical relation extraction, Twitter event identification, news event extraction and sound interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 27/Apr/2017
Major Revision
Review Comment:

This revised version has been changed significantly. The paper is well written and focuses on a new approach to aggregate answers from different crowd workers answering the same task based on agreement metrics. The evaluation covers both closed and open-ended tasks including a good choice of different datasets and tasks.

The main findings (listed at the end of section 1) are not new (i.e., using more annotators improves the quality of the collected answers) and not important (i.e., the proposed approach performs better than a weak baseline). Overall, the novel research contribution appears limited.

Major comments:
- The discussion on the trade-off cost/quality due to involving more workers to answer a certain task should be expanded.
- The proposed method should be presented in more detail also including a model comparison with related methods (e.g., those described in section 6.3). Section 2.2 should also compare the proposed approach with standard annotator agreement metrics.
- Stronger baselines should be used to experimentally compare the proposed method (e.g., those described in section 6.3).

Other suggestions:
- Provide an example of majority vote aggregation for open-ended tasks in section 3.2
- It is unclear how manual evaluation has been performed over expert labels in section 3.3

Review #2
By Gerhard Wohlgenannt submitted on 01/May/2017
Minor Revision
Review Comment:

The paper suggests a method for generating ground truth from crowdsourcing input which
is claimed to be better than majority voting at deciding on correct annotations from ambiguous worker input.
The method is evaluated in 4 domains, with both closed and open-ended answer types on the respective task units.
As one of the main metrics they compute the cosine between the worker annotations vector and the unit vector of the annotation, and select annotations as ground truth if the cosine exceeds a pre-defined threshold.
They also show that a higher number of workers per unit (up to some point) improves the quality of results.
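The cosine-based scoring described above can be sketched as follows; the function name, the toy worker vectors, and the 0.7 threshold are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def unit_annotation_scores(worker_vectors):
    """Aggregate binary worker annotation vectors for one media unit, and
    score each annotation as the cosine between the summed media unit
    vector and that annotation's basis vector (as the review describes)."""
    unit_vector = np.sum(worker_vectors, axis=0)   # per-annotation vote counts
    norm = np.linalg.norm(unit_vector)
    if norm == 0:
        return np.zeros_like(unit_vector, dtype=float)
    # cosine with the i-th basis vector reduces to count_i / ||unit_vector||
    return unit_vector / norm

# three workers annotating one unit with four candidate annotations
workers = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
])
scores = unit_annotation_scores(workers)
threshold = 0.7   # hypothetical; the right value is task-dependent
ground_truth = [i for i, s in enumerate(scores) if s >= threshold]
```

Here only annotation 0, chosen by all three workers, clears the threshold; the two annotations chosen by a single worker each score well below it.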

The paper is well written and understandable, the goals of the work are quite clear. The approach has been evaluated in 4 different domains with different types of crowdsourcing tasks and input data.

The main metric of the media unit annotation score is not overly impressive;
I would expect the result quality to be similar if the authors simply lowered the threshold for the majority vote
and, for example, accepted all annotations supported by 30% of crowdworkers (instead of >= 50%) --
depending on the task.
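The relaxed baseline proposed here can be sketched in a few lines; the function, the vote counts and the worker total are hypothetical, for illustration only.

```python
def majority_vote(vote_counts, n_workers, support=0.5):
    """Accept every annotation supported by at least `support` fraction of
    the workers on a unit; support=0.5 is the classic majority vote, while
    a lower value (e.g. 0.3) gives the relaxed baseline suggested above."""
    return [ann for ann, count in vote_counts.items()
            if count / n_workers >= support]

votes = {"car": 6, "truck": 4, "bus": 2}         # hypothetical counts, 12 workers
strict = majority_vote(votes, 12)                 # only "car" reaches 50%
relaxed = majority_vote(votes, 12, support=0.3)   # "truck" (33%) is also accepted
```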

A main concern is that, depending on the task and the answer type,
the optimal "media unit annotation score threshold" differs greatly across the 4 use cases,
and if this threshold is not chosen in the right range,
the results will be worse than with simple majority votes.
So my first request is to give clear instructions on how to choose a suitable threshold value,
as that will make the approach more helpful for researchers applying the method.
The experiments suggest that a higher threshold is better in closed answer type situations,
and a lower threshold better for open-ended answers -- but there should be a more solid recommendation for selecting the threshold.

One very interesting aspect is the detection of spam workers. The authors present some
heuristics for this, but do not evaluate them in the paper, so it is hard to judge whether the measures
are successful. Request no. 2 (optional): if the authors have some evaluation data
about the quality of spammer detection, please present it in the paper.

In section 3.3 the authors manually re-label the judgments from domain experts and
add data from the crowdworkers into the ground truth -- thereby biasing the results of the evaluation of the quality of crowdtruth.
The authors should state clearly why they think manual relabeling is helpful and necessary.

Overall, the paper is borderline for me, but it contains some interesting work,
and I suggest accepting it, given minor corrections addressing the points mentioned.

Review #3
Anonymous submitted on 06/Aug/2017
Minor Revision
Review Comment:

I would like to thank the authors for their response. The focus and clarity of the paper have definitely improved. The new version of the manuscript addresses the major concerns raised in my previous review (Reviewer #2).

Still, I have some remarks that the authors should address in the paper before publication.

1. Clarification about the definition of CrowdTruth:
Throughout the paper, the authors refer to CrowdTruth variously as a method, a framework, and a methodology. Furthermore, the authors should clarify the main difference between the CrowdTruth ‘methodology’ proposed in the current manuscript and the work presented in [21]. This will help readers to better understand the contributions of this work.

2. ‘Triangle’ or ‘pyramid’ of disagreement:
The authors seem to use the proposed terms ‘triangle of disagreement’ (e.g., the header of Section 2.1) and ‘pyramid of disagreement’ (e.g., the caption of Figure 2) interchangeably. It would be clearer if the authors stick to only one term.

3. Settings of the CrowdTruth metrics in the experiments:
In the reported experiments, it is clear that the authors explore the impact of using different threshold values for the ‘media unit annotation score’ metric on the quality of the crowd answers. Nonetheless, the experimental settings do not describe the usage of the other seven CrowdTruth metrics:

3.1) Were the other metrics also considered for generating the crowd results with CrowdTruth?

3.2) What were the thresholds used for the other metrics? How were these thresholds configured? (if applicable)

If the scope of this paper is to investigate the impact of only one CrowdTruth metric, then this should be clarified in the paper.

4. CrowdFlower settings in the experiments:
In their response, the authors explain that besides the configurations reported in the paper, they used the default CrowdFlower settings. This clarification should definitely be included in the paper. Still, there are a couple of settings that should be further described:

4.1) When were the microtasks of each use case submitted to CrowdFlower?

4.2) The authors explain in Section 3.1 that the payment per task was gradually increased. What were the starting and maximum payments rewarded in each type of task? In addition, the authors report the cost/judgment in Table 3, but it is unclear how this cost relates to the payment configuration.

5. Computation of precision and recall:
The conducted evaluation reports the micro-F1 score to measure the quality of the studied crowdsourcing methodologies. This avoids biases based on the size of the ‘classes’ in the dataset. However, it is unclear what the definition of ‘class’ is in this context.
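For reference, micro-averaging pools per-class counts before computing precision and recall, which is why the definition of ‘class’ matters here. A minimal sketch, with hypothetical relation labels standing in for whatever ‘class’ denotes in the paper:

```python
def micro_f1(per_class_counts):
    """Micro-averaged F1: sum true positives, false positives and false
    negatives over all classes, then compute a single precision/recall."""
    tp = sum(c[0] for c in per_class_counts.values())
    fp = sum(c[1] for c in per_class_counts.values())
    fn = sum(c[2] for c in per_class_counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# hypothetical classes: class -> (tp, fp, fn)
counts = {"cause": (8, 2, 1), "treat": (5, 1, 4)}
f1 = micro_f1(counts)
```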

6. Collection of Trusted Judgements:
In Section 5, the authors mention that for the ‘sound interpretation’ task all the answers collected from the crowd were accepted as part of the trusted judgments. However, this is not mentioned in Section 3.3. This clarification should be included before presenting the results.

Further minor comments:
- Page 4, w is not defined in wwa(w), wsa(w), and na(w).

- Page 12, the following passage is difficult to follow: “According to our theory of the disagreement triangle, where the ambiguity of the task propagates in the crowdsourcing system affecting the degree to which workers disagree (i.e. the optimal number of workers per task), and the clarity of the unit (i.e. the optimal media unit-annotation score threshold).”

- The sentence “an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.” appears identically in the abstract and the conclusions. This should be avoided.