Review Comment:
Generally speaking, I like the paper and its topic, and I think it could be worth publishing in SWJ, because it is strongly in line with the special issue CfP. Still, I have some major remarks that should be addressed before acceptance. My observations mainly relate to two aspects: the overall scope of the paper and of the presented results, and the experimental design.
Regarding the paper's scope, I am not convinced that the authors provided results that can be considered valid for Linked Data at large, nor for LD quality issues of any kind. This narrower scope is quite adequately indicated in the paper title, but it is not fully reflected in the paper text itself.
The authors stated that they focused on DBpedia "as a representative data set for the broader Web of Data"; I largely disagree with that for the following reasons: (1) not all LD sources are produced by following a transformation/mapping process like DBpedia's, and the types of errors that occur in a specific LD source heavily depend both on the intrinsic quality of the source and on the possible translation process to RDF; (2) DBpedia is very general in its coverage of topics, while LD sources (and their possible quality issues) can be very specific to a given domain; as a consequence, the capability of a crowd of experts or workers to identify and assess LD quality issues is highly influenced by the domain/coverage of the source. Therefore, I recommend softening the claims of generality of the presented results and clearly stating that they are "proved" only on DBpedia. I am sure the authors can speculate on the extent to which those results can be considered general, but at the present moment I believe they cannot affirm that they fully addressed the research questions as introduced.
Furthermore, the experiments focused on some specific LD quality issues and not on the whole list of possible issues (which are comprehensively listed in the authors' previous works). While this is fine per se - I did not expect the authors to run experiments on the whole set of issues - it makes the presented results even less general. It would also be valuable if the authors added an explanation of why those specific quality issues (instead of others) were selected for the experiments.
As a global recommendation, I suggest that the authors rephrase the parts of the paper that might lead readers to believe in a general validity of the results for any LD source and/or any LD quality issue.
Regarding the experimental design, my impression is that a number of results are not fully attributable to the intended characteristics of the experiments (expert vs. laymen crowdsourcing, Find vs. Verify stage, etc.) but are a collateral effect of non-optimal design choices, in terms of (1) the choice of triples/data and quality issues to be tested, (2) the user interface and support information provided to participants, and (3) the reported indicators and baselines.
Apart from the considerations on DBpedia already provided above, I have a number of concerns about the employed triples. The experts were given the opportunity to find quality issues in triples that were (i) random, (ii) instances of some class, or (iii) manually selected; while this may appear reasonable, the effect is that the workers' crowds (in both experimental workflows) were presented with information "chosen" by somebody else, possibly making the task hard or even impossible because of the triples' domain. I would have expected the authors to run a *controlled experiment*, i.e. to select a general-purpose subset of DBpedia that - at least in terms of content - was at the same "difficulty" level for all the involved crowds. Furthermore, even restricting attention to a set of selected subjects, I think that not all triples were suitable for the intended experiments; indeed, some specific cases emerged that are not related to the intrinsic characteristics of quality assurance. While it is generally fine for problems to surface during experimentation, it is also reasonable to expect that, when preparing an experiment, the obvious sources of problems are avoided. Some examples:
- specific datatype objects (like dates vs. numbers, which are understandably easy to mistake)
- owl:sameAs links (which may have been interpreted in a "purist" way by LD experts, who can be reluctant to accept such triples because of their logical implications)
- rdf:type triples among the incorrect link issues (apparently unclear to the MTurk workers, and partially also to me: why were rdf:type triples considered among the "links" instead of the "values"?)
- DBpedia translation-specific triples (which do not make any sense in such an evaluation setting, and should have been filtered out in the first place).
Another fact supporting this criticism is that the two authors who created the "ground truth" obtained quite low values of inter-rater agreement.
The tested quality issues are also somewhat unbalanced: while incorrect object extraction or incorrect links relate to the entities' "meaning", incorrect datatypes or language tags are more "structural" mistakes; as a consequence, it is not surprising that the latter is the case in which the paid workers performed worst.
Some experimental design flaws also stem from the user interface and the information provided to the involved crowds.
First of all, the quality issues to be identified are not presented at the same granularity level to experts and laymen, since the experts were given quite a detailed taxonomy of issues, while the MTurk workers were given only three possibilities.
Regarding the MTurk-based Find stage, I personally find the screenshot in Figure 3 very confusing, since it seems that the Wikipedia column (which should provide the "human-readable" information) is less complete than the DBpedia column: how were the workers expected to interpret this? Were they instructed to click on the Wikipedia page link (if provided) to check?
Regarding the MTurk-based Verify stage, Figure 5 is also problematic, since it does not display the Wikipedia preview (explained in the text) and seems in any case to require quite some effort or knowledge to be judged; it would have been interesting to know whether the authors were able to trace if and how many times the workers actually clicked on the Wikipedia link, or how much time it took them to make a decision.
Some of the examples given in the text are also misleading and therefore not fully suitable to be offered to the crowds as explanations (if they were); e.g. is Elvis Presley's name language-dependent?
I also have some doubts about the choice of evaluation metrics.
Regarding the tables, the authors would have done better to report sensitivity and specificity rather than TP and FP counts, because rates are more easily compared and interpreted than raw counts. The same comment applies to the bar charts, which are hard to judge because of the different value ranges: using rates would improve readability and better convey the message.
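To illustrate what I mean, a minimal sketch in Python; the counts below are made up and do not come from the paper:

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Turn raw confusion-matrix counts into sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

# Hypothetical counts, only to show that rates are directly comparable across settings.
print(rates(tp=120, fp=30, tn=200, fn=15))  # -> (0.888..., 0.869...)
```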
Furthermore, I am not at all convinced of the significance of keeping track of the first answer, and even less of the comparison between the first answer and majority voting: while I understand the cost consideration, it would have been more meaningful to compare 3-worker majority voting vs. 5-worker majority voting, since a single worker cannot express any kind of answer "agreement" or "variance".
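A sketch of the kind of comparison I have in mind, assuming the per-triple worker answers are available (the labels below are invented):

```python
from collections import Counter

def majority(votes):
    """Return the label chosen by most votes (ties broken arbitrarily)."""
    return Counter(votes).most_common(1)[0][0]

# Invented answers from 5 workers per triple; compare the first 3 votes vs. all 5.
answers = {
    "triple-1": ["correct", "incorrect", "correct", "correct", "incorrect"],
    "triple-2": ["incorrect", "incorrect", "correct", "incorrect", "correct"],
}
for triple, votes in answers.items():
    print(triple, "3-worker:", majority(votes[:3]), "5-worker:", majority(votes))
```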
Finally, I found the baseline section quite odd: the authors describe the interlinks approach, which makes perfect sense (even if it concerns only one of the tested quality issues), but they also introduce the TDQA assessment, which cannot be compared to the experimental results (and thus cannot be considered a baseline approach). The authors would do better to build a baseline (e.g. using SPIN or ShEx-based constraint checks) that tries to identify datatype/language and object value issues (for all those cases in which such checks can be implemented, of course); that would be a reasonable baseline to compare against.
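As a rough indication of the kind of baseline I have in mind, here is a sketch only: the property, the expected datatype and the input file are my own assumptions, and a SPARQL filter via rdflib is just one possible realisation of such a check:

```python
from rdflib import Graph

g = Graph()
g.parse("dbpedia_sample.nt", format="nt")  # assumed local excerpt of the evaluated triples

# Flag literals whose datatype differs from the one the property is expected to carry.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?o WHERE {
  ?s dbo:birthDate ?o .
  FILTER (isLiteral(?o) && datatype(?o) != xsd:date)
}
"""
for row in g.query(QUERY):
    print("suspicious datatype:", row.s, "dbo:birthDate", row.o)
```

Equivalent checks could be expressed declaratively in SPIN or ShEx; the point is only that they yield a per-triple verdict that can be compared to the crowdsourcing results.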
In any case, I would suggest the authors to make a final summary table that compares the two workflows as well as the comparable baselines, so to support the final discussion.
Some relatively minor remarks:
- page 4, 1st column, postal code example: in some countries postal codes contain letters, so it is not necessarily true that it should be an integer
- sections 3.1.1 and 3.1.2 do not provide any reference
- section 3.2 is clearly related to reference [3], so there is no need to include the citation several times
- page 6, definition 1: why is it 2^Q? can all the quality issues happen at the same time?
- page 6, beginning of 2nd column: this is very specific to DBpedia, so it is in contradiction to the generality claims of the paper
- page 8, end of section 4.1: the authors explain the redundancy during the Find stage by experts; if an agreement is already achieved, is the Verify stage useful at all?
- page 9, 1st column: it seems that the prune step is specific to the experimental setting, rather than to the general case (non-dereferenceable URIs should have been discarded in the first place...)
- page 9, 2nd column: reference to Figure 1 should probably be Figure 3
- page 10, footnote 8: it is not simply for the sake of simplicity, since a datatype and a language tag cannot occur together; furthermore, for laymen there is probably not much difference between "value" and "link" either
- section 4.4 is not completely necessary in the paper
- page 15, end of 1st column: why were the DBpedia Flickr links filtered out? if there was some doubt about their validity or relevance to the tests, why not filter them out before the Find stage?
- page 15, section 5.2.4: the example triple is totally unclear, what does it mean? why is it correct?
- table 3: from the text 1512 seems to be the number of the "marked" triples rather than the evaluated ones
- table 4: the caption does not explain that the results refer to the "ground truth" sample (same for table 6); why was the LD expert inter-rater agreement computed over all the triples together?
- page 16, beginning of 1st column: the need for specific technical knowledge about datatypes seems to be yet another experiment design flaw
- page 17, list in the 1st column: what are Wikipedia-upload entries? what does it mean w.r.t. the misclassification discussion?
- page 18, section 5.3.2: the text says 30k triples while table 5 reports almost 70k triples, so which is the correct number? why was the sample selected on the basis of "at least two workers" and not by majority voting? does the sample contain the "exact same number of triples" or exactly the same triples? why did this Verify stage take more time than in the other workflow?
- page 18, end of 2nd column: the geo-coordinates example seems yet another symptom of an ill-designed experiment
- table 5: the sample used for the Verify task does not have the same distribution of triples across the quality issues as the Find stage; can the authors elaborate on the possible effects of those different proportions in terms of loss of information?
- page 19, 2nd column: the problem with non-UTF8 characters seems another sign of sub-optimal design of the user interface for the experiments
- page 20, 2nd column: possible design flaw also in the case of proper nouns
- figure 7(b): TP+TN are complementary w.r.t. FP+FN; rates would be more meaningful than total counts
- page 21, 1st column: there are a couple of occurrences of "Find" that are more probably "Verify"; it would be interesting to know whether the rdf:type triples that were correctly classified were judged by the same worker(s)
- page 21, 2nd column: it is not clear on how many triples the 5146 tests were run; on the 509 "ground truth" triples? what exactly counts as a success/failure in the tests?
- page 22, 2nd column: were only the foaf:name links used or also the rdfs:label ones? the listing is somewhat useless, as the text was clear enough; also, it is unclear what the "triples subject to crowdsourcing" were, since different datasets were used in the previous tests
- page 23, 1st column: I didn't get what the following consideration refers to: "workers were exceptionally good and efficient at performing comparisons between data entries, specially when some contextual information is provided"
- page 24, footnote 21: the link is broken
- page 24, 2nd column: "fix-find-verify workflow" is probably Find-Verify
- page 25, end of 1st column: the authors write "Recently, a study [18]..." but the paper was published in 2012