Review Comment:
This paper on the crowd and NER expands an ESWC paper. I am very sympathetic to this work, and think the angle that it takes is very fruitful. However, the analyses are missing in places, or only cover surface observations, and in many places it is hard to see how the novel contribution for SWJ goes beyond the ESWC paper. To bring it up to standard, it could do with further analysis and some deeper insights (i.e. perhaps just some more thinking time and a refactoring of the content). Specific feedback is given below. Personally I would much rather see this ms invested in and developed into a strong, sharp contribution at the end of the trail that the ESWC work started, rather than end up at another lesser journal - understanding the link between diverse annotation environments and diverse annotation skillsets is a crucial and timely problem in the field.
Specific feedback:
The writing quality is never a problem.
Consider reworking the title, perhaps dropping the first three words and appending "in NE annotation". To me, "Towards" implies that you didn't get there, which is fine, but this study is pretty much complete.
Page 2 para 2 "This paper offers"...: this is almost identical to the ESWC work, but this paper offers more, and that more should be stated here
Capturing uncertainty in annotation has been addressed recently (Plank et al. EACL 2014 best paper), and monitoring the interaction between crowd diversity and crowd recall has been examined too (Trushkowsky et al. ICDE 2013 best paper). Both of these works relate to this research at multiple places in the paper, and their findings/techniques are very relevant.
Reference 17's ø is encoded wrongly.
Cohn ACL 2013 "Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation" addresses extrinsic factors in estimating annotation quality, which is relevant to the literature noted here. The lit rev is very focused on the text-level task of crowdsourced NER, and does an excellent review as a direct consequence, but could be made broader in scope - we discover later that per-worker information is important, which intersects precisely with the Cohn ACL 2013 work. I.e., one should factor crowd worker performance into the annotation evaluation - a relevant point that has been made in the literature, and so should be present in the literature review.
Section 4.2: How good is Alchemy at sentiment? Can you evaluate this, e.g. with a SemEval test set? Or reference something that has? Otherwise, use of Alchemy just introduces noise of unknown reliability in the analysis.
Throughout: "nerd" -> "NERD"
Throughout: Table placement is very odd and really hinders readability. Please reconsider, and keep the tables near where they are referenced, in order. Personally I'd let latex take control of it, and just leave the table code higher up in the document than each table's first mention. E.g., Table 12 is mentioned before Table 4 (pg 9) and is presented on pg 14.
Section 5.2: After the paragraph marks please use a capital letter, e.g. "consider the tweet" -> "Consider the tweet"
Throughout: Use backticks for opening quotes in LaTeX, e.g. ``click on a phrase'' rather than "click on a phrase".
I'm not sure that the C1/C2 experiments demonstrate the point in ref [2] quite as directly as suggested in Section 7.1. Looking at Table 4, the loss is mostly in precision. This is at first sight a bit weird: precision is the part that the automated systems generally do better with (see [12]). In fact, in Table 4 recall is pretty much the same between C1 and C2. Low precision indicates cases where workers find "false positives", i.e. annotate entities not in the reference set. However, we already know that crowd recall is lower with smaller worker counts, and we know that diversity is key to getting good recall (cf. Trushkowsky ref above). So, what does this all suggest? I think what's really happening here, that the authors missed, is that existing datasets have low recall. The easy entities remain easy in C1 + C2; that's why recall is stable - the original annotators, and the crowd workers, always get them. Then, the crowd workers represent diversity beyond the original annotators, finding new entities, which presents as a precision drop. This is just like the inability of expert annotators to resolve things like "KKTNY", in prior datasets, which a diverse crowd *can* do. So while this table really shows us something happening, it's not enough to support [2] (although of course we all know it's intuitive: the raised cognitive load of annotating with a complex standard will lead to reduced performance); it's especially not enough to support [2] given the low number of crowd workers, and what we saw in the literature about required crowd worker counts. In short: something has been found here, but that thing is not strong evidence for [2].
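To illustrate the arithmetic with hypothetical counts (mine, not the paper's): suppose the reference set has 100 entities, the crowd finds 90 of them, and also marks 30 genuine entities the original annotators missed. Scored against the reference set, recall = 90/100 = 0.90 either way, but precision = 90/(90+30) = 0.75 - the extra, arguably correct, annotations can only register as false positives. That is exactly the pattern of stable recall and depressed precision described above, and it points at low recall in the existing gold standard rather than at the effect claimed from [2].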
The LOC/ORG ambiguity mentioned on pg 9 col 2 para 2 is just classical metonymy - a well-known problem in NER (see e.g. Maynard et al. RANLP 2003 "Towards a semantic extraction of named entities").
What does Table 7 say about sentiment in skipped tweets? It's not clear - don't we need to see the general sentiment distribution as well, in order to know whether these distributions are anomalous/significant? I couldn't work it out from the table and description - may just be an expository thing.
In Figure 3: do you have some analysis of why Wordsmith entities are skipped so consistently? Even LOC doesn't manage to dip below the average across all datasets.
On page 12, col 2, para 1, the paper discusses the times taken under C1 and C2. The paper would be stronger if some kind of analysis or hypothesis was made about the reasons behind this observation: is it perhaps due to greater annotator confidence, rooted in how much guideline material they've read?
On page 13, col 2, para 1, I lost track at the end of the description of the experiment detailed in Table 13. Is it possible to give one or two worked examples of this process? The intuition from the current description doesn't fit the data. In any event, the IAA scores are so low as to suggest the data is purely noise - how did this happen? Why? How could it be remedied? And what is the impact on dataset utility?
Continuing with Table 13: how can raising the IAA threshold improve recall? Doesn't this only ever eliminate entity annotations? Removing annotations absolutely cannot lead to recall increases - unless we're removing them from the gold standard. It's really not clear.
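To make the concern concrete with hypothetical numbers (not taken from the paper): if the crowd output contains 120 annotations, 80 of which match the 100 gold entities, then recall = 80/100 = 0.80. Raising the agreement threshold can only discard crowd annotations, so the numerator can only stay at 80 or fall, and recall can only stay at 0.80 or fall with it. The only way recall can rise is if the denominator shrinks, i.e. the gold standard itself is being filtered by the threshold - if that is what is happening, it needs to be stated explicitly.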
The reference to the confusion matrices (Table 5) on pg 15 is too far away from the content - please fix the table locations and order.
Page 5 col 2 para 1: Facebook, YouTube, Instagram ought to be italicised.
Section 7.2.2: RQ1.1 expresses an expectation rather than a statement, no?
The finding at the end of the section's "Number of entities" paragraph is key, and to my mind one of the most important points of the paper. It's buried in here, though; I think more focus should be drawn to this section's analyses in the conclusion and abstract.
In the "Entities types" para: this doesn't seem to be written in light of the base distribution of ORGs in tweets; if they're the most common entitiy, then won't they be the most skipped anyway? Can't immediately see that this field effect is controlled for before making these observations. Perhaps a very clear, low-reading-age of the entity skipping process and experiments would help.
Page 16 col 1 para 2: "well mainly formed well structured" - word order problem? And there should be more vspace after this paragraph.
The Micropost text length section reported findings, but I hoped it would address the question of why this happens. Can we see perhaps how these tweets look in the UI used? Are results consistent in other crowdsourcing settings, e.g. the GATE Crowd system (Bontcheva et al. EACL 2014), or is there a known weakness of the experiments in that they use the same UI? If that's the case, it's OK, but must be stated. The findings of this paragraph to me look like an artefact of an HCI choice.
Page 17 col 1 para 1: "started 1 entities" - spurious 1? This all sounded alright, I think; 24 is nicely between 15 and 35, so there's no problem here - is there? "We took into account the responsive nature" - I didn't understand this section, only knowing the tool from its description in this ms.
Section 8 "Useful guidelines are an art" - this is the only novel paragraph in the discussion, above the ESWC paper, as far as I can see. That places a lot of the paper's novelty on this content. Can it be expanded into more than one point? There's a lot going on here
Conclusion: Implicitly named entities are mentioned at the top and foot of the paper, but get no real focus in its body. Can there be a section / subsection dedicated to them, including a definition - critical, as they seem to be introduced in the big, important parts of the paper, but aren't defined prominently in either this paper or the ESWC prior version.
Bibliography - first initials or full first names? Please be consistent.