Review Comment:
This paper presents a named-entity recognition (NER) approach based on conditional random fields (CRF) for Turkish. In their experiments, they re-annotate some of the existing datasets for additional entity types, and compare their performance to existing systems in the literature based on the published results.
The positive aspects of the paper can be summarized as follows:
- The paper is related to the SWJ special issue.
- As stated by the authors, this paper is an extension of an earlier version presented at COLING 2012. While the main contribution is still the CRF-based NER method, I think the extension has a fair amount of additional content, essentially in the experiments part. In particular, the authors extended some of the existing datasets with TIMEX and NUMEX annotations in order to repeat the experiments from their previous work. Furthermore, they evaluate their approach on User Generated Content (UGC), which is typically considered a different and more challenging setup for evaluating NER.
- The authors put solid effort into implementation and experimentation, and show that their system performs well.
On the other hand, there are still some major issues that need to be resolved:
1) First, I believe that, starting from the title, the general attitude of the paper is somewhat misleading: this is not a paper aiming to survey existing NER work on Turkish; rather, it proposes a new method (at least in the context of Turkish) and evaluates its performance against earlier approaches (also see below for some concerns on this point), as a typical research paper would do. So I really do not agree with the sentence in the abstract: “This article presents the state of the art in Turkish named entity recognition both on well formed texts and user generated content, and introduces the details of the best-performing system so far.” What this paper really does is propose a method and evaluate it (on well-formed texts and UGC), which is actually a reasonably adequate contribution. I strongly recommend that the authors frame and present their work along these lines.
2) Secondly, I would like to see at least a brief discussion of the use of CRFs for the NER task in other languages and their success as reported in the literature, and of whether anything is specific to Turkish when applying CRFs. It would also be nice to discuss previous results for languages similar to Turkish, i.e., in the family of agglutinative languages.
3) Third and most crucially, I think the comparison to earlier studies is rather superficial: apparently, the authors have not implemented any of the earlier approaches and simply make comparisons based on published results. While I agree that it is not reasonable to expect all previous approaches to be implemented, at least a couple of methods should be implemented as baselines. Furthermore, in the current state of the paper, it appears that most of the comparisons to published results involve differences in setup, datasets, annotations, etc.; it is therefore hard to draw a strong conclusion and present the proposed method as the “best performing one”, as this paper does.
Here are some specific examples of such comparisons that seem unconvincing or at least incomplete to me:
- Comparison to [18] (page 12): The paper states that “They work on ENAMEX, TIMEX and NUMEX entity types but they do not provide the scores for each of these. In order to be able to make a fair comparison between the two studies, we measure the performance of their system on our test data and calculate the overall ENAMEX performance (F-Measure) as 69.78% in CoNLL metrics and 74.59% in MUC TYPE metrics.” Why do you only measure the performance for the ENAMEX type, given that you also have annotated data for the other NE types?
- Comparison to [40]: “We use the same training and test data, so our results given in CoNLL metrics are fully comparable with this work.” But you should then compare to the results on the test set WFS3, right (since [40] did not use your newly introduced dataset WFS7)? Given the next sentence, “One should note that our performance before adding the gazetteers (89.55%) is still higher than her best result (88.94%)”, I suppose this finding is on WFS3 (based on the previous COLING paper), but this should be clarified.
- Comparisons for UGC datasets: First and foremost, there is a big confusion in the description of the training and test sets: In Section 6, the paper states that “Following the previous work [40,4], in all of the provided experiments, we used 440K tokens of the news articles [39] (Table 2) as the training set and the remaining 47K tokens as the test set (WFS) for well formed text domain.” Then the test sets are listed, which include both WFS3 and WFS7, as well as all the other UGC sets! So, is the training for the UGC case also done using the WFS set, and if so, is it WFS3 or WFS7 (presumably the latter, assuming that you extract additional NE types)? Otherwise, which training datasets are you using on your own UGC data (Tables 8 and 9) and on Tweets-2 and -3 (Table 11)?
Nevertheless, if you have used the newly annotated WFS7 dataset for training, or some newly annotated UGC dataset, then your training set is *always* different from that of other works, say [16] and [9], so how reliable are the comparisons in Section 6.2.2?
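As a side note on the CoNLL metric referred to throughout these comparisons: it gives credit only for exact-match entity spans (both boundaries and type must agree), which is stricter than MUC-style scoring. A minimal sketch of this phrase-level F-measure follows; the function and variable names are illustrative, not taken from the paper's evaluation code.

```python
# Minimal sketch of CoNLL-style (exact-span) NER scoring.
# Entities are (type, start, end) tuples; names are illustrative only.

def conll_f1(gold, pred):
    """Phrase-level F1: a predicted entity counts as correct only if
    both its span boundaries and its type match a gold entity exactly."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("PERSON", 0, 1), ("LOCATION", 5, 5), ("ORGANIZATION", 8, 10)]
pred = [("PERSON", 0, 1), ("LOCATION", 5, 6), ("ORGANIZATION", 8, 10)]
print(round(conll_f1(gold, pred), 4))  # → 0.6667 (the LOCATION span is off by one)
```

Under this metric, a boundary error such as the LOCATION span above receives no credit at all, whereas MUC TYPE scoring would still reward the correct type; this is why the two numbers reported for [18] (69.78% vs. 74.59%) differ.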
4) Last but not least, several experimental details are unclear, which may also be the source of some of the confusion discussed above. In particular:
a) For Table 4: please state clearly to what extent the gazetteer lists differ from those in your previous work [4].
b) Sec 5.1.6: “We provided our atomic features within a window of {-3,+3} and some selected combinations of these as feature templates to CRF++.” Please specify what exactly these combinations are.
c) In Tables 5 and 6, what is the evaluation metric; is it CoNLL? Please specify. Also briefly discuss the relationship between the F-measure and the CoNLL metric, as the two are compared to each other in certain cases. For instance, page 10 states “We also executed the same experiments with 10 fold cross validation and obtained an average F-measure of 91.53 with a standard error of 0.50.”, which is confusing, as a totally different setup is described together with results in the F-measure metric. Please elaborate.
d) Why are the comparisons in Tables 8 and 9 not uniform? Table 8 has the base model and a feature analysis, but Table 9 does not; at least the base model should be there.
e) Section 5.1.4: the paper states that “In this format, we use the labels such as “PERSON”, “ORGANIZATION”, “LOCATION” and “O” (other - for the words which do not belong to a NE) without any position information.” Is this sentence up to date, given that you are now also annotating TIMEX and NUMEX types? In general, I very strongly recommend specifying the training and test sets explicitly for each experiment reported in Section 6.
f) For the UGC datasets used in testing, is it guaranteed that re-tweets and other sorts of duplicates have been removed?
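To make the request in (b) concrete: CRF++ expresses atomic features and their combinations in a template file, so the paper could simply list the templates used. An illustrative (not the authors') template covering surface forms in a {-3,+3} window plus one sample combination might look like the following; the `%x[row,col]` macros address the current token's neighbors.

```
# Unigram templates: token surface forms in a {-3,+3} window
U00:%x[-3,0]
U01:%x[-2,0]
U02:%x[-1,0]
U03:%x[0,0]
U04:%x[1,0]
U05:%x[2,0]
U06:%x[3,0]
# Example combination template: previous and current token together
U07:%x[-1,0]/%x[0,0]
# Bigram template over adjacent output labels
B
```

Listing the actual templates in this form (or in an appendix) would make the feature set fully reproducible.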
Overall, I believe that this work has some merit. However, I recommend a major revision of the paper to a) frame the paper's contribution better, as proposing and evaluating a new approach rather than a survey, b) clarify the experimental setup and results, and c) provide a more careful comparison to the literature (i.e., either implementing some key methods as baselines, or comparing to published results more carefully by taking into account the differences in their setups and avoiding excessive claims).