Review Comment:
The manuscript “LegalNERo: A linked corpus for named entity recognition in the Romanian legal domain” is a data description paper. It describes the LegalNERo dataset, a manually annotated corpus for named entity recognition in the Romanian legal domain. The dataset is made available through Zenodo and through a SPARQL endpoint hosted by the research centre where it was developed. It is distributed in several formats, including BRAT, RDF Turtle and CoNLL-U Plus.
The paper follows the typical structure of a dataset paper (Introduction, Related Work, Annotation Process, Corpus Description, Using the RDF Version, Corpus Usage, Conclusions). All in all, the descriptions are sufficient to enable re-use and adaptation of the dataset.
There are, however, a number of issues with this data description paper.
First and foremost, while the title of the paper implies a certain relationship to the legal domain, surprisingly, neither the paper nor the dataset specifically tackles any aspect of NER in the legal domain, except for adding one entity category (“legal resources” or “legal document references”) to the typical person/organisation/location inventory. In the Related Work section the authors cite and acknowledge the many entity types that are specific to the legal domain, yet they decided not to introduce any of these specific types or categories themselves (except, as mentioned, “legal document references”). This decision needs to be explained and motivated, since the paper and especially the dataset claim a focus on the legal domain that does not exist beyond the source documents, which are indeed legal texts. In addition, the size of the dataset is rather small (370 documents with 265k tokens, 8k sentences and a total of 54k annotated tokens).
In various parts of the paper, “time” is mentioned as an entity type. Time expressions are simply time expressions, not named entities. Because explicit time expressions can be recognised easily with a small set of regular expressions (much like currency expressions), they are often implemented in NER tools, but it is simply incorrect to call them “named entities”.
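To illustrate the point that explicit time expressions are pattern-matchable rather than genuine named entities, here is a minimal sketch; the patterns and the example sentence are hypothetical and not taken from the paper or the dataset:

```python
import re

# A few illustrative patterns for explicit time/date expressions;
# a real system would need a far larger and locale-aware pattern set.
TIME_PATTERNS = [
    r"\b\d{1,2}:\d{2}\b",            # clock time, e.g. 14:30
    r"\b\d{1,2}\.\d{1,2}\.\d{4}\b",  # day.month.year, e.g. 21.03.2022
    r"\b\d{4}\b",                    # bare year, e.g. 1998
]
TIME_RE = re.compile("|".join(TIME_PATTERNS))

text = "The law entered into force on 21.03.2022 at 14:30."
print(TIME_RE.findall(text))  # ['21.03.2022', '14:30']
```

No gazetteer or annotated training data is needed for such expressions, which is precisely why labelling them as named entities is misleading.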
Some additional comments:
Page 1, line 27: suggest to change “international project” to “EU project”.
Page 1, line 29: please explain “comparable” in the context of this paper (or delete it)
Page 1, line 30: “7 languages” should be written as “seven languages”. There are other sentences in which single-digit numbers are written as numerals – these should all be spelled out as words (see, among others, page 2, lines 10 and 20).
Page 1, lines 38/39: “All these annotations were realised using automatic processes.” My understanding of the paper is that all annotations were performed by human annotators.
Page 1, 2, lines 45 ff.: The relevance of this paragraph in Section 1 is unclear. It should probably be moved into Section 2.
Page 2, lines 21 ff.: Section 2 is missing in the summary paragraph.
Page 2, line 23 ff.: Please include a link to the annotation guidelines or include the annotation guidelines in the dataset on Zenodo.
Page 3, lines 1/2: I don’t understand why hiding certain information about the annotation process helps with the computation of the inter-annotator agreement.
Page 3, lines 15 and 32: It’s “Cohen’s Kappa” (not Coehn’s Kappa)
Page 3, line 17: The Cohen’s Kappa of 0.87 is surprisingly low given that the annotation task is so simple. The revised metric (0.89) is still low, so where exactly are the actual disagreements?
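For reference, Cohen’s Kappa can be reproduced with a short sketch; the token-level labels below are hypothetical and are not the paper’s annotation data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b.get(l, 0) for l in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical token labels for two annotators
a = ["PER", "O", "LOC", "O", "O", "ORG", "O", "O", "PER", "O"]
b = ["PER", "O", "LOC", "O", "ORG", "ORG", "O", "O", "O", "O"]
print(round(cohens_kappa(a, b), 3))  # 0.661
```

Reporting a per-category confusion matrix alongside the aggregate Kappa would show exactly where the annotators disagree.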
Page 6: In Section 6 the development of various NER models is mentioned but the evaluation of these models is missing.
Finally, the paper is in need of a thorough round of revisions: there are many typos, missing words, etc.