LegalNERo: A linked corpus for named entity recognition in the Romanian legal domain

Tracking #: 3351-4565

Vasile Pais
Maria Mitrofan
Carol Luca Gasan
Alexandru Ianov
Corvin Ghiță
Vlad Silviu Coneschi
Andrei Onuț

Responsible editor: 
Harald Sack

Submission type: 
Dataset Description
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time expressions and legal resources mentioned in legal documents. Furthermore, GeoNames identifiers are provided. The resource is available in multiple formats, including span-based, token-based and RDF. The Linked Open Data version is available for both download and querying using SPARQL.

Solicited Reviews:
Review #1
Anonymous submitted on 17/Feb/2023
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the vocabularies used, ideally using the 5 star rating provided here. Please also assess the data file provided by the authors under “Long-term stable URL for resources”.
In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

Review #2
By Georg Rehm submitted on 08/Mar/2023
Minor Revision
Review Comment:

Thank you for submitting the revised version and for shortening the paper to a more reasonable length. Also thank you for taking care of the figures and other aspects I suggested in my previous reviews.

I only have a few minor remarks, which have to do with the presentation and length of the paper.

You appear to have followed my suggested cuts exactly, which has left quite a few very short lines of text that consist of only a single word or syllable, for example:

- Page 1, right column, line 36 ("tion 8.")
- Page 2, left column, line 7 ("processes.")
- Page 2, left column, line 19 ("gal entities.")
- Page 2, right column, line 17 ("NEs.")
- Page 3, left column, line 13 ("correct these mistakes")
- Page 3, left column, line 20 ("tators.")


For these and all other short or very short lines, I strongly suggest editing the corresponding paragraph so that it becomes a bit shorter and the short line is avoided. Two reasons: such short lines add to the length of the paper, and they should be avoided from a typographic point of view. This comment also applies to items in itemize/enumerate environments and to all other cases. In almost all cases, a simple reformulation of the corresponding paragraph will avoid the short line.

A related comment: footnotes take up an enormous amount of space (in the LaTeX class of the Semantic Web Journal, one footnote takes up 2-3 lines of text), so I'd suggest going through all footnotes one more time and deciding for each one whether it is really needed. If not, please delete it.

My second major comment relates to Figures 1, 2, 4, 5, 6 and 7 and Appendix A: instead of the basic font (Times New Roman), which is a variable-width font, please use a fixed-width font – Courier is the obvious choice. The current versions of the figures/listings are very difficult to read/decipher; using Courier will make a big difference to their readability and usability. Please also consider applying syntax highlighting (using colours etc.) as provided by the listings package.

Review #3
Anonymous submitted on 09/Mar/2023
Review Comment:

Thanks to the authors for submitting an updated version of the manuscript. The authors have already addressed most of the major concerns I had. Therefore, I would like to accept the paper, as it would be a significant contribution to the field of legal data for future research.