Review Comment:
This article explores the new linked data version of "JRC-Names", a multilingual dataset of entity names and their variants collected by the European Commission's Joint Research Centre since 2004. The variants were gathered through a (mostly) automatic extraction process applied to the online versions of printed media. The paper describes the representation model (based on lemon) and the other aspects of generating the linked data version.
The paper is very well written and structured. It is also well illustrated with examples and supported by relevant external references. The motivation discussed at the beginning is strong. In my view, the potential impact of this resource is high, and moving it into the Web of Data makes it even higher and more useful for the community. Especially interesting is the potential of this dataset for cross-lingual linkage and cross-lingual information access. The authors demonstrate a good understanding of the lemon model and the other underlying technologies. Here are some comments that I hope will help the authors improve the quality of the submission:
- The paper is of the "dataset description" type. The authors should cross-check the length restrictions for this type of submission (up to 10 pages, I think).
- I would better clarify the notion of "prior probabilities", introduced on page 5.
- In section 3.3, the authors say "This base name is not marked with a language..." but I do not see why not. If the language of the preferred label is known, reporting it can only be beneficial.
- In figure 2, why is there no "lemon:reference" relation between "jrc-names:Claude'owi_Junckerowi__pl#sense77" and "jrc-names:Jean=Claude_Juncker"?
- The model should be made available and dereferenceable online at http://open-data.europa.eu/jrc-names#
- Something I miss in the paper is a quality-oriented evaluation of the extracted names and variations. The authors do describe (section 2.2) some strategies to reduce noisy terms, for instance applying a threshold and filtering out those variations with a low frequency, but a quantitative measure of the improvement is not reported. I understand that this is not essential in this type of paper (and possibly this issue is more related to the general EMM framework), but adding a few lines about any quantitative evaluation already performed, or plans for future ones, would make the submission even stronger.
- The resource is freely available online and the relevant pointers are included in the paper. However, the entry describing JRC-Names at http://datahub.io should be updated, as it currently refers to the old MLODE'12 version.
- lemon:LexicalVariant is not correctly capitalised; it should be lemon:lexicalVariant. This has to be corrected both in the text and in figure 2.
- Some references (e.g., 23, 29) give the authors' complete names rather than initials, unlike the other citations.
- Finally, a very minor one: lemon is sometimes written in italics and sometimes not. I would recommend making this consistent and always writing it in italics.