JRC-Names: Multilingual Entity Name variants and titles as Linked Data

Tracking #: 1232-2444

Authors: 
Maud Ehrmann
Guillaume Jacquet
Ralf Steinberger

Responsible editor: 
Philipp Cimiano

Submission type: 
Dataset Description
Abstract: 
Since 2004 the European Commission’s Joint Research Centre (JRC) has been analysing the online version of printed media in over twenty languages and has automatically recognised and compiled large amounts of named entities (persons and organisations) and their many name variants. The collected variants not only include standard spellings in various countries, languages and scripts, but also frequently found spelling mistakes or lesser used name forms, all occurring in real-life text (e.g. Benjamin/Binyamin/Bibi/Benyamín/Biniamin/Беньямин/ بنیامین Netanyahu/Netanjahu/Nétanyahou/Netahny/Нетаньяху/ نتنیاهو ). This entity name variant data, known as JRC-Names, has been available for public download since 2011. In this article, we report on our efforts to render JRC-Names as Linked Data (LD), using the lexicon model for ontologies, lemon. Besides adhering to Semantic Web standards, this new release goes beyond the initial one in that it includes titles found next to the names, as well as the date ranges when the titles and the name variants were found. It also establishes links towards existing datasets, such as DBpedia and Talk-Of-Europe. As a multilingual linguistic linked dataset, JRC-Names can help bridge the gap between structured data and natural languages, thus supporting large-scale data integration, e.g. cross-lingual mapping, and web-based content processing, e.g. entity linking. JRC-Names is publicly available through the dataset catalogue of the European Union’s Open Data Portal.
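The lemon-based representation the abstract describes can be sketched minimally. This is an illustrative assumption, not the dataset's actual vocabulary: the base IRI, the `__lang` entry-naming convention (echoing the paper's `jrc-names:Jean_Claude_Juncker__it` example), and the exact property IRIs are all placeholders.

```python
# Minimal sketch of a lemon-style lexical entry for one JRC-Names variant,
# using plain tuples instead of an RDF library. All IRIs below are
# hypothetical placeholders chosen for illustration only.

LEMON = "http://lemon-model.net/lemon#"
JRC = "http://example.org/jrc-names/"       # placeholder base IRI
DBPEDIA = "http://dbpedia.org/resource/"

def variant_entry(entity, variant, lang):
    """Build triples for one name variant as a lemon LexicalEntry.

    entity  -- canonical entity label used to form IRIs (assumed scheme)
    variant -- the written representation found in text
    lang    -- language tag of the variant
    """
    entry = f"{JRC}{entity}__{lang}"
    return [
        (entry, "rdf:type", f"{LEMON}LexicalEntry"),
        (entry, f"{LEMON}canonicalForm", f"{entry}#form"),
        (f"{entry}#form", f"{LEMON}writtenRep", f'"{variant}"@{lang}'),
        # cross-dataset link, e.g. towards DBpedia as mentioned above
        (entry, f"{LEMON}reference", f"{DBPEDIA}{entity}"),
    ]

for triple in variant_entry("Jean_Claude_Juncker", "Jean-Claude Juncker", "it"):
    print(triple)
```

In the real dataset one lemon lexicon per language would group such entries, and the sense/reference link is what enables entity-linking applications.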
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Jorge Gracia submitted on 24/Nov/2015
Suggestion:
Accept
Review Comment:

This article explores the new linked data version of "JRC-Names", a multilingual dataset of entity names and variants collected by the European Commission's Joint Research Centre since 2004. The variants have been collected through a (mostly) automatic extraction process from the online version of printed media. The paper describes the selected model (based on lemon) and the other aspects related to the generation of the linked data version.

The paper is very well written and structured. It is also well illustrated with examples and supported by relevant external references. The motivation discussed at the beginning is strong. In my view the potential impact of this resource is high, and moving it into the Web of Data makes it even higher and more useful for the community. Especially interesting is the potential of this dataset for cross-lingual linkage and cross-lingual information access. The authors demonstrate a good understanding of the lemon model and the other underlying technologies.

In my view, the authors have addressed the reviewers' comments well and I find this work suitable for publication. However, I strongly recommend that the authors solve the issue of updating the dataset in datahub.io, either by creating an organisation or by asking the person responsible for the previous version to update it. In this way, metadata aggregators such as LingHub (http://linghub.lider-project.eu/) will make the updated dataset easily discoverable. Furthermore, the figure of the LLOD cloud (http://linguistic-lod.org/llod-cloud) is automatically built on top of the data available in datahub.

Finally, references should be cross-checked once again (e.g., in [29] author names do not follow the abbreviation pattern of the other references).

Review #2
By John McCrae submitted on 06/Dec/2015
Suggestion:
Accept
Review Comment:

This paper is significantly improved over the previous version. In particular, the linking is described in more detail and the paper has been given much more room, though at the cost that it is now much longer than this journal's guidelines for a dataset description. In general the paper is well written, so the extra length is still of value.

The linking described in the paper still covers only a very small part of the dataset: fewer than 100,000 links for 1.7 million entries. It would be good if the authors commented more on why only such a small portion of the dataset can be linked.

Minor issues:
=============

"Web of Data" (should be capitalized)
"As regards the Semantic Web" => "With regards to the Semantic Web"
"LexInfo" is the name of the resource (not "LexInfo2")
"72,5 million" (use period)

Review #3
By Gabi Vulcu submitted on 09/Dec/2015
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the used vocabularies; ideally using the 5-star rating provided here.

=============================

The major comments of the previous review were fixed. However there are a couple more minor details:

- the paper runs a little over 10 pages (excluding the references)
- there are some strange '/' characters at the beginning of each page

Section 3:
- paragraph 1: remove comma after "3.2),"
- before going into details about the classes and properties used to model JRC-Names as linked data, the authors should refer to Figure 1 earlier rather than later, so that one can look at the example.
Following an example while the model is being described is much more useful for understanding this representation. I recommend that the last paragraph of section 3 be moved to the beginning of section 3, or at least to the beginning of subsection 3.3.

- Figure 1:
- the figure in the printed version is barely legible; online, thanks to zooming, it can be followed.
- typo: jrc-names:Jean_Cluade_Juncker__it => jrc-names:Jean_Claude_Juncker__it
- the base variant concept is not exemplified in the figure. Will it be something like jrc-names:Jean_Claude_Juncker with no language associated?

- Table 1
- lack of consistency: in the text you use MEP and in the table you use "Talk of Europe"

- Section 5 also uses "Talk of Europe". Maybe it makes sense not to use the MEP acronym at all?


Comments

The task of identifying names within a text is known to be very difficult, and a good dataset such as JRC-Names is vital to a wide range of text-processing tasks that rely on named entity recognition. The linking of this dataset to other resources enables these systems to be easily extended to entity linking, a common task in Semantic Web systems, further improving its usability and the likelihood of third-party adoption. In addition, there is already work using this dataset for social media recognition (as part of http://languagemachines.github.io/mbt/) and other applications.