Remixing Entity Linking Evaluation Datasets for Focused Benchmarking

Tracking #: 1882-3095

Jörg Waitelonis
Henrik Jürges
Harald Sack

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper

In recent years, named entity linking (NEL) tools have primarily been developed as general-purpose approaches, whereas today numerous tools focus on specific domains, such as the mapping of persons and organizations only, or the annotation of locations or events in microposts. However, the available benchmark datasets necessary for the evaluation of NEL tools do not reflect this focalizing trend. We have analyzed the evaluation process applied in the NEL benchmarking framework GERBIL and all its benchmark datasets. Based on these insights, we have extended the GERBIL framework to enable a more fine-grained evaluation and in-depth analysis of the available benchmark datasets with respect to different emphases. This paper presents the implementation of an adaptive filter for arbitrary entities and customized benchmark creation, as well as the automated determination of typical NEL benchmark dataset properties, such as the extent of content-related ambiguity and diversity. These properties are integrated on different levels, which also makes it possible to tailor customized new datasets out of the existing ones by remixing documents based on desired emphases. Besides a new system library to enrich provided NIF datasets with statistical information, best practices for dataset remixing are presented, along with an in-depth analysis of the performance of entity linking systems on special-focus datasets.


Solicited Reviews:
Review #1
Anonymous submitted on 05/Sep/2018
Review Comment:

The paper at hand describes a remixing of entity linking datasets to gain more insights. In particular, the authors extended GERBIL and implemented their own adaptive benchmark generation to enable more fine-grained evaluations.

After reading the last cover letters, I mainly focused on the density metric as well as the soundness of the proposed mathematical notation. Those two aspects were addressed successfully. However, the mathematical appendix still needs some notations to be defined, e.g., E^W(s) in the dominance formulas.

The biggest issue in this version is the use of the News-100 dataset, which is a German dataset but was 1) tested against the English DBpedia endpoint, resulting in many unspecified resources (cf. Table 3), and 2) an analysis is missing at the end of Section 4.1 describing this unusual peak in Table 3.

The second biggest issue is the use of the GERBIL paper from 2015. Meanwhile, there is a new version from 2018, and I would like to ask the authors to update the numbers on page two accordingly.

Minor issues:
* P1, the survey on NEL is from 2015. Please pick a newer one, e.g., by Hogan et al.
* Please unify the citation style with and without authors: [7] vs. van Erp et al. [27]
* The link does not work (6th Sept. 18); fix it and implement monitoring to ensure that the demo is always up
* "E.g." at the beginning of a sentence should be "For instance" or "For example"

Overall, I liked the journal article a lot, especially the insights and provided resources, such as those on page 9, second column, second-to-last paragraph. The contributions are original and were extended over the base version at SEMANTiCS. The results and analytics are significant and will contribute to future research. I can imagine giving this paper to a new PhD student to get an overview of the challenges in the NER/NEL research field. The paper is well written except for two missing symbols in the mathematical appendix.
Thus, this paper should be accepted and published after the issues above are addressed.

Review #2
By Ziqi Zhang submitted on 10/Sep/2018
Review Comment:

The authors should be praised for taking significant effort to address reviewer comments, as I can see the quality of the paper has substantially improved. The research is substantive and I think some findings are really interesting. The experiments are also well designed and well explained. I want to highlight their improvement on the math notations; they are much clearer and easier to read now. I thank the authors for carefully addressing all of my comments and I think their efforts deserve to be recognised. I would like to recommend acceptance, subject to a number of minor issues to be addressed:

Page 3, 4th paragraph in the left column: '...with its character index i and the text length t...': can you clarify whether the length is measured in characters or words/tokens?

Page 6: equation 5 deserves an explanation of its intuition (1 or 2 sentences would be sufficient), e.g., can you say a bit about when the score is high/low? It is also worth reminding readers in the description that s comes from the annotation a, e.g., something like 'where s is the surface form from the annotation tuple a = (s, e, i, l)'. There is nothing wrong with the notation; it is just a little difficult to read, and I found myself having to go back to search for the relation between 'a' and 's' to understand the formula.

The same can be said for equation 6.

Page 6: equation 7 --> in the upper term \sum_{(s,e) \in d_a}{...}, perhaps you should write \sum_{a \in d_a}{...}, to be consistent with equations 5 and 6. Again, if you have reminded readers that a has the format (s, e, i, l), it should be fine.

The same can be said for equation 8.

There are many places where you should have used the ` character instead of ' for a left single quote.

Review #3
By Michelle Cheatham submitted on 10/Sep/2018
Review Comment:

The authors’ response to my comments was much appreciated. Regarding the specific issues:

I completely understand that the authors do not have the resources to manually annotate documents to show the utility of the annotation density metric. What I was encouraging was for the paper to mention the limitations of this metric and to make less strong statements regarding it. This has been done in the revision. I do still take issue with the statement related to the "not annotated" metric that "Therefore, empty documents should be excluded from evaluation datasets to enable a sound evaluation", based on the same concerns I had with the annotation density metric: it is possible that these un-annotated documents are not actually mistakes but rather do not contain any entity mentions. Without manual analysis it is not possible to be sure, so the authors should probably make a less strong statement, more of a warning than a direct recommendation to leave such documents out of the evaluation dataset.

Rather than the labels fair, unfair, and unfair 2, I think your intention might be clearer if you referred to them as low skew (fair), medium skew (unfair 2), and high skew (unfair). Your revisions to that section make clearer to me the point that you cannot draw any conclusions from the unfair dataset, though I am not totally convinced your data supports that: the systems' results on the fair and unfair datasets were in general quite similar, which seems to be an argument that this issue does not actually create problems for evaluation.

I think there is a small mistake with the notation in Section 2: it should be a = (s, e, i, t) – t instead of l for the last letter since you define text length as t in the description.

The paper is well written and in general easily readable, but there are still quite a lot of grammatical issues. Some of the problems that I pointed out in my last review are still present, and a few more have been introduced in the revisions. It might help to have a native English speaker proofread the paper.

Other than these fairly small, superficial issues, I feel that this paper is ready for publication.