Review Comment:
This paper describes an extension of the GERBIL framework that aims to evaluate the quality of entity linking datasets and, ultimately, to automatically create (by remixing) balanced datasets. This could be a valuable contribution to the community: on the one hand, entity linking is a very important task for the Semantic Web, and the problems with existing datasets discussed in the paper are valid; on the other hand, creating high-quality datasets would enable a balanced and thorough evaluation of newly developed methods. Unfortunately, the quality of the paper is not satisfactory for acceptance, for three reasons: 1) the research problem described in this paper has already been largely addressed in the authors' earlier work [20], and there is very little added value in terms of methods. The major development is a tool, and for this reason I do not think the originality or the significance of the results is good enough for a 'full research' paper. 2) The paper as currently written has too many issues, primarily in the definitions of the measures and the mathematical formulations, which are confusing and difficult to follow. 3) While the authors argue that the proposed measures have been implemented in a tool that can be used to create better-quality datasets, there are no experiments to support this. To address this, a significant amount of experimentation should be undertaken: for example, comparing a number of state-of-the-art systems on both the original datasets and the remixed datasets, to demonstrate the issues with the original datasets and to show that the proposed measures do indeed improve dataset quality.
Addressing these issues will require a significant amount of work, which, in my view, might be very challenging. However, considering the importance of this topic and its potential impact once these issues are addressed, I think the authors should be given another chance in the form of a major revision. Detailed comments below.
1. Originality and significance of results
On page 2, with respect to the novelty of this work, the authors state that '… the work in [20] is brought up-to-date and consolidated. ... extended with new additional dataset measures, a standalone library …. as well as a vocabulary to enrich ….'. However, it is not at all clear how much this work contributes to the *research* problem on top of the previous work. The library and the vocabulary are certainly interesting and useful, but as this paper is submitted as a full research paper, its novelty needs to be evaluated in terms of the methodologies that address the research problem, i.e., how to measure the quality of a dataset.
To begin with, it is not clear what the 'new additional dataset measures' are. A comparison with [20] suggests that the novelty is rather limited, as the main measures (not-annotated documents, density, prominence, confusion, dominance) have all been introduced before. Without experiments it is also not possible to evaluate how the three newly added measures contribute to addressing the research problem, and therefore how 'significant' this novelty really is.
To improve this, the authors should clearly identify the improvements brought by this work in the introduction, and back them up with empirical evidence (see point 3).
2. Quality of writing
The paper has some major problems with its quality of writing. There is a large degree of inconsistency in the usage of terminology and mathematical notation. For example, I cannot understand how exactly confusion and dominance are computed. The authors should substantially revise their mathematical notation and have it double-checked to ensure it is consistent and makes sense.
First of all, at the beginning of Section 2, define all the terms you will use in the following sections. What is a document, and how is it denoted? What is an entity, and how is it denoted? What is a surface form of an entity (and its notation)? What are the surface forms of the entire dataset? What is a dictionary, and is it the same for all datasets? What are all the entities and all the surface forms in the dictionary, and how are they denoted mathematically?
Next, is each of your measures applied to a dataset, a document, a dictionary, an entity, or a surface form? Something similar to Table 1 would help to clarify this.
I will now list confusing notations below on a page-by-page basis.
Page 3, Section 2.1: \mathcal{D} is a dataset and t is a document. But then what is D in Equation 1? (Note that \mathcal{D} is different from D.)
Page 3, Section 2.2: how is len(t) calculated? It should be defined here, not later on page 8 (Section 4).
Page 4, left column, bottom of the page: 'the dictionary know to the dataset containing the document is \mathcal{D}'. But just before, you said \mathcal{D} is a dataset; now it is a dictionary?
Page 4, right column, top: 'the overall set of all possible surface form is \mathcal{V}_e', where \mathcal{V} denotes a set of *surface forms*. But in the paragraph just before, you said 'the overall set of all possible *entities* for a surface form is \mathcal{V}_{sf}', where \mathcal{V} denotes a set of *entities*.
Page 4: I also cannot understand the relations between the dictionary known to the annotation, the dictionary known to the dataset containing the document (which document? does the dictionary differ depending on the document?), and the overall set of all possible entities for a surface form. Again, defining all these terms upfront and using examples would help to clarify.
Page 4: Equations 4 and 5 do not make sense. E.g., in Eq. 4 you are summing the number of e, given s \in W. First of all, W is never defined; I suspect you mean \mathcal{W}. Second, the condition has nothing to do with e, so it does not make sense to have e conditioned on s \in W.
Page 5, left column, last paragraph: '… the amount of surface forms used for one specific entity in the dataset e(D) ...'. This time e(D) is the number of surface forms for an entity, but earlier, in Eq. 1 on page 4, you said e(D) is the number of not-annotated documents. Also, perhaps you mean \mathcal{D} rather than D here, but given the significant level of inconsistency I cannot be certain.
Page 5, just before Eq. 6: 'the average dominance for an entire dataset is computed over all entities e \in D'. Fine, but then what is e \in E in Eqs. 6 and 7? What is E?
Page 5: the example about 'angelina_jolie' does not make sense given Eq. 6.
Page 6, first sentence: '… for a dictionary W and an entire dataset ... recall is defined as … where max recall:' is not a parsable sentence.
Page 12, Figure 10: the right axis is missing.
3. Empirical results
The paper lacks the empirical evaluation needed to back up its claims.
First, on page 3, Section 2.3, you say '… a power law distribution of the pagerank values over all entities is assumed ...'. I doubt this is the case. Did you check this using real datasets? If the distribution is not in fact a power law, this measure would not make much sense.
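As a side note, a quick sanity check of such a power-law assumption could look like the following. This is a rough sketch of my own (a linear fit in log-log rank-value space; `loglog_slope_r2` is a hypothetical helper, and a rigorous test would require, e.g., the Clauset-Shalizi-Newman methodology rather than this heuristic):

```python
import numpy as np

def loglog_slope_r2(values):
    """Fit a line to the log-log rank-vs-value plot of the given scores.

    For data following a power law, rank-ordered values are approximately
    linear in log-log space, so a high R^2 is a rough (not rigorous)
    indicator of power-law behaviour. Returns (slope, R^2).
    """
    vals = np.sort(np.asarray(values, dtype=float))[::-1]
    vals = vals[vals > 0]                     # log undefined for zeros
    ranks = np.arange(1, len(vals) + 1)
    x, y = np.log(ranks), np.log(vals)
    slope, intercept = np.polyfit(x, y, 1)    # least-squares line
    y_hat = slope * x + intercept
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, 1.0 - ss_res / ss_tot
```

Applied to the actual PageRank values of all entities in a dataset, a low R^2 would already falsify the assumption.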
Second, how do your proposed measures empirically help to remix new datasets of better quality? You mention in several places the 'correlation with precision and recall', but it is not at all clear which P and R are meant: P and R of which system(s)? On which datasets? To be convincing, I think you need to run a number of state-of-the-art systems on the set of original datasets and evaluate their P, R, and F1; then evaluate them again on a set of remixed datasets (remixed based on some rationale, which you need to define and justify) and compare the P, R, and F1. Significant changes in performance could mean that 1) the methods are sensitive to particular characteristics of some datasets (e.g., as you said, some datasets may be too easy); and 2) by remixing you changed the nature of the dataset, which makes the dataset better quality (but you must justify why) and the task harder. In summary, you need to carefully design your experiments, identify datasets that appear to be imbalanced according to your proposed measures, remix these datasets, and run experiments to observe and analyse the differences (if any).
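The comparison I have in mind can be sketched as follows. This is a minimal illustration of the suggested experiment, not the authors' method; `micro_prf`, `performance_delta`, and the (tp, fp, fn) counts are hypothetical names and made-up numbers:

```python
def micro_prf(tp, fp, fn):
    """Micro-averaged precision, recall and F1 from aggregate counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def performance_delta(original, remixed):
    """Change in a system's F1 when moving from the original to the
    remixed dataset; `original` and `remixed` are (tp, fp, fn) tuples.

    A large drop would suggest the original dataset was 'too easy'
    with respect to the characteristic the remix controls for.
    """
    return micro_prf(*remixed)[2] - micro_prf(*original)[2]
```

Running each state-of-the-art system through such a comparison, per remix rationale, would give exactly the kind of evidence the paper currently lacks.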
In your conclusion, you say 'according to our evaluation, the best suited datasets for … are ...'. What do you mean by 'suited'? Again, the experiments suggested above would make this more convincing.
Comments
Correlation & Rank Correlation Analysis
This is the table I referred to as an attachment. I am sorry that I am not allowed to provide it in better formatting.
| Filter | Babelfy | DBpedia Spotl. | Dexter | Entityclassifier.eu | FOX | KEA | TagMe 2 | WAT | AGDISTIS | Correlation | Rank Correlation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No Filter | 0.53 | 0.56 | 0.39 | 0.33 | 0.32 | 0.63 | 0.59 | 0.58 | 0.52 | – | – |
| Persons | 0.81 | 0.69 | 0.53 | 0.57 | 0.44 | 0.84 | 0.77 | 0.80 | 0.74 | 0.928 | 0.735 |
| Org. | 0.71 | 0.83 | 0.65 | 0.75 | 0.55 | 0.88 | 0.79 | 0.80 | 0.77 | 0.809 | 0.856 |
| Places | 0.77 | 0.82 | 0.57 | 0.55 | 0.54 | 0.78 | 0.81 | 0.80 | 0.75 | 0.963 | 0.782 |
| PageRank 10% | 0.68 | 0.76 | 0.50 | 0.48 | 0.39 | 0.79 | 0.74 | 0.75 | 0.63 | 0.979 | 0.912 |
| PageRank 10%-55% | 0.69 | 0.75 | 0.50 | 0.50 | 0.40 | 0.80 | 0.75 | 0.74 | 0.62 | 0.975 | 0.937 |
| PageRank 55%-100% | 0.72 | 0.70 | 0.48 | 0.46 | 0.36 | 0.81 | 0.74 | 0.75 | 0.63 | 0.981 | 0.923 |
| HITS 10% | 0.67 | 0.78 | 0.48 | 0.48 | 0.40 | 0.82 | 0.74 | 0.74 | 0.62 | 0.972 | 0.937 |
| HITS 10%-55% | 0.69 | 0.74 | 0.51 | 0.52 | 0.40 | 0.79 | 0.75 | 0.75 | 0.64 | 0.973 | 0.971 |
| HITS 55%-100% | 0.68 | 0.69 | 0.48 | 0.47 | 0.36 | 0.79 | 0.74 | 0.73 | 0.61 | 0.980 | 0.987 |
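For clarity, the 'Correlation' and 'Rank Correlation' columns above are, I assume, Pearson and Spearman coefficients between a dataset measure and the systems' scores. They can be computed as follows (`pearson` and `spearman` are my own minimal helpers, not from the paper; ties are ignored in the rank computation):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    argsort of argsort yields 0-based ranks; ties are not averaged
    here, which is fine for distinct values.
    """
    rank = lambda a: np.argsort(np.argsort(a))
    return pearson(rank(x), rank(y))
```

Reporting which measure each row correlates the scores against would make the table self-explanatory.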