ClassRank: a Method to Measure Class Relevance in Knowledge Graphs Applied to Wikidata

Tracking #: 1672-2884

Authors: 
Daniel Fernández-Álvarez
Jose Emilio Labra Gayo
Daniel Gayo-Avello

Responsible editor: 
Oscar Corcho

Submission type: 
Full Paper
Abstract: 
The use of collaborative knowledge graphs such as Wikidata or DBpedia has increased in recent years. Several kinds of agents, such as organizations, universities, or individuals, have fed those graphs with their own knowledge, producing massive stores of general-purpose data. There are many approaches that use the information contained in those initiatives to develop applications or to enrich their own data. Nevertheless, not all the topics available in those sources are covered with the same depth. The heterogeneous distribution of data quality means that only certain portions of those graphs offer information appropriate for use by third-party applications. We propose ClassRank, an algorithm based on aggregated PageRank scores which measures class relevance in RDF graphs, allowing the detection of the sections with the greatest potential for exploitation. We have successfully tested our proposal on Wikidata.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 12/Jul/2017
Suggestion:
Reject
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Concerning originality, I do not see a big difference from previous works. Although there is a new algorithm (ClassRank), the aim of the approach is to identify the relevant classes in KBs like Wikidata.

Concerning results, the conclusions do not provide unexpected findings. As the authors say: "unsurprisingly, we have found that Wikidata is a human-centered resource". The result of this work is a listing of the most relevant entities in Wikidata, which can be a subject of controversy (see below).

Concerning the writing quality, the paper has some verbose paragraphs that could be simplified and contains several typos (see below).

====
The authors claim in the abstract that ClassRank can "detect the sections with greater potential exploitation". I would like to know what kind of exploitation. Also, the abstract tries to relate ClassRank to data quality, but I do not see how this is done.

"Discovering the distribution of relevance [...] may indicate which portions of the graph are better candidates to be used as a reliable data source". I suggest authors to provide reasons about the relation between relevance and reliability.
In section 2.1 says "[...]link counts of the most used elements can be found. However, none of these metrics include reports of relevance of classes/topics". Most used elements is not a measure of relevance? why?.

In Section 2.3 ("aggregation of PageRank scores"), the discussion of algorithms similar to ClassRank is deferred to an "excellent review" paper. I would like to see a discussion of other algorithms (also based on PageRank variations) that aggregate PageRank scores, as ClassRank does.

It could perhaps be useful to describe in more detail the process that generates RDF from Wikidata.

In Section 5.2, as the authors themselves notice, the values selected for the security threshold and the damping factor are arbitrary. I would require a study of these parameters in the paper.

In Section 6.2, the second paragraph is too dense. I recommend a detailed revision to enhance comprehensibility.

Possible typos
---------------

- "example os this" --> of
- YAGOO --> YAGO
- "to shut down its own project an to offer offer Frebases's" --> "to shut down its own project and to offer Frebases's"
- instaNCE --> instance
- "the interest of not introducing" --> "order to avoid"
- which Dewey --> each Dewey

Review #2
By Giuseppe Rizzo submitted on 31/Aug/2017
Suggestion:
Major Revision
Review Comment:

This paper proposes ClassRank, an algorithm to rank classes according to the importance of the linked resources of a knowledge graph. The authors have presented an experimentation of this algorithm on the Wikidata graph and they report the resulting ranking of classes. The algorithm is presented in a good and sound scientific fashion. A demo is publicly accessible and the source code has been shared publicly.

The algorithm is declared to be agnostic to the graph, an assumption that is not verified experimentally, but this is acceptable since ClassRank is based on the aggregated PageRank scores of the entities (resources with URIs) linked to a class.

The authors refer to class relevance in the paper title, which gave me the wrong expectations while reviewing, and I ultimately found this notion not scientifically demonstrated and supported.
Relevance for what? Relevance for whom?
The concept of relevance is usually dependent on the application domain and usage (we usually say that something is relevant when it is pertinent for something). For instance, when referring to Cycling, a relevant class is one skewed towards linking facts and concepts that document Cycling as a sport more extensively and in greater depth. This paper, instead, seems to target the concept of importance, i.e. a class that is important (or significant) in a given graph. In addition, the claim proposed in the paper is that ClassRank is adaptable to any domain, since the domain is actually transparent: the generation of the ranking is based purely on the weight of entities in the entire graph. I thus suggest using "importance", which is also underlined in Sec 3.

Since the notion of a class is often not explicitly and universally defined in terms of semantics in a graph, the authors have also attempted to cope computationally with the identification of classes (this is presented as RQ1). The approach is based on a first step of graph pruning of statements with literals, followed by a simple count strategy over the classes.

In Sec 1 the authors set the scene and list the research questions and their implications in current research trends. I recommend moving the last part of the section, where future work is listed, to Sec 7.

Sec 2 reports existing research endeavours in computing the relevance of entities in graphs and queries, then gives further pointers to the usage of aggregated PageRank scores, and concludes by giving an informative overview of Wikidata.
There is a bit of ambiguity around the term relevance: the authors report on past activities exploring both centrality and relevance, but they do not refer to importance. This needs to be addressed.
There are some comparisons with state-of-the-art approaches that need to be further worked out, e.g. "usage of these scores are quite different" (how?) and "none of these metrics include reports of relevance of classes/topics" (does this paper, instead?).

Sec 3 lists the definitions of class and class-pointer (i.e. a predicate whose object, in a given statement, is a class).

Sec 4 presents both the ClassRank algorithm and the approach for detecting class-pointers. For the latter, the underlying principle is based on property frequency and the importance of the object. There is a dropout mechanism for the statements that have literals as objects. The idea is interesting, but the paper lacks detail on how to assess these computationally. Manually assigned? Empirically verified and tuned? Knowledge-graph dependent? A proper formalization of how sigma_c should be computed is needed.
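For instance, a formalization of the selection rule along the following lines would already help (this is only my own sketch of what I understood from the paper; the notation TopK for the set of top-ranked entities is mine):

isClassPointer(P) \iff |\{ (s, P, o) \in G : o \in TopK \}| \geq \sigma_c

together with an explicit criterion, or at least a justification, for the value of \sigma_c itself.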
Concerning the ClassRank algorithm, please verify Algorithm 2: if I interpreted it correctly, in lines #4 and #7 it should be P'.
In Algorithms 1 and 2, please mention the range of lambda.
One of the strongest points of PageRank is also a peculiar problem: it favors very popular resources. This is acceptable in static or semi-static domains, but not when the domain is subject to frequent changes (as happens in Wikipedia) and importance can no longer be based only on popularity. The proposed algorithm fails to cope with such a case, so I wonder whether the authors consider this a threat and how they would address it.

Sec 5 presents a quantitative evaluation of the selection of properties and of the ranking obtained by the algorithm. This is a first step towards understanding the usefulness of the approach. However, it is missing an exhaustive quantitative evaluation of both the class identification and the ranking over a standard benchmark.
The reported statistics are surely helpful, but they do not help in stating whether this approach is precise or not.
There is also no baseline, which would be needed to understand the limitations of current solutions. The authors should aim to address this point.

Sec 6 provides an output analysis of the algorithm, as a first step towards a qualitative evaluation. The level of detail is fair; however, the authors should provide more insights on the errors produced by the algorithm. This can complement the quantitative analysis with a reference.

(1) originality

The methodology extends PageRank to classes by inheriting the popularity of entities. This approach is incremental work over previous innovations. The authors refer to other research activities around the topic of centrality and relevance, but these are not discussed.

(2) significance of the results

The results lack a comparison with state-of-the-art approaches, a quantitative analysis against a benchmark reference, and a deep qualitative analysis.

(3) quality of writing.

The paper is well structured and easily readable. I encourage the authors to proofread the text a few times to make it lighter and more comprehensible. I also recommend sticking to scientific writing conventions: capital letters when pointing to Sections, Algorithms and Tables, the singular form rather than plurals (figures -> Figure X), and decimal points rather than commas in digits (0,3 -> 0.3; see Figures 2 and 3).

Overall I recommend a major revision.

Review #3
Anonymous submitted on 01/Nov/2017
Suggestion:
Major Revision
Review Comment:

This paper presents ClassRank, an algorithm, based on PageRank, for calculating the relevance of a class in a given knowledge graph. In their work, the authors aim to answer two research questions: (a) which is the most accurate way to detect classes in an RDF graph?, and (b) how can the relevance of those classes in a given KG be measured? An experiment on Wikidata is used for evaluating the results.

The paper is well-structured and the writing is clear. It addresses an interesting problem that is relevant to the Semantic Web community, and the problem is clearly described. The originality of the paper lies in the adaptation of the PageRank algorithm to a slightly different problem: detecting classes and measuring their relevance. The paper provides a good discussion of the reasons why the original PageRank algorithm does not work for the given problem. The paper presents interesting results, but the main weakness is how the results are evaluated. The design of the experiment is quite weak. The paper would improve a lot if the experiment were designed with the main hypotheses of the paper in mind and if the results were evaluated properly, in a statistically significant manner. The quality of the writing can be improved.

I suggest a "Major Revision" since I believe that the authors' approach is interesting for the research community, but the experiment design and evaluation of the results have to be improved.

Detailed comments are listed below.

Introduction:

(1) In the introduction, the authors relate their work to the completeness dimension of Linked Data quality. However, the rest of the paper focuses on how relevant a given class in a KG is (which is the topic of the paper) and does not discuss or relate to completeness. Though this is completely OK, I wonder if the authors have thought about how ClassRank relates to completeness.

(2) I would recommend rephrasing the third paragraph (In the universe of RDF graphs …) to unify the terms used in that paragraph (e.g., concept, URI, element, topic, content, link, subgraph) with the RDF terminology in the W3C spec [1]. At the moment, I am not sure if the terms are used with the same semantics. For example, when the authors say any element can be considered a topic, it is not clear if they mean all IRIs, blank nodes, and datatyped literals. Similarly, when the authors talk about the concept of "class", it would be interesting for the readers if the authors explicitly related it to rdfs:Class and/or owl:Class, for example whether they mean exactly the same or a broader/narrower concept.

(3) The research questions are clearly stated in the paper, and it would be useful for the reader if the authors also stated their hypotheses explicitly, maybe in a later section of the paper. It would help the readers to grasp the basis of the proposed approach.

Related Work:

(4) Is there any existing work related to RQ1, i.e., which is the most accurate way to detect classes in an RDF graph?

(5) I find it odd to say "Wikidata contains more than 26M URIs"; isn't it better to say entities or resources instead?

Algorithm Description:

(6) Regarding the organization, I wonder why class-pointer detection was not incorporated into the algorithm directly but was rather used as an input. In the current flow of the algorithm, the PageRank calculation (Stage 1) has to be done before the preprocessing stage, as Algorithm 2 uses it to filter the top K relevant entities, right?

(7) In the preprocessing stage, the authors use a straightforward algorithm to detect class-pointers under the hypothesis that if there is a significant number (the class security threshold) of triples that comply with the pattern [?, Px, Oy], where Oy is a top K relevant entity according to PageRank, then Px is a class-pointer. Once all the class-pointers are detected, a manual check is proposed to eliminate false positives. The authors acknowledge false negatives but argue that they are only caused by low-ranked entities and are thus not so important. However, I believe there could also be false negatives because the proposed hypothesis is not always true. One obvious case would be a class-pointer that does not reach the class security threshold, either because it is less frequent in the KG (not many [?, Px, ?] triples) or because it typically points to diverse classes ([?, Px, O1], [?, Px, O2], [?, Px, O3], …). Thus, I recommend exploring a bit more how to handle false negatives. A small sketch of the detection heuristic, as I understand it, follows this comment.
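For illustration only, this Python sketch reflects my reading of the detection heuristic (function and variable names are mine, not the authors'; it assumes triples with literal objects have already been removed):

from collections import Counter

def detect_class_pointers(triples, pagerank, k, security_threshold):
    # Candidate classes: the top-K entities by PageRank score.
    top_k = set(sorted(pagerank, key=pagerank.get, reverse=True)[:k])
    hits = Counter()
    for s, p, o in triples:
        if o in top_k:  # triple matches the pattern [?, Px, Oy] with Oy a top-K entity
            hits[p] += 1
    # A property becomes a class-pointer candidate when it points to top-K
    # entities at least 'security_threshold' times; manual validation follows.
    return {p for p, count in hits.items() if count >= security_threshold}

As the sketch makes explicit, a property whose hit count stays below the threshold is silently discarded, which is exactly where the false negatives described above would arise.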

(8) Aren't Algorithm 2 and Stage 2 of Algorithm 1 following almost the same logic, selecting properties in one case and classes in the other, and operating on almost the same matrix (except for the filtering of relevant entities)? Isn't it possible to perform both in the same iteration?

(9) Even though there is an informal description of Stage 3, i.e., calculating the ClassRank by aggregating the PageRank scores of the instances of each class, it would be useful to the reader if lines 14-20 of the algorithm were explained in the text.
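In case it helps, my reading of that aggregation is the following sketch (the notation Inst(c), for the set of entities pointed to the class c through a class-pointer, is mine):

ClassRank(c) = \sum_{e \in Inst(c)} PR(e)

where PR(e) is the PageRank score of entity e; i.e., the score of a class is the total PageRank mass of its instances.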

(10) The authors provide a good explanation motivating the decision not to apply inferencing. However, similar reasoning is not provided for other decisions, for example not using the PageRank of the class node itself when calculating the ClassRank of a class, but only the PageRank of its instances.

(11) Though I agree on the similarities between the Web of documents and the Web of Data, I would recommend rephrasing the interpretation of ClassRank to be more oriented towards knowledge graphs: "the probability a random surfer to land on any of the instances".

Experiments:

(12) In general, it would be useful if the hypotheses evaluated by the experiments were explicitly stated in the experiment design. This would increase the readability of the paper and also help to evaluate the significance of the results.

(13) In the adaptation for the Wikidata experiments, on the one hand, the issue of triples involving literals is a generic one that is not specific to Wikidata, isn't it? Can it be made part of the algorithm? On the other hand, the reification issue is rather specific to Wikidata. Do the authors plan to ignore blank nodes in all cases or only in Wikidata? According to the text, the reified triples were completely excluded. Isn't it possible to transform them into a non-reified form and include them?

(14) Since it was claimed in Section 4.1 that there are significant rates of false positives and false negatives in the classes detected in Wikidata using the "instance of" and "subclass of" strategy, it would be useful to include some statistics on how common they are in the reference class list mentioned in Section 5.2.

(15) According to the results, the algorithm has a precision of 19.93% in the automatic phase, but this is rectified by the manual validation phase. Nevertheless, since there is no mechanism to identify the false negatives in the experiment, it is hard to get an idea of the recall of the algorithm. I believe this is a big drawback of this experiment setting.

(16) It is not clear from the paper how the security threshold values were chosen. Does this depend on each knowledge graph? Can it be learned somehow? This seems to be another drawback of this experiment.

(17) The goal of the second part of the experiment is not clear. By comparing to other approaches, it is shown that ClassRank detects more classes. However, this does not necessarily mean that the detected classes are correct. A broader evaluation is needed than only checking the top 25 results against the Dewey Decimal Classification.

(18) The different approaches selected for comparison are all, in a sense, partial steps of the proposed method. Are there any other methods in the state of the art that can be used for comparison? Could the classes listed by Wikidata itself be used as the baseline?

(19) How the average relative ranking is used in the experiment has to be explained in more detail. What is the hypothesis that is evaluated in this step of the experiment?

Conclusion and future work:

(20) It would be useful to discuss the current limitations of the algorithm, for example the lack of mechanisms for determining the optimal thresholds, etc.

