Review Comment:
This paper tackles the problem of finding, given two entities x1, x2, the most significant paths connecting them in a KG, so that a user is able to understand what relates the two entities. The main contributions of the paper are two algorithms to determine the paths between the two entities, some metrics to rank the paths found, and a user study that compares a tool based on the introduced techniques with existing ones.
Note: I'm not an expert in the field; in particular, I'm not aware of the state of the art. I reviewed this paper as someone seeing this problem for the first time but having expertise in the field of the Semantic Web.
Usual dimensions for reviewing the paper:
(1) originality, (2) significance of the results: Taken individually, the results do not represent a breakthrough, but as a whole they give a nice contribution to and overview of the problem tackled.
(3) quality of writing: the paper contains many mistakes (typos, problems with the definitions). On the other hand, it is well organised and didactically well presented.
Main points:
- Please take care of integrating the comments below in a new version of the paper. Some things are difficult to understand, and I sometimes found errors in the definitions (which definitely doesn't help). In general, E4S is not well introduced, neither in the introduction (Section 1) nor in Section 4. In particular, it was not clear to me at the beginning that you still fix two entities. Please also revise Section 4.1, in particular the rationale behind your measures.
- In Sections 2 and 3, you ignore schema properties like type, but you do not state this clearly. Right? Moreover, the query patterns defined in Section 3 do not take this into account. I guess that a huge performance gain can be expected by filtering out (excluding) rdf:type and wd:P31 edges, since these are generally very dense (e.g. those pointing to dbo:Person or wd:Q5).
- The tool you created is not online. I see this as a big problem, since your approach is not easily reproducible. Please either make it available as a web service or put the code on GitHub.
- The evaluation has a big drawback. You report all performance metrics on the client side, but most of the work happens on the remote SPARQL endpoints. And you are not using a SPARQL endpoint on a small infrastructure, but exploiting the DBpedia and Wikidata endpoints, which are designed to handle a lot of requests. The best thing would be to repeat the experiment on a local endpoint with some reasonable infrastructure and then collect the performance metrics. If this is not done, it must at least be critically discussed in the paper.
- You describe that you are only using SPARQL queries and consider this a big plus. Surely it has its advantages, but the big disadvantage is performance. I would also clearly state the downside, i.e. that with suitable indexes you can have response times in milliseconds. For me this is possible future work. SPARQL is known not to perform well on typical graph search operations like breadth-first search.
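To illustrate the point about indexes: the following is a minimal Python sketch, not taken from the paper, of a breadth-first path search over an in-memory adjacency index that drops the dense typing edges while the index is built. The function names, the list of skipped predicates and the toy triples are my own assumptions for illustration; a real comparison would of course need the actual DBpedia or Wikidata data.

```python
from collections import defaultdict, deque

# Predicates assumed to be dense schema-level edges; the list is illustrative.
SKIPPED_PREDICATES = {"rdf:type", "wd:P31"}


def build_index(triples):
    """Build an undirected adjacency index: node -> list of (predicate, neighbour)."""
    index = defaultdict(list)
    for s, p, o in triples:
        if p in SKIPPED_PREDICATES:
            continue  # drop dense typing edges up front
        index[s].append((p, o))
        index[o].append((p, s))  # a path may traverse an edge in either direction
    return index


def shortest_path(index, source, target, max_len=3):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 >= max_len:
            continue  # do not expand beyond max_len edges
        for _, neighbour in index.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Toy data, invented for illustration only.
triples = [
    ("ex:A", "rdf:type", "ex:Person"),
    ("ex:B", "rdf:type", "ex:Person"),
    ("ex:A", "ex:directed", "ex:F"),
    ("ex:B", "ex:starredIn", "ex:F"),
]
index = build_index(triples)
path = shortest_path(index, "ex:A", "ex:B")
```

On the toy data the two entities are connected through the film ex:F and never through the shared class ex:Person, which is the behaviour I would expect the paper to state explicitly.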
Minor Points
Abstract:
- First sentence, what do you mean by “searching activities”? In the KG, or in web search using a KG? The same for browsing?
- Second sentence, you motivate relatedness before saying what it is in the next sentence. Moreover, the motivation is not so clear; especially “in a variety of KG” sounds like multiple KGs at the same time
- Before introducing E4D and E4S you should say how relatedness is coupled with paths
check diversity!
1.
- KG that are mantaining
- “an even large number”? Is this correct English?
- extracting knowledge from KG has applications in Twitter analysis? I think extracting is the wrong word here.
- One common-need for many knowledge discovery tasks is the explanation of relatedness between entities. The original sentence is not correct English
- fig 1 strange order of letters
2.
- Definition 1, triple pattern, not triple
- why did you not introduce a definition for KG and schema closure?
- In Definition 2 of the schema graph closure you start from the schema closure S. S already contains the inferred triples, so why do you write (asserted or “inferred via RDFS reasoning”)? The same thing in the example? In Fig 2a you have the schema, not the schema closure.
- I’m not an expert in RDF formalization. Did you make the distinction between KG and Knowledge Base yourself, or can you cite a paper defining it like this?
2.1
- the objective is to tackle the problem of explaining knowledge in KG, too fuzzy
- output, you speak about relatedness explanation, but it is not clear what it is at that point
2.2
- after definition 6, we are now ready “for”
- does the community agree on this definition of explanation, especially of minimal explanation, or are you the first to formalise it?
3
- example 8, the third pattern is wrong
- figure 8, line 3 createThread, function called pathThread
- the algorithm leverages a monitor class an instance “of” such a class
- definition 10, definition 11, what you call schema graph here was called schema graph closure before
- definition 11, is this correct? I mean the domain and range of EVERY property should not be equal to p*
4
- for a long time it was not clear to me that you fix both the explanation pattern and the source and target entities. I thought you were only fixing the first one, and I found that weird. So be clearer about that at the beginning.
4.1
- write that PF = predicate frequency, IPF = inverse predicate frequency
- triples of the form, commas are needed to interpret correctly the boolean expression
- I would also describe in a sentence what this measure roughly captures, i.e. PF = the frequency of both pi and pj in the dataset, IPF = ?????
- to build a co-occurrence matrix “with entries”:
- the paragraph is too abstract, I’m still not sure if I completely got the measure
- theorem 12, I’m not sure this is interesting. It is not a theoretical work, I would say the times in the experimental setting are more important
4.2
- the algorithm for each of the most related predicates (line 4) “runs” in parallel
- length 1 explanation patterns are expanded (line 11-25 ???? not 19????)
Algorithm 1
- Ts subscript
- line 33 tau, not defined
4.3
- On the other hand, the last pattern (maybe change the word pattern, otherwise one thinks about the explanation pattern)
- definition 13, t_{x} and t_{y} depend on i!
5
- Definition 14, Given G=(V,E,T) -> what is G? Note that you do not have a definition for KG
- Definition 17 -> no relation between \eta and p!, you need to describe \eta
- Definition 19 -> you do not define Label, even if it is clear what it is
5.3.1
- People and films -> persons and films
6.1
- the tool and the dataset are available upon request -> I really dislike this. I mean, either the code is published on GitHub or the tool is available as an online service. The scientific community must be able to easily reproduce and use in some way what you are doing.
- path and then merged -> are then merged
6.2
- usage for querying KGs. -> I find the naming very weird, please change it
- described in Section 6.2 -> section 6.2.1
7
- that it is a MacBook Pro is not important for the specs ; )
7.2.1
- Again the same question: do you traverse rdf:type and wd:P31 properties? If you do, you will lose a lot of time there, I guess
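To be concrete about what I have in mind, here is a hedged Python sketch, again not from the paper, that builds a two-hop SPARQL path query whose predicates may not be rdf:type or the Wikidata "instance of" property. The function name and the two-hop query shape are assumptions of mine, not the paper's query patterns. I use full IRIs to avoid prefix declarations; note that in the Wikidata RDF model the direct property is usually written wdt:P31. Whether the filter actually pays off depends on the endpoint's query optimizer, so it would have to be measured.

```python
# Predicates to exclude; full IRIs so the query needs no PREFIX declarations.
EXCLUDED = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.wikidata.org/prop/direct/P31",
]


def two_hop_query(source, target, excluded=EXCLUDED, limit=100):
    """Return a SPARQL query for paths source -?p1-> ?mid -?p2-> target."""
    banned = ", ".join(f"<{iri}>" for iri in excluded)
    return (
        "SELECT ?p1 ?mid ?p2 WHERE {\n"
        f"  <{source}> ?p1 ?mid .\n"
        f"  ?mid ?p2 <{target}> .\n"
        # SPARQL 1.1 NOT IN keeps only bindings whose predicate is not banned
        f"  FILTER (?p1 NOT IN ({banned}) && ?p2 NOT IN ({banned}))\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Example entity IRIs are placeholders, not taken from the paper's experiments.
query = two_hop_query(
    "http://example.org/entity/A",
    "http://example.org/entity/B",
)
```

The resulting string can be sent to any SPARQL 1.1 endpoint; only one edge direction is covered here, so the other direction combinations would need analogous queries.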
8.1
- graph via path ranking ans -> and
8.2
- of defining of a -> defining a
Cordially
Dennis Diefenbach