Review Comment:
The paper presents an interesting approach to measuring the temporal evolution of an RDF graph, based purely on the data contained in the graph and not on other artifacts, e.g., an ontology. The approach is based on frequent pattern mining and on using the mined patterns to compress the graph. Most of the contribution of the paper is presented in Section 4, where methods for measuring structural similarity are presented.
I find the idea and approach noteworthy and original, but the experimental evaluation is not very convincing and the quality of writing is quite poor. Detailed remarks are presented below.
Abstract:
* It seems to me that there are three names for the same thing: “semantic web repository”, “base”, “knowledge base”. Please select one and stick with it throughout the whole paper.
* “Semantic Web” is possibly not the best keyword when submitting to the Semantic Web Journal.
Section 1:
* Page 1: A movie application will indeed expect some structure in the graph, but this is a completely different problem: the application must have some assumptions, which should be formalized (e.g., as SHACL constraints) and used to test a data source before it is used. A few sentences later the authors remark that the vocabulary should be seen as a statistical distribution, but this only underlines the problem: if 80% of movies have a release date, then the remaining 20% do not, and, when processing them, the functionality of the movie application is degraded.
* Page 2, left column, lines 19-32: I find it confusing to consider “updates” and “snapshots” as local and global view of the same thing. To me, a snapshot is a description of a fixed state, while an update is a description of a process, a transition between two fixed states. Now both of these can be defined on different granularity levels. In particular, in my opinion, Definition 5 presents an update as two local snapshots: before the update and after the update.
* Page 2, right column, lines 13-15: “our proposal can discover those implicit categories”. I find this to be a clear reference to the area of ontology learning, particularly to finding new subclasses. Please discuss works related to this topic, e.g., Jedrzej Potoniec, Agnieszka Lawrynowicz: Combining Ontology Class Expression Generation with Mathematical Modeling for Ontology Learning. AAAI 2015: 4198-4199.
* Page 2, right column, lines 25-26: “consolidated versions of a KB (snapshots)” → “snapshots (consolidated versions of a KB)”
Section 2.1:
* Page 3, left column, lines 23-24: “supp(…) = 3, which means that”. It should be the other way around: “{a, b} is supported by …, which means supp(...)=3”
Section 2.2:
* “Cover and Usage”: This paragraph is unclear and relies too much on the reader's intuition. There should be formal definitions of the introduced concepts as well. Moreover, it is not clear to me why a greedy approach to covering is used.
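For reference, the greedy cover used in Krimp-style approaches can be pinned down in a few lines; the following sketch (in Python, with hypothetical names, assuming the paper follows the standard Krimp cover) shows what a formal definition of cover and usage would have to specify:

```python
def cover(transaction, code_table):
    """Greedily cover a transaction: scan the code table in cover
    order and take each pattern that fits the still-uncovered items.
    Singleton patterns are assumed present, so the cover succeeds."""
    remaining = set(transaction)
    used = []
    for pattern in code_table:
        if pattern <= remaining:
            used.append(pattern)
            remaining -= pattern
        if not remaining:
            break
    return used

def usages(database, code_table):
    """usage(X) = number of covers of D in which X occurs."""
    counts = {p: 0 for p in code_table}
    for t in database:
        for p in cover(t, code_table):
            counts[p] += 1
    return counts

# Toy database and code table (patterns listed in cover order).
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c"}]
ct = [frozenset({"a", "b"}), frozenset({"a"}), frozenset({"b"}),
      frozenset({"c"}), frozenset({"d"})]
u = usages(db, ct)
# usage({a,b}) = 3, usage({c}) = 2, usage({d}) = 1
```

Making the scan order explicit in the paper would also clarify exactly where the greediness matters.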
* “Code Length”: L(code(X)) is defined as a logarithm, which is a real number. It is thus hard to expect that there will be an actual code of such length. It is also not immediately obvious in what sense this is a correct definition. Consider a database of 65 transactions: 64 transactions {a} and 1 transaction {b}. Then usage({a})=64, usage({b})=1 and, assuming binary logarithms, L(code({a}))=-log(64/65)=0.02 and L(code({b}))=-log(1/65)=6.02. Neither of these lengths makes sense to me, as I can neither encode {a} using 0 bits, nor do I need 6 bits to encode {b}. I admit I may be mistaken in my reasoning here, but if so, this should also act as an indicator that perhaps this section needs some extending.
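The arithmetic in question can be reproduced directly (a Python sketch; the variable names are mine). The usual resolution is that these are idealized, real-valued Shannon lengths: no prefix code assigns a 0.02-bit codeword, but an arithmetic coder can approach the total length in aggregate. If that is the intended reading, the paper should say so explicitly:

```python
from math import log2

total = 65                       # 64 transactions {a}, 1 transaction {b}
L_a = -log2(64 / total)          # ~0.0224 bits, the reviewer's 0.02
L_b = -log2(1 / total)           # ~6.0224 bits, the reviewer's 6.02
total_bits = 64 * L_a + 1 * L_b  # ~7.45 bits for the whole database
```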
* L(CT|D): “D” is irrelevant here, as the length of the code table does not depend on the database it was derived from. It is also not immediately clear to me why there is L(code_SCT(X)) in the sum. If I understand correctly, it is possible for X\in CT to contain more than one element (|X|>1), and then, by the definition of SCT, X does not belong to SCT and code_SCT(X) does not exist.
* Table 1: using colors as the sole way to distinguish different codes works only as long as the reader's color vision is unimpaired and the medium supports color. This is particularly true for SCT, where all the codes are black and it seems that the pairs {a}/{b} and {d}/{e} share the same code, whereas, if I understand correctly, they share the same code length, but not the same code. Please provide an additional way of encoding the distinction that behaves well when printed in black and white.
Section 3.1:
* Page 4, right column, line 29: “triples” → “properties”
* Definition 3: at this point it is unclear how the sets C and P are exactly defined. The main difference is between “intensional” definition C={c|(c, rdf:type, rdfs:Class)\in B} and “extensional” definition C={c|(*,rdf:type,c)\in B}. This becomes clear only after reading Definition 4.
* This is a side remark and I will not require it addressed in the revised version: the expressiveness of Definition 3 is very limited. For example, adding additional birth dates to a person will remain unnoticed. I would suggest, as future work, using a more expressive language here or, possibly, even a pattern mining approach to define features. For an example of such an approach, consisting of feature construction by pattern mining and propositionalisation, see: Agnieszka Lawrynowicz, Jedrzej Potoniec: Pattern Based Feature Construction in Semantic Data Mining. Int. J. Semantic Web Inf. Syst. 10(1): 27-65 (2014)
* Definition 4: P(I_PCB) – I think “P” is an undefined symbol here.
* Page 5, left column, line 25: “in related to” → “is related to”
Section 3.2:
* Definition 5: AR(u) should be formally defined.
Section 3.3:
* I find the abbreviated names confusing and would prefer full names instead.
* The running example with its X, Y and Z is not very user friendly. Perhaps it could be a description of a real movie?
* Page 6, left column, line 27: “have yet” → “are yet”
Section 4:
* The names of the subsections are very strange. If I guess correctly, they are abbreviations and stand for, e.g., “Measuring Structural Similarity with Patterns Through Compression”. Please do not do that.
Section 4.1:
* The English in the first sentence is particularly strange.
* Definition 7: I find it surprising that similarity is (a) not symmetric; (b) decreasing with objects becoming more and more similar. Please consider renaming it.
* Line 31: “Typically” → Why typically? What makes this typical? What are non-typical examples?
Section 4.2:
* Please provide appropriate equations when discussing Laplace smoothing.
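For concreteness, one plausible form (my reconstruction, not necessarily what the authors do) is add-one smoothing over the usage counts, so that a pattern with zero usage in the target database still receives a finite code length:

```python
from math import log2

def smoothed_code_length(usage, total_usage, table_size, alpha=1):
    """-log2 of the Laplace-smoothed usage frequency:
    p(X) = (usage(X) + alpha) / (total_usage + alpha * table_size).
    Function and parameter names are hypothetical."""
    p = (usage + alpha) / (total_usage + alpha * table_size)
    return -log2(p)

# A pattern never used in this database still gets a finite length:
unseen = smoothed_code_length(0, 65, 5)   # -log2(1/70), ~6.13 bits
```

Whatever the exact form, the equation should appear in the paper.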
* This section introduces the word “codify” to denote encoding using a code table. I find this very surprising and I think that “encode” would be a much better choice (of course, this is relevant for all the other uses of “codify” in the paper as well)
* Please provide more details for calculating L(q’|CT_1): expand the sums and preferably compute the numeric value (the same goes for other calculations in the running examples)
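As an illustration of the level of detail I would expect: the encoded length of a transaction given a code table expands to a sum of per-pattern lengths over its cover. A self-contained sketch (hypothetical names, assuming the standard cover-based encoding):

```python
from math import log2

def encoded_length(transaction, code_table, usage):
    """L(t | CT) = sum of -log2(usage(X)/total) over the patterns X
    in the greedy cover of t (code_table is in cover order)."""
    total = sum(usage.values())
    remaining = set(transaction)
    bits = 0.0
    for pattern in code_table:
        if pattern <= remaining and usage[pattern] > 0:
            bits += -log2(usage[pattern] / total)
            remaining -= pattern
        if not remaining:
            break
    if remaining:
        raise ValueError("transaction not fully covered")
    return bits

# Toy code table: {a,b} used 3 times, singletons {a}, {b} once each.
ab = frozenset({"a", "b"})
ct = [ab, frozenset({"a"}), frozenset({"b"})]
usage = {ab: 3, frozenset({"a"}): 1, frozenset({"b"}): 1}
L = encoded_length({"a", "b"}, ct, usage)   # -log2(3/5), ~0.737 bits
```

Expanding the running example down to such numeric values would make the calculations verifiable.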
Section 4.3:
* Page 8, left column, line 44: “the number of bits” → explain why this is so.
* Page 8, right column, lines 5-12: “To compensate...” → please explain the rationale in more detail and show that it has some desirable properties. Currently it looks very ad hoc.
* Page 8, right column, line 15: “it can be proved” → Please provide the proof.
* Page 8, right column, lines 29, 30: “status of the knowledge base” → “state of the knowledge base”
* Page 8, right column, line 43-45: I think that whenever a singleton is an argument of the code function, it should be code_SCT and not code_CT
Section 5.1:
* Page 9, right column, lines 25, 26: “assymetric” → “asymmetric”
* Page 9, right column, lines 41-47: Why is this relevant? It does not seem to be used anywhere in the paper.
Section 5.2:
* I do not understand Figure 2 and currently I do not think it improves the paper. Please describe it more clearly or remove it.
* Page 10, left column, lines 16-17: This would require some sort of aggregation, which is not discussed here.
* Page 10, right column, lines 20-23: The scenario with different ETL pipelines seems to me quite forced.
* Page 10, right column, lines 31-45: I think that you should disentangle interpretation and administrator actions. If I understand correctly, the interpretation is objective/task independent: a positive value indicates a converging structure, while a negative value indicates a diverging structure. The administrator action is, indeed, subjective/task dependent.
* Page 10, right column, lines 48-51: How is the administrator supposed to do the comparison?
Section 6:
* There should be clearly stated and highlighted hypotheses/research questions for all the experiments. Otherwise, it is impossible to judge whether an experiment actually makes any sense.
Section 6.2:
* I find most of this section uninteresting. LUBM is fine for benchmarking, but of little use otherwise. In particular, discussing behavior of measures on a synthetic dataset is quite pointless, as this does not reflect the expected behavior in the real-world scenario.
Section 6.2.2:
* Page 12, left column, new definition for sim: This definition uses a new function “ratio”. I assume this is the same as “L%” defined earlier in the paper. If so, then this is a theoretical property of the measure, and you should discuss it earlier in the paper, provide a proof, and remove Section 6.2.2 entirely.
Section 6.2.3:
* The hypothesis should be changed to a theorem and proved: what exactly a disrupting update is should be defined and then, from this, you should show that it implies that the delta will be positive. (The same goes for a restoring update.) Currently, you try to simulate them, but the simulation is flawed and, in the last paragraph, you arrive at the conclusion that sometimes a disrupting update actually improves the situation.
* Similarly, you experimentally show that the higher the probability of disruption, the wider the range of deltas. This is some kind of monotonicity and, again, should be proved: if an update X is more disruptive than Y, then delta(X)>=delta(Y).
Section 6.3.1:
* Page 13, right column, lines 15, 16: As far as I know, dbp: namespace is not part of the DBpedia ontology. It represents raw results of information extraction from Wikipedia. I think you would be better off not using this namespace: the graphs would be smaller and more regular.
* Page 13, right column, line 38: “compression ratio between 0.260 and 0.300” → but compression ratio for 2016-10 is 0.306. Also, I wonder if compression ratio is the same as “L%”?
* Table 6 and the text describing it in the left column of page 14: honestly, I do not clearly see the two clusters as described, e.g., sim(3.6|3.9)>sim(2016-10|2015-10)>sim(2016-10|3.9). If this definition of clusters is the result of some algorithm, then it must be clearly stated.
* Page 14, left column, last line: “shows scalability of our approach” → I appreciate that you were able to perform the comparison in 5 minutes, but you fail to mention that, in order to perform the comparison, you spent 216 hours overall computing the code tables. Also, seeing that you had to extend the deadline for computing the code tables for 2015-10 and 2016-10, I suspect that scalability could be an issue.
* Page 14, right column and Figure 6: It seems to me that you claim that the complexity is O(#NoSingletonCodes*#CodifiedTransactions). Would it be possible to explain why this is the case, instead of supporting it only by line fitting?
* It seems to me that Table 8 is not referenced anywhere in the text. Also, I am a bit surprised that the number of non-singletons is the lowest for 2014. Why is that?
Section 6.3.3:
* Page 15, right column, lines 39-43: I disagree with the conclusion that there exists a source using an outdated schema. It may just as well be a sign of data evolution. If I understand correctly, for this to happen, it is enough that there was a group of resources with little description in 2014. Most of them got described in 2015 and their updates fit 2015, but a small number remained underdescribed and fit the 2014 schema better.
* Page 16, left column: “punctual”, “raise awareness” are probably not the best wordings in this context.
* Figure 7 is quite large, but not particularly interesting.
Section 7:
* The problem considered in the paper is not new, and thus I would rather see this section at the beginning of the paper, explaining the novelty of the approach.
* This section is mostly concerned with methods for assessing the evolution of the data. However, as the proposed approach is rooted in pattern mining, I find this section lacking in references, even to works published in the same journal it was submitted to, e.g., http://www.semantic-web-journal.net/content/discovery-emerging-design-pa....
* Page 17, right column, line 39: “obtention” is probably not quite the right word there.