Review Comment:
This submission describes HyS, a system that computes the atomic decomposition (AD) of an OWL ontology. AD reflects the modular structure of the ontology by compactly representing all its modules, for a certain logic-based module notion. HyS computes AD with modules based on syntactic locality as the underlying module notion. The approach uses a hypergraph representation of the ontology, which neatly generalizes the notion of reachability-based modules originally devised for lightweight ontology languages. The hypergraph representation additionally allows HyS to compute single modules faster than current implementations. The submission describes the underlying approach, the architecture, and an evaluation of HyS on nine biomedical ontologies.
In a nutshell, the approach is elegant, the performance of HyS is impressive, and the problem solved by HyS is a relevant one. The two downsides of the submission are that (1) HyS works for EL ontologies only and it is not said how easy or difficult the planned extension to SROIQ would be; and (2) the presentation lacks clarity in some places, particularly in the technical part. I therefore recommend accepting the submission under the condition that the presentation is improved.
Details according to the evaluation criteria
--------------------------------------------
* Quality
The approach taken is elegant because it is based on an intuitive hypergraph representation of the syntactic dependencies between axioms, and those dependencies directly affect the formation of a locality-based module. Computing the AD and extracting a module neatly boil down to computing (strongly) connected components in the hypergraph. According to the evaluation provided, both tasks can be performed by HyS significantly faster, often by 1-2 orders of magnitude, than with (the few) existing tools.
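For readers unfamiliar with the reduction: computing strongly connected components is a standard linear-time graph task. A minimal Tarjan-style sketch over a plain directed graph (the hypergraph case treated in the paper is more involved; this is only an illustration of the component computation itself) might look like:

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph: node -> iterable of successor nodes.
    Returns a list of SCCs, each a list of nodes."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs
```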
* Importance
Module extraction and modularization of ontologies have been widely recognized as important reasoning tasks for ontology development, reuse, and comprehension. HyS provides infrastructure for solving these tasks automatically. However, the importance of HyS has two restrictions. First, HyS only applies to the EL fragment of OWL. I do not see this as a problem per se; I am just concerned that this tool report is a bit premature because the authors write that they plan an extension to SROIQ (OWL), and it is not made clear how easy or hard this extension would be to realize. The description of the technical approach suggests that SROIQ is covered; so if an extension is straightforward, then I think that it is more appropriate to publish a tool report on "full" HyS.
Second, both AD and logic-based modules are still difficult to use in practice because (a) the connection between the modular structure of an ontology as reflected by AD and its _logical_ structure in a more intuitive sense is still not fully understood, and (b) it is still hard to extract the "right" module because specifying a seed signature is non-trivial, and current tools (including HyS, as I understand) do not provide support for this initial task.
* Impact
Provided that problems (a) and (b) are at least partially solved by the community, I expect HyS to be successful in application areas that involve modularization and module extraction -- even with its restriction to OWL-EL because many biomedical ontologies are written in OWL-EL.
* Clarity, illustration, readability of the submission
The submission consists of three parts: a condensed version of a previous conference paper by the authors describing the technical approach that underlies HyS, a description of the HyS architecture, and an experimental evaluation. The second and third parts are new and make the submission fit into the category 'Tools and Systems Report'. They are mostly clear, but the research questions need to be formulated more explicitly and some design choices require discussion (see detailed comments below). Furthermore, the first part is quite high-level. On the one hand, I appreciate the efforts undertaken by the authors to make the text readable for the general SWJ audience, and the main ideas are all there. On the other hand, I have the impression that some parts of the technical sections are now rather vague and confusing. I will point them out and make suggestions below. Altogether, clarity and readability of the paper will no longer be an issue if the authors manage to find a better tradeoff between brevity and understandability in the technical part.
Detailed comments
-----------------
* The current paper should be clearly delineated from the authors' previous work. Although the WoMO'13 and IWSC'14 papers are cited, it should be made clear which parts of the current paper repeat or summarize their contents. Furthermore, the module extraction algorithm (Fig. 1) is just a slight variation of the one in [3] and should therefore be attributed to [3].
* Section 2.1: it is ok to mention safety, but besides safety, locality guarantees that the module encapsulates knowledge about Sigma, and that seems to be equally important.
* Sec. 2.1: You pick two out of "several syntactic locality notions". Why these two? Why can the others not be handled (or why are they perhaps unimportant)?
* Sec. 2.2: The first sentence is vague. It is important to make the scope of "modules" clear: for a _fixed_ module notion x, one considers the modules for all _seed signatures_ \Sigma. Then two axioms occur in the same atom if they co-occur in all these modules.
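The co-occurrence criterion can be made concrete with a brute-force sketch. Everything here is a hypothetical illustration, not the paper's algorithm: `ontology` maps each axiom to its signature, `module` is an assumed black-box extractor returning the module for a seed signature, and the enumeration of all seed signatures is exponential, so this is viable only for tiny examples:

```python
from itertools import chain, combinations

def atoms(ontology, module):
    """Brute-force atoms: two axioms share an atom iff they have the same
    membership pattern across the modules of ALL seed signatures.
    ontology: axiom -> set of symbols (its signature).
    module: seed signature (set of symbols) -> set of axioms."""
    symbols = set().union(*ontology.values())
    # Enumerate every seed signature (exponential; illustration only).
    seeds = chain.from_iterable(
        combinations(sorted(symbols), r) for r in range(len(symbols) + 1))
    modules = [frozenset(module(set(s))) for s in seeds]
    classes = {}
    for ax in ontology:
        # Membership pattern of this axiom across all modules.
        pattern = tuple(ax in m for m in modules)
        classes.setdefault(pattern, set()).add(ax)
    # Keep only classes of axioms that occur in at least one module.
    return [axs for pat, axs in classes.items() if any(pat)]
```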
* Sec. 2.2, first par: The claim that the atoms partition the ontology relies on two assumptions: (a) the ontology contains only logical axioms, and (b) there are no axioms that are non-local w.r.t. the empty seed signature. These assumptions should be stated in a central position.
* Sec. 2.2, "poset" is jargon, in particular for the mathematically less adept readers. The curly symbol \succeq needs to be linked to the notion "depends". I suggest to drop "poset" and the bulky notation (Atoms^x_O, curly symbol) and instead say that the set of atoms together with dependency constitute the AD. In this spirit, the previous use of "Mod_O^x(sig(alpha))" can be replaced by a more generally understandable precise verbal explanation. Furthermore, dependency between atoms has a similarly important meaning as coocurrence within the same atom; so if you explain one, I think that you should explain the other too.
* Sec. 2.3: "some of the most prominent tools" sounds as if there were lots of tools for extracting locality-based modules and computing the AD.
* Sec. 2.3: The OWL API should be summarized in a couple of sentences. I do not see the use of the version numbers of the tools in this section.
* Sec. 3, first par: Is there a specific reason why the current HyS version supports only bottom-locality? Is the planned extension to top-locality easy or hard? The conclusion says that an extension from EL to SROIQ is planned. Would this be challenging or straightforward?
* Sec. 3: hypergraphs are not really a standard notion, and many SWJ readers may not be familiar with them. I think they should be introduced informally quite early in the paper. Furthermore, it is not obvious how to define the notion of a (strongly) connected component of a hypergraph; so this should be explained before it is first mentioned.
* The examples in Section 3 seem to be overly simple: they do not even involve any existential quantifiers. Also, I suggest to use one running example to demonstrate the ADH as well as its SCCs later. That running example could also accompany the description of the pcADH. Additionally, labels should be provided for the dashed circles "scc_i" in Fig. 6+7 -- these would help precisely with the difficult parts of the preceding explanations.
* Section 3.1 is called "Atomic Decomposition", but what it describes is the computation of a (p)cADH. The central statement that the two coincide is missing.
* Section 3.1: is the notation size(G) and size(H) ever used again?
* Section 3.1: the functionality of collapse_SCCs(.,.) is described three times on pages 5 and 6.
* Section 3.2: The second sentence says that a locality-based module can be computed by computing "the" connected component in an ADH. But which component needs to be taken? Somehow th
* Section 3.2: I think that some words on the complexity of the algorithm in Figure 8 are in order. At first glance, it looks as if extracting a module involves computing a connected component in an exponentially sized graph, while the standard algorithm for syntactic locality-based modules runs in polynomial time. Without further explanation, the algorithm looks suboptimal, and one would not expect the good performance reported later.
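To make the complexity point concrete: forward reachability in a directed hypergraph (the operation underlying reachability-based module extraction) can be computed in time roughly linear in the total hypergraph size, so no exponentially sized structure needs to be materialized. A minimal sketch, under the assumed (hypothetical) encoding of hyperedges as (tail-set, head) pairs, where an edge "fires" once all its tail nodes are reached:

```python
from collections import defaultdict, deque

def reachable(hyperedges, seed):
    """Forward reachability in a directed hypergraph.
    hyperedges: list of (frozenset_of_tail_nodes, head_node) pairs.
    A hyperedge fires, adding its head, once every tail node is reached.
    Runs in time roughly linear in the total size of the hyperedges."""
    reached = set(seed)
    missing = {}                   # edge index -> # of unreached tail nodes
    watching = defaultdict(list)   # node -> edge indices waiting on it
    for i, (tail, _head) in enumerate(hyperedges):
        unreached = tail - reached
        missing[i] = len(unreached)
        for v in unreached:
            watching[v].append(i)
    # Edges whose whole tail is already covered by the seed fire first.
    queue = deque(i for i in missing if missing[i] == 0)
    fired = set()
    while queue:
        i = queue.popleft()
        if i in fired:
            continue
        fired.add(i)
        head = hyperedges[i][1]
        if head not in reached:
            reached.add(head)
            for j in watching[head]:
                missing[j] -= 1
                if missing[j] == 0:
                    queue.append(j)
    return reached
```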
* Section 3.2: As a general remark, I like the idea of using ADHs to extract LBMs because the ADH seems to be the right "data structure" to extract an LBM in a more goal-directed way than the original algorithm in [3], which makes a lot of unnecessary locality checks. I am also aware that Dmitry Tsarkov was able to reduce the number of locality checks in a more efficient implementation of the module extractor in FaCT++ a while ago. It would be interesting to contact him and find out whether his optimizations can be explained by your approach (but I cannot rule out that they are based on different intuitions).
* Section 4.2: What difference does it make whether HyS uses the ADH or the (p)cADH? Is it a choice that should be left to the user?
* Section 4.2, just a suggestion for the command-line interface: it would be helpful to have the option to save the AD/modules into a folder different from the one where the signatures are stored.
* Section 5.1: The mentioning of normalization raises the following question: if you split axioms via normalization and then compute a module, how do you make sure that the module is still a subset of the previous non-normalized ontology? For example, an axiom may be normalized into two axioms, and only one of those may end up in the module. If you just "de-normalize" and extend the module by the second axiom, you have increased the module signature. Can you describe more precisely how (de-)normalization and your module extraction algorithm interact?
* Section 5.2: Why is it important to know that nodes and symbols are represented using integers? And what is gained by this choice? I would expect that the integers get wrapped into a containing object anyway?
* Section 6: The first sentence starts with "For the evaluation of HyS ...". What exactly do you want to evaluate? Performance? Scalability? Comparison with other tools? Using pcADH versus cADH? Please state the research questions explicitly. For example, it is said later that it is "interesting to measure the impact of using two different programming languages" -- is this one of your research questions? If so, what insights have you gained from the experiments?
* Section 6: Some design choices of the experiment require discussion: How/why did you choose the nine ontologies? What makes you confident that this sample is appropriate to yield reliable answers to your research questions? Furthermore, I don't find it obvious to exclude CEL on the grounds that it hasn't been maintained for five years. After all, to my knowledge, CEL computes _reachability-based_ modules which are much closer to the approach reported here, and CEL may indeed perform better than the other (general-purpose) module extractors.
* Section 6, page 10, last paragraph, 14th line from below: this sentence says that using cADH doesn't achieve a significant speedup compared to using pcADH, "despite" the fact that cADH is slightly smaller than pcADH. But two sentences before, it says that the differences in computation time correspond to (are proportional to?) the size differences between cADH and pcADH. It seems to me that both sentences convey the same information, and that the "despite" is misplaced.