Separability and Its Approximations in Ontology-based Data Management

Tracking #: 3288-4502

Gianluca Cima
Federico Croce
Maurizio Lenzerini

Responsible editor: 
Guest Editors Ontologies in XAI

Submission type: 
Full Paper
Abstract:
Given two datasets, i.e., two sets of tuples of constants, representing positive and negative examples, logical separability is the reasoning task of finding a formula in a certain target query language that separates them. As already pointed out in previous works, this task turns out to be relevant in several application scenarios such as concept learning and generating referring expressions. Besides, if we think of the input datasets of positive and negative examples as composed of tuples of constants classified, respectively, positively and negatively by a black-box model, then the separating formula can be used to provide global post-hoc explanations of such a model. In this paper, we study the separability task in the context of Ontology-based Data Management (OBDM), in which a domain ontology provides a high-level, logic-based specification of a domain of interest, semantically linked through suitable mapping assertions to the data source layer of an information system. Since a formula that properly separates (proper separation) two input datasets does not always exist, our first contribution is to propose (best) approximations of the proper separation, called (minimally) complete and (maximally) sound separations. We do this by presenting a general framework for separability in OBDM. Then, in a scenario that uses by far the most popular languages for the OBDM paradigm, our second contribution is a comprehensive study of three natural computational problems associated with the framework, namely Verification (check whether a given formula is a proper, complete, or sound separation of two given datasets), Existence (check whether a proper, or best approximated separation of two given datasets exists at all), and Computation (compute any proper, or any best approximated separation of two given datasets).
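The three notions in the abstract can be illustrated with a toy sketch (not from the paper): here a query is abstracted as a plain Python predicate over tuples, standing in for the certain answers of a UCQ over an OBDM system; all names and example data are illustrative.

```python
# Toy illustration of proper / complete / sound separation.
# A "query" is modelled as a boolean predicate over tuples of constants.

def is_complete(query, positives):
    # complete separation: every positive example is an answer to the query
    return all(query(t) for t in positives)

def is_sound(query, negatives):
    # sound separation: no negative example is an answer to the query
    return not any(query(t) for t in negatives)

def is_proper_separation(query, positives, negatives):
    # proper separation: complete and sound at the same time
    return is_complete(query, positives) and is_sound(query, negatives)

positives = [("alice",), ("bob",)]
negatives = [("carol",)]
query = lambda t: t[0] in {"alice", "bob"}
print(is_proper_separation(query, positives, negatives))  # True
```

The trivially-true query is always complete but rarely sound, and the trivially-false query is always sound but rarely complete, which is why the paper's minimally complete and maximally sound approximations are the interesting objects when no proper separation exists.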
Decision/Status:
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 01/Nov/2022
Review Comment:

The authors revised their paper and responded to the comments of the reviewers. I am happy with the new version.

Review #2
Anonymous submitted on 04/Nov/2022
Review Comment:

The authors addressed my questions and I am happy to accept the paper.

Review #3
By Jean Christoph Jung submitted on 07/Nov/2022
Major Revision
Review Comment:

I had another, more detailed look at the paper, and to be honest, I find it relatively cumbersome to read, for several reasons. I seriously think that the paper needs a major revision before it can be accepted. (Sorry that I took the more detailed look only in the second round.)

I am not satisfied with the use of the signatures.
- It is not good practice to use the symbol \mathcal{O} both for the ontology and the underlying signature.
- Is it necessary to index all queries with their signature? It makes the notation heavy.
- “UCQ q over signature S” -> “S-UCQ”?
- It seems that an assumption underlying OBDM systems is that the signature of the database is disjoint from the signature of the ontology (otherwise, M(D) is not guaranteed to be finite), but I didn’t find it explicitly mentioned
- Interpretations usually give meaning to EVERY concept/role/constant name over a countably infinite supply of concept names, role names, constants. It is rather standard to fix these infinite supplies beforehand; ideally, very early in the preliminaries
- It might be the most uniform solution to include a source and a target signature in the OBDM system (as it is very common in data integration settings)

I find the preliminaries not ideal.
- The general discussion of queries (p6, lines 46-51) is not necessary (and not helpful)
- P7, lines 13-16, not necessary (and it seems never used!)
- The semantics of conjunctive queries (one of the main notions!) is introduced relatively informally (“Observe that …”). Make it explicit that you define it via homomorphisms
- I personally prefer q(D) instead of q^D, but this is a matter of taste
- The notion “set(phi)” seems undefined
- It seems cumbersome to separate the definition of an OBDM system into a database and an OBDA specification: after all, almost everything in the paper seems to use OBDM systems directly. So why not just introduce OBDM systems as a 4-tuple (or rather a 5-tuple in case the target signature is added)?
- Semantics of OBDM systems unclear: first, at this point it is not clear what is M (defined only later); second, shouldn’t I be a model of the database as well? Third, notation “\langle I, D\rangle” as the set of facts over S\union O (p8, line 28/29) is unfortunate: neither S nor O appear in \langle I, D\rangle. And I am not sure something like this is needed: why not just I\models M?
- A better order might be:
1. DL-Lite + semantics
2. Conjunctive queries + semantics (but one can give the semantics directly in terms of (finite or infinite sets of facts))
3. OBDM systems + semantics of query answering
4. Finally, the necessary preliminaries on perfect ref, canonical structures, and map ref (now, especially this last point is mixed with the other things).

I agree with R1 that the ontology language is essentially always DL-Lite_R and the query language is always UCQ, so they should be dropped from the notation. You can say in the few cases where lower bounds hold already without an ontology/CQs precisely this in the theorem (similar to what you do already).

The naming of a sound / complete separation is misleading: Isn’t the “separation” part removed when you talk about soundness / completeness? Moreover, isn’t your “complete separation verification” just query answering over OBDM systems: q is a complete separation query if all positive examples are certain answers to q. Thus, the complexity is (essentially) inherited from query answering (the complexity of which should be known). This sheds new light on the entire section 5; it should simplify a lot. In fact, given what was said above I wonder whether the complete / sound separations (without minimality/maximality) are very interesting? Last paragraph of Page 10 is misleading: you don’t generalize a definition by adding something and never using it.
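The reduction sketched above (complete-separation verification is just certain-answer checking) can be made concrete with a small illustrative sketch; "models" here are finite sets of facts standing in for the models of the OBDM system, and all names are made up for the example.

```python
# Illustrative sketch of the reviewer's observation: q is a complete
# separation iff every positive example is a certain answer to q,
# i.e. an answer of q in every model of the OBDM system.

def certain_answers(query, models):
    # intersection of the query's answers over all models
    answer_sets = [set(query(m)) for m in models]
    return set.intersection(*answer_sets) if answer_sets else set()

def is_complete_separation(query, models, positives):
    return set(positives) <= certain_answers(query, models)

# two "models", each a finite set of (predicate, constant) facts;
# the query returns the constants of all P-facts in a model
models = [
    {("P", "a"), ("P", "b"), ("Q", "c")},
    {("P", "a"), ("P", "b")},
]
query = lambda m: {c for (pred, c) in m if pred == "P"}
print(is_complete_separation(query, models, {"a", "b"}))  # True
print(is_complete_separation(query, models, {"a", "c"}))  # False
```

Under this reading, complete-separation verification inherits its complexity directly from certain-answer query answering over OBDM systems, as the reviewer suggests.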

Certainly, more interesting are the notions of maximally sound / minimally complete “separations” (they really shouldn’t be called separations). I think in the DL literature, the “minimally complete separations” are often called “least general generalizations” (see e.g. Jung, Lutz, Wolter: Least General Generalizations in Description Logic: Verification and Existence. AAAI 2020 and references therein). The dual (maximally sound) would then be something like “most general ”, but there might be a better term?

Similar remarks apply to the term “characterization”: if you give up that the positive examples are exactly the certain answers, it is not a "characterization" anymore; rather an approximation from below or an approximation from above (but there might be better terminology). Moreover, what is the difference between your notions “complete separation“ and “complete characterization”?

Regarding sections 6 and 7: Usually, the computation problem is harder than the existence problem; so I wonder whether generally the computation problem should come first in the paper. I think the reason why it works in this case is that the sound/complete approximations always exist (so existence is trivial) and the proof that they always exist is given by direct algorithms to compute them. The difficult case is proper separation / characterization, and I believe that the separability / characterizability tests after Corollary 3 can be proved without relying on Theorem 5. I recommend making this more explicit in the paper to help the reader, but my earlier comments are certainly more important.

It was not clear for me what is the mapping language considered in Section 6. In particular, both algorithms MinCompSeparation and MaxSoundSeparation compute in their first step M(D) which is an exponential step at least for GLAV mappings. So what is the running time of the resulting algorithms? Single exponential or double exponential? Is this optimal? Also, the size of the computed queries is exponential. Is this optimal? I admit that some of these questions might be hard to answer (I haven’t thought about them in detail), but I do think that these questions should at least be discussed in a section about the computation problem.

Some minor remarks:
- I wouldn’t use the word “hardnesses”. Rather e.g. “In both cases, hardness / the lower bound already holds for …”
- Remark after Cor 2: two interesting novel results are announced, but only one is mentioned.
- Using \cup as the connector in UCQs is a bit confusing when CQs are written in set notation { … }. I was confused because it looks like a union of sets, but it is not.
- P23, lines 44-47. The problem is not deciding whether phi is “true”, but whether phi is “satisfiable” (appears throughout the proof)
- Def 1: The letter Sigma is often used as a signature in the literature. So, when you write Sigma-separation, it may seem that a separating formula in the signature Sigma is sought. Besides this, I find “Sigma-separation” for an OBDM system Sigma also not very intuitive. It might be better to use “Q-separation of lambda^+ and lambda^- in the OBDM system Sigma”. Since Q is essentially always UCQ, one could argue to drop it throughout the paper. Moreover, in which signature is the separating query formulated?

I think my remarks make clear that the initial five sections should be seriously reworked and clarified. Of course, especially some changes in the preliminaries will affect also later sections, but I do think it is worth the effort, because right now it is very hard to read, the notions (and their names) are just not clear or rather misleading, and the structure of the paper is sub-optimal.