Data Access over Large Semi-Structured Databases

Tracking #: 603-1811

Bruno Paiva Lima Da Silva
Jean-François Baget
Madalina Croitoru

Responsible editor: 
Pascal Hitzler

Submission type: 
Full Paper
Ontology-Based Data Access is a problem aiming at answering conjunctive queries over facts enriched by ontological information. There are today two manners of encoding such ontological content: Description Logics and rule-based languages. The emergence of very large knowledge bases, often with unstructured information, has provided an additional challenge to the problem. In this work, we will study the elementary operations needed in order to set up the storage and querying foundations of a rule-based reasoning system. The study of different storage solutions has led us to develop ALASKA, a generic and logic-based architecture regrouping different storage methods and enabling one to write reasoning programs generically. This paper features the design and construction of such an architecture, and the use of it in order to verify the efficiency of storage methods for the key operations for RBDA, storage on disk and entailment computing.

Solicited Reviews:
Review #1
Anonymous submitted on 05/Feb/2014
Major Revision
Review Comment:

This is a paper that is easy to read, but not easy to follow.
In fact, I still have not been able to classify it as either
a description of a tool, a proposal for a new architecture, or
simply a research paper (in which case it is not easy to determine
the focused contributions).

The authors made the following claims:
- In this work we will study the elementary operations needed in
order to set up the storage and querying foundations of a rule-based
reasoning system [...]
This paper features the design and construction of such architecture (abstract)
- The problem addressed in this paper is the Ontology-based
data access problem (p. 1)
- [...] we are able to set as a goal to obtain a system that
would be able to perform conjunctive queries over KB of any size
and structure (p. 8)
- we propose a software platform (p. 7)
- one of our main contributions of the software is to help make
explicit different storage system choices for the knowledge
engineer in charge of the manipulation of the data (p. 9)
- In this paper we have presented ALASKA, the software architecture
we have designed and we have used to study the efficiency of
existing storage systems for the elementary operations of
conjunctive query answering. (p. 24)

If I understood well, we have to evaluate a software system
(more than an architecture), whose requirements are set.
Let us analyze section by section:

Section 2. a descriptive and background section.
The paper begins at a rather abstract level, with many definitions
and background. One learns in section 2.2.1. that the paper
"will focus only in KB stored on disk".

Section 3. A descriptive section.
This is a narrative description of the system ALASKA.
There is not much one can comment on it.

Section 4: An example.

Section 5: This section makes the link of ALASKA with databases,
by describing how different KBs are stored. The most interesting
part is subsection 5.2.3 Challenges:
- Memory consumption
- Garbage collecting
Unfortunately they tell us nothing concrete, only a general
description of the experiments made (no details), and some general
conclusions derived from them.
Subsection 5.2.4. Contribution, on the other hand, tells us why some design decisions have been made.
In 5.2.5 Results, a comparison with other systems (Jena in particular)
is mentioned, but nothing conclusive.

Section 6: Querying.
Only time is tested. Memory consumption (a crucial factor when
considering large data sets that do not fit in main memory) is not considered (there is no convincing argument why).

The experiments in 6.2 did not convince me. The first paragraph reads:
"After a first battery of tests, in which neat conclusions were
difficult to obtain, we have focused in the adaptation of our
problem into a CSP problem and the integration of a CSP solving
program in order to address conjunctive query answering".

In 6.2.2., the queries selected for experiments (items 1 to 4)
show only "star" queries. The main problem, as is well known,
is not these queries, but join queries, e.g. of the type
p(x,y) AND q(y,z) AND r(z,w), etc.
Thus the experiments do not tell very much about the strength
of the system.
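The difference between the two query shapes can be illustrated with a minimal sketch. It assumes that a "star" query is one whose atoms all share a single central variable; the evaluator `answers`, the `?`-prefix convention for variables, and the sample facts are hypothetical illustrations and are not taken from the paper or from ALASKA.

```python
from typing import Dict, Iterator, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate, terms); variables start with "?"

def answers(query: List[Atom], facts: List[Atom]) -> Iterator[Dict[str, str]]:
    """Naive backtracking evaluation of a conjunctive query over ground facts."""
    def extend(i: int, subst: Dict[str, str]) -> Iterator[Dict[str, str]]:
        if i == len(query):
            yield dict(subst)
            return
        pred, terms = query[i]
        for fpred, fterms in facts:
            if fpred != pred or len(fterms) != len(terms):
                continue
            new = dict(subst)
            ok = True
            for t, v in zip(terms, fterms):
                if t.startswith("?"):
                    # bind the variable, or check it against its earlier binding
                    if new.setdefault(t, v) != v:
                        ok = False
                        break
                elif t != v:
                    ok = False
                    break
            if ok:
                yield from extend(i + 1, new)
    yield from extend(0, {})

# Hypothetical toy facts.
FACTS: List[Atom] = [
    ("p", ("a", "b")), ("q", ("b", "c")), ("r", ("c", "d")),
    ("q", ("a", "e")), ("r", ("a", "f")),
]

# Star query (assumed shape): every atom shares the variable ?x.
STAR: List[Atom] = [("p", ("?x", "?y")), ("q", ("?x", "?z")), ("r", ("?x", "?w"))]

# Join (chain) query: p(x,y) AND q(y,z) AND r(z,w).
CHAIN: List[Atom] = [("p", ("?x", "?y")), ("q", ("?y", "?z")), ("r", ("?z", "?w"))]

star_answers = list(answers(STAR, FACTS))
chain_answers = list(answers(CHAIN, FACTS))
```

In the chain query each atom introduces a fresh join variable, so the intermediate results of one atom feed the next; in the star query all atoms are filtered by the same binding of `?x`.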

Section 7. Discussion and Future Work.

After the material presented in the previous sections, there is not
much one can add. In my opinion the paper needs more focus, to
concentrate on one or two topics, e.g. to show why ALASKA is
needed, why and in what scenarios it outperforms other systems,
and most importantly, to design and run good experiments to prove
the claims.
Another focus could be the subtitle: "A generic approach towards
rule-based systems". In this case, it would be good to devise a sort
of benchmark. The question that comes up over and over when reading the
paper is: what are the main problems or challenges of such systems,
and why and how does ALASKA solve them, or address them better?

Details of form:

1. Tables and figures font should be enlarged (e.g. p. 4, 7, 16, etc.)

Review #2
By Ernesto Jimenez-Ruiz submitted on 06/Apr/2014
Major Revision
Review Comment:

The authors present ALASKA, a generic platform for RBDA (Rule-Based Data Access). ALASKA aims at being used to verify the efficiency of storage methods for the key operations for RBDA, storage on disk and entailment computing.
ALASKA does not yet integrate any ontological reasoning features, nor integration of "lite" description logics families such as DL-Lite or EL families.
The paper presents an evaluation using one relational database, two graph databases and one triple store.

The presented approach seems interesting but, as discussed by the authors in the conclusions, it is in an early stage and they still have many open lines for future work.
Furthermore, I do not think the paper is completely suitable for the "semantics for big data" special issue since ALASKA does not seem yet to have an infrastructure nor a clear research line/motivation to address (some of) the big data problems.
Additionally, the paper is not easy to read and follow in some sections since it tends to be too descriptive. Some rewriting and reorganization is required. The paper also contains many broken or missing references to algorithms, sections, etc. A proofread of the paper is necessary.

The abstract gives more or less an overview of the purpose of the paper, the introduction is, however, less satisfactory.
In the introduction one can find in different points sentences as:
- The problem addressed in this paper...
- The contribution of this paper...
- For this reason we will focus...
A wrap-up paragraph summarizing the addressed problem, the clear contribution and the main focus would help. A motivating scenario would also help.
Section 1.1. should be at the very end of the introduction. Section 1.2 could be integrated in Section 2 (or Section 2.2) or be a Section itself.
Section 2.3. gives some notions of motivation and novelty; I think this should also be emphasized in the introduction. The "main" focus of the paper should also be clearly stated (e.g. on page 7, two different focuses of the paper are mentioned at two points. Perhaps focus is not always the most suitable word). Also in Section 6, the authors try to "define the purpose of using ALASKA as a querying interface"; this should be clarified from the very beginning.

The title may also lead to misunderstandings since one tends to read only the first line, ignoring the "A generic approach towards rule-based systems".

In section 3.4.2: the semantics for predicates with arity > 2 is not clear, e.g. for a predicate hasTwoParents(givenChild, parent1, parent2), should the methods not return both parent1 and parent2 as terms connected to givenChild? Unfortunately, section 4 does not give an example with arity > 2, either.
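One possible reading of the expected behaviour is sketched below. The function `connected_terms` and the tuple encoding of atoms are hypothetical and are not ALASKA's actual API; the sketch only shows the semantics in which every other term of an atom counts as connected to the given term, whatever the arity of the predicate.

```python
from typing import Iterable, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate, terms)

def connected_terms(term: str, atoms: Iterable[Atom]) -> Set[str]:
    """Return every term that co-occurs with `term` in at least one atom."""
    result: Set[str] = set()
    for _pred, terms in atoms:
        if term in terms:
            # all other positions of the atom are treated as neighbours
            result.update(t for t in terms if t != term)
    return result

# Hypothetical facts: one ternary atom and one binary atom.
ATOMS = [
    ("hasTwoParents", ("givenChild", "parent1", "parent2")),
    ("knows", ("parent1", "bob")),
]

neighbours = connected_terms("givenChild", ATOMS)
```

Under this reading, `neighbours` contains both `parent1` and `parent2`; a binary-only encoding would need an explicit convention for the third argument.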

In Section 7 the authors emphasize the lack of comparison with SPARQL as a current limitation of ALASKA. I think this is indeed an important limitation. I may be wrong, but one could think that converting any input database to a triple store and using standard OBDA over that triple store would be much easier than the approach presented by ALASKA. Furthermore, there are already many systems that have been proven to scale with very large amounts of triples. A further comparison would help to understand what the strong points of ALASKA actually are.
Suggested paper:

[1] Markus Krötzsch, Sebastian Rudolph. Second-Order Queries for Rule-Based Data Access. In Institute AIFB Technical Report 3019. Karlsruhe Institute of Technology 2011.

Other comments:
- Page 5, sect 2.2.3: and other -> and otherS
- Page 5: "One of the motivations of ALASKA features is the big addition of facts due to forward chaining rule application. This is the reason why in this work we do not focus on backward chaining". Not sure if the meaning of this sentence is clear.
- Page 7: "Prolog and CoGITaNT are the only systems previously described..." Their first and only occurrence is in Table 2
- Page 7: broken sentence: "In the list of the main factors that have led to this situation, the emergence of very large and unstructured knowledge base"
- Page 8: an user -> a user
- Page 9, paragraph 1: that -> which
- Page 10: end line first columns, remove "."
- page 11: Section 3.4 starts with the sentence "As previously mentioned in Section 3...". Since we are still in Section 3 a concrete reference to a subsection is required.
- Page 11: ... using those methods TO ensure...
- Page 11-12: addAtom function introduced in last section? Something is mentioned in section 3.2, but I think it has not been properly introduced yet.
- Page 12: which are the first and second algorithms presented?
- page 12: "enumerate" function presented in section 3? Where exactly?
- page 12: from what the text describes, it should be P(a, X) instead of P(X, a)
- page 16: Algorithms 3 and 4 are referenced as 5.1.1
- Page 17: Algorithms 5 and 6 are referenced as 10 and 5.1.12!
- Page 19. Again wrong reference to Algorithm 8
- Page 21: The numbers and order of the algorithms are broken.
- Please proofread the paper; there are many small issues that make it really hard to follow.

Review #3
By Steffen Lamparter submitted on 09/Apr/2014
Major Revision
Review Comment:

In the article "Data Access over Large Semi-Structured Databases" the authors describe in detail their ALASKA framework ("Abstract and Logic-based Architecture for Storage and Knowledge bases Analysis"). The framework aims at a common architecture for rule and description logics based knowledge bases. In the first place, it presents a unified formalism for accessing different underlying storage systems via conjunctive queries. Second, besides outlining the unified language and architecture the paper addresses in particular the design of a system for rule-based data access (RBDA). Third, the paper shows how the proposed framework can be used to compare the efficiency of different storage systems (such as relational databases, graph databases and triple stores) in answering conjunctive queries.
In general, the problem of combining rule and description logic based systems is of course highly interesting and of huge practical relevance. For example, in the context of knowledge-based management of technical systems (e.g. system diagnostics) often rule as well as DL based approaches are required (and also often used) in parallel. While rule based systems are well suited to realize proactive alerting mechanisms, description logics are more suitable to express complex diagnostic knowledge. Also analyzing the efficiency of different storage systems with respect to answering conjunctive queries is indeed an interesting topic. The idea to have a platform that allows direct comparisons of different storage systems without much effort is appealing.


After reading the article I still have serious problems in understanding the real contribution of the paper, for several reasons:

1. Is the contribution the unified framework for rule and DL-based systems, is the contribution a novel RBDA system (only supporting rule-based reasoning) or is it about an architecture for different backend systems? Maybe there is a good reason for discussing all these topics in one paper but this should be explained much better.
2. Within the abstract the authors say they study storage and querying foundations of rule-based reasoning systems and in 1.2 they describe that the problem of RBDA is "finding answers to a query which can be deduced from facts, eventually using inference allowed by an ontology". However later on (page 9) the authors write that ALASKA "does not yet integrate any ontological reasoning features". For me it does not become really clear what is implemented and what is not. It is also unclear if there is any integration with DL-based OBDA systems (which is claimed as the major contribution).
3. The title mentions data access to "large semi-structured data", but semi-structured data is not discussed within the context of the ALASKA platform.
4. It is unclear why the authors provide all the logical definitions and examples in section 2. Why not simply reference existing literature? Why are all these definitions formally introduced although not required later on in the paper? Is there anything that differs from the standard notation? It is even more confusing that in 1.1 the authors write that in section 2 the "problems and motivation of the paper" are presented while they actually discuss logical foundations.
5. While one can understand that conjunctive queries can be transformed to a Constraint Satisfaction Problem, it is unclear why CSPs are introduced in the end. Should this be an alternative approach?
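The transformation mentioned in item 5 can be sketched as follows. This is a minimal illustration of the standard correspondence (query variables become CSP variables, constants occurring in the facts form the domain, and each query atom becomes a constraint whose allowed tuples are the facts with the same predicate); the names `query_to_csp` and `solve`, the `?`-prefix for variables, and the toy data are assumptions, and the brute-force solver stands in for whatever CSP solver the paper actually integrates.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]               # (predicate, terms)
Constraint = Tuple[Tuple[str, ...], Set[Tuple[str, ...]]]  # (scope, allowed tuples)

def query_to_csp(query: List[Atom], facts: List[Atom]):
    """Build (variables, domain, constraints) from a conjunctive query and facts."""
    variables = sorted({t for _, terms in query for t in terms if t.startswith("?")})
    domain = sorted({c for _, terms in facts for c in terms})
    constraints: List[Constraint] = []
    for pred, terms in query:
        allowed = {fterms for fpred, fterms in facts
                   if fpred == pred and len(fterms) == len(terms)}
        constraints.append((terms, allowed))
    return variables, domain, constraints

def solve(variables: List[str], domain: List[str],
          constraints: List[Constraint]) -> List[Dict[str, str]]:
    """Brute-force enumeration; a real CSP solver would prune and propagate."""
    solutions = []
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # constants in a scope map to themselves via .get(t, t)
        if all(tuple(assignment.get(t, t) for t in scope) in allowed
               for scope, allowed in constraints):
            solutions.append(assignment)
    return solutions

# Hypothetical toy knowledge base and query p(x,y) AND q(y,z).
FACTS: List[Atom] = [("p", ("a", "b")), ("q", ("b", "c")), ("q", ("b", "d"))]
QUERY: List[Atom] = [("p", ("?x", "?y")), ("q", ("?y", "?z"))]

solutions = solve(*query_to_csp(QUERY, FACTS))
```

Each solution of the CSP is an answer to the conjunctive query and vice versa, which is why the reformulation is possible; whether it is meant as an alternative evaluation strategy is the question the reviewer raises.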

Maybe it would be better to focus on a single aspect/contribution in the paper (e.g. only discuss RBDA in detail or focus on comparing different backend systems) and streamline the paper in this direction.

Application and Evaluation of ALASKA:

In section 6.2 the authors show how ALASKA is used to compare different storage systems such as relational databases, graph databases and triple stores with respect to answering conjunctive queries. Data is loaded into the different stores according to the elementary operations described in previous sections. The authors use a set of four conjunctive queries and measure the results on databases with increasing data volume.

However the comparison is not completely convincing due to:
- Four different queries are not representative
- Too simple queries: only up to three conjunctions
- The knowledge base is very small: An RDF file with 100000 triples is only about 10 MB
- The authors write that they do not use indices for the relational and graph databases. But that is an important feature of these storage systems and one of the main reasons why they are used.

In addition, the experiments described in the paper should be much more convincing in evaluating the contribution of the paper. On page 2 the authors write that the contribution of the paper is a “generic architecture (ALASKA) … for enabling storage and querying of large KBs on disk avoiding out of memory limitations”. However they do not show that their system is actually able to handle such large data.

Further open questions:

- Within the examples provided and also within the evaluation, the information was structured, not semi-structured. How does ALASKA deal with semi-structured data?
- There is no description of related work? E.g. there is already a lot of work about unifying DL and rule based reasoning approaches (e.g. Boris Motik and Riccardo Rosati. Reconciling Description Logics and Rules. Journal of the ACM, 57(5):1–62, 2010.)
- How does the approach make sure that efficiency is not influenced by ALASKA? E.g. in figure 11, I'm wondering whether directly loading the RDF file to the triple store results in the same representation in the triple store?
- In section 3 the authors describe ALASKA and the elementary operations or "core functions" of ALASKA are translated and applied to the different storage systems. For instance the authors explain how all atoms with a given predicate are retrieved from a graph database or a relational database. Why are triple stores missing in sections 3.4.1., 3.4.2. and 3.4.3.?
- The authors state that in RBDA/OBDA there are in principle two different strategies: forward chaining and backward chaining. Later on (page 5) they write that they do not focus on backward chaining. Why?
- In section 4 and 5 the authors describe how facts are written to the different stores using ALASKA. However, the analysis of storage efficiency is very general and abstract. E.g. "A few experiences have shown that some parsers/methods use more memory resources than others while accessing the information of a knowledge base and transforming it into logic." To get some valuable insights a much more detailed discussion/comparison would be helpful.

Minor comments on style and formatting:

- The overall structure does not help the reader to get the contribution of the paper.
- Sometimes (e.g. section 1.1) the authors refer to the problem definition in section 2. But it should be section 1.2
- English could be improved:
* 2.2.1: "has become" not "has became"
* plural of "triple store" is "triple stores" not "triples stores"
* 5.2.3: "depends on" not "depends of"
* p. 10: "Other systems" not "Others systems"
* 4.: "…be explained by means of an example" not "through the means of an example"
* ...
- Charts in section 4 are not readable in a black/white copy
- Listings of algorithms could be bigger
- Several times lines are longer than the rest of the column (e.g. p.1, p.12, p.14 …).