Order Matters! Harnessing a World of Orderings for Reasoning over Massive Data

Emanuele Della Valle, Stefan Schlobach, Markus Krötzsch, Alessandro Bozzon, Stefano Ceri, and Ian Horrocks
More and more applications require real-time processing of massive, dynamically generated, ordered data; order is an essential factor as it reflects recency or relevance. Semantic technologies risk being unable to meet the needs of such applications, as they are not equipped with the appropriate instruments for answering queries over massive, highly dynamic, ordered data sets. In this vision paper, we argue that some data management techniques should be exported to the context of semantic technologies, by integrating ordering with reasoning, and by using methods which are inspired by stream and rank-aware data management. We systematically explore the problem space, and point both to problems which have been successfully approached and to problems which still need fundamental research, in an attempt to stimulate and guide a paradigm shift in semantic technologies.

Revised manuscript, now accepted, after an accept with minor revisions in round one. Reviews for the initial version are below.

Solicited review by Alessandra Mileo:

This vision paper considers the problem of dynamically reasoning about Semantic Web data by treating it as streams.
The authors argue that applying appropriate data management techniques to the Semantic Web could help take the ordered nature of data into account during processing, with high potential for accuracy, performance and scalability.

The authors provide a list of scenarios where this intrinsic correlation between data and time should be considered in reasoning to obtain better results.
This of course raises several challenges, especially in terms of trade-offs between scalability, run-time response and quality of the answers, as the authors point out.
Approximation (related to better performance) and parallelism (related to scalability) are considered key orthogonal research dimensions for order-aware reasoning, and some discussion is presented on how reasoning on ordered data streams can benefit from them.

The paper is generally well written and it provides proper arguments for each claim.

I have a few comments/suggestions for minor revisions, related to i) the application scenario and ii) the analysis of related relevant research and techniques.

i) Application scenarios:

It is perfectly clear how real-time analysis would provide better analytics, but I wonder with which specific collaborators, industry partners or experts in the application field these aspects have been discussed.
For example: I suppose there are flight controllers used for adapting flight paths and reacting to emergencies: how do they relate to Section 2.1?
Also, using real-time reasoning to reform simulations for testing jet engines in a reactive way would save experimental time and effort, and probably provide better analysis, but was there any collaboration with jet engine designers? Is there any reference on how time-consuming this activity can become, and on the practical aspects of the simulation in which such an approach would help?

Although the application examples could be made more relevant in the presentation, I believe the paper proposes a scientifically valid analysis, in terms of scalability, of the trade-off between the nature of the reasoning needed and the type of ordering that is relevant for the data under consideration.

ii) Related research:

The authors relate to all relevant approaches in good detail, illustrating how they relate to their vision by incrementally analyzing one of the examples provided as a scenario in Section 2.
The paper can be considered a relevant, critical and complete survey on the topic of stream reasoning for big data, where the main challenge is identified as combining data- and query-driven inference over ordered data.
I am surprised that, among all the approaches considered, the authors do not take into account the non-monotonic stream reasoning proposed in

Martin Gebser, Orkunt Sabuncu, Torsten Schaub: An incremental answer set programming based system for finite model computation. AI Commun. 24(2): 195-212 (2011)

and extensions available at www.potassco.sourceforge.net.
In such systems, formal definitions of soundness and completeness come from the semantics of the solver, and inconsistencies, incomplete information, constraints and preferences can be managed as well. The approach could potentially be placed in Area 20, given its novelty and the fact that it is a current research trend in the logic programming community.
In stream reasoning via LP, approximation w.r.t. the size of the input is managed via sliding windows, soft constraints can be expressed as preferences or weighted constraints, and the use of heuristics makes the conflict-driven Boolean solving algorithm very efficient on NP-hard problems.
Also, I believe there is potential in applying distributed rule-based reasoning (logic programming) and multi-agent systems to parallel reasoning over ordered data.

Although the logic programming approach to stream reasoning has not been concretely integrated with RDF and ontologies, there is interest in tightening such integration in order to use more expressive reasoning over semantic data (the closed-world assumption is an issue, but a few research efforts towards infinite domains are worth considering).
A few works in this direction have been published, bridging the gap between RDF knowledge bases and logic programs, as well as between logic programming and ontologies.
For this reason, I think that relating stream reasoning in logic programming to all the other approaches surveyed in the paper would result in a very complete analysis.

Minor comments:
- Green/Yellow/Red squares mentioned on page 5 should be dark grey if the paper is going to be published in black and white

Solicited review by Denny Vrandecic:

Note on the review: I originally declined to review the paper because I am a frequent and current collaborator of one of the authors (Markus Krötzsch). The editors pointed out that this would be no problem, since the reviews are published together with the name of the reviewer. The review still has to be read in this light, which is why I make it explicit. Furthermore, I have to point out that even though I do have some understanding and knowledge of reasoning in general, I do not have any insight into the current state of the art in stream reasoning or order-aware reasoning, which further limits the usefulness of my review.

The paper proposes introducing ordering as a first-class citizen in reasoning, systematically analyses the current state of the art in different reasoning approaches, and suggests how to extend them. The paper is extremely well structured, and mostly well written (some of the later sections could use some copy-editing, as noted below). Section 5 contains some passages that do not seem relevant to the paper and which I suggest rewriting or leaving out.

Assuming that the content describes the state of the art accurately, I suggest an accept with minor revisions, especially in Section 5.

* Section 2.2 describes that the simulations can take months to create. The further text does not make it sufficiently obvious why a real-time reasoning system is suddenly required, or whether materialization would not be sufficient to replace most of the required reasoning in order to allow for the analysis and visualisation of the simulation results.
* Section 2.3: "In the workflow of real-time city surveillance ... has been identified as an insurmountable obstacle." Citation needed. By whom was it so identified?
* I suggest adding the categories from Section 4 to Figure 3, i.e. grouping the areas graphically as described in Section 4.
* Section 4.3, first paragraph: "... consider ontological background information as a basis for inferring implicit information .... useful for ontology-based information integration, i.e., for handling heterogeneity in the input data." This argument is not very convincing.
* Section 4.4: (In case of inconsistency) "only the most recent statement should be considered." Is this really the only reasonable inconsistency resolution algorithm? Last one wins?
* Section 4.5 contains a paragraph, "As this theoretical framework... can be used as a starting point.", which discusses the need for new quality metrics besides soundness, completeness, etc. This seems out of place in Section 4.5, as it concerns the whole topic of the paper and should be introduced not within the description of a single category of methods, but at a higher level, such as in Section 3 or 5.
* Section 5.1, the two paragraphs "Various forms of approximation have been ... or existential rules [5]." are indeed talking about approximate reasoning in general, but do not have anything to do with the topic of the paper, i.e. order-awareness. The described methods are also not analysed with regard to order-awareness. As they stand, they can be left out of the paper completely. The same is true for the last paragraph of Section 5.2, "In recent years... to a streaming scenario". Coincidentally, these three paragraphs contain a third of the paper's references.

Minor comments:
* Section 1 speaks of petabytes of data being produced by engine design simulations; Section 2.2 (page 3) gets it down to terabytes. Which one is it?
* Section 2.3: "which implies as high labour costs". Remove "as"
* Section 2.4: "Each second, thousands tweets are produced world-wide...". Add "of" before "tweets".
* Section 3: "can be enforcedfor". Add space.
* "be produced as result of". Add "the" before "result".
* Section 4: the listing of the areas for the top-k reasoning category (8, 9, 14, 13, 18 and 19) is not order-aware.
* Section 4.4: "natural temporal order in data stream". Add "a" before "data".
* "They share the notion of RDF stream." Add article.
* "However, alternative notion of RDF stream can be explored." Edit / rewrite.
* "Choosing to a triple as streamed data element is appropriate..." Edit / rewrite.
* "thus choosing named graph containing a set of triples as streamed data element". Change "graph" to "graphs". Add "a" after "as".
* "done in [8] - that formally" Edit.
* "Alice can either belong Tall or to Short". Add "to" before "Tall".
* Section 4.5: "that embodies a rank-aware algebraic operators." Edit /rewrite.
* "for an attempt see [49]]". Remove one "]".
* Section 4.6: "theories and methods, which exploits". Change to "exploit".
* "allows for efficient answering queries". Add "of" before "queries".
* Section 5.1: "approximated arbitrarily closely". Change to "close" (unsure myself, though).
* Section 6: "a systematic roadmap for processing ... data" Edit. The roadmap does not process data.

References need to be cleaned up thoroughly. Some references have a publisher (e.g. 26), some do not (e.g. 8). Some cite the volume they appear in as a reference within the reference (e.g. 35), some do not (8 and 60, 32 and 40). References 12 and 13 are the same. Reference 49 seems broken (page dbrank?).

Solicited review by David Carral Martinez:

As a Semantic Web (SW) researcher with a focus in reasoning I find the paper both stimulating and interesting. It actually made me want to know more about some of the topics presented.

Section 1 gives an appropriate motivation for the use of orders combined with reasoning techniques. As the paper states, ordered data, time-critical character, and the added value of inferences are requirements shared by a wide range of current applications. Section 2 includes some more detailed examples of these real-world applications that could be significantly improved with ordered reasoning techniques.

I think the investigation space classification presented in Section 3 (Fig. 3) is quite accurate. Both the division of types of orderings and the division of types of reasoning make a lot of sense. I agree that the "non-ordered" categories have already been well studied. Ordered reasoning may be a good step towards a wider use of Semantic Web technologies.

The coverage of software applications presented in Section 4 is really wide. Reading it has helped me to improve my background knowledge about practical applications in the Semantic Web field and their sequential development (though I might not be able to judge this last section very well since I am not very familiar with the information presented).

The paper is also well written and nicely presented (Figures really help!).



This paper was submitted as part of the 'Big Data: Theory and Practice' call.