Semantic answer graphs for keyword queries on RDF/RDFS graphs

Tracking #: 1304-2516

Parthasarathy Krishnaswamy
P Sreenivasa Kumar
Dominic Damien

Responsible editor: 
Thomas Lukasiewicz

Submission type: 
Full Paper
Keyword search is an easy way to allow inexperienced users to query an information system. It does not need the knowledge of a specific query language or underlying schema. Recently answering keyword queries on semantic data repositories has emerged as an important research topic. In particular, many efforts focus on RDF date due to wide spread adoption of RDF for the representation of both open web and enterprise data. RDF and RDFS are popular frameworks for representing data and meta-data for domain knowledge. In this article, we focus on keyword-based querying of RDF data. Current techniques adopted for supporting keyword queries on RDF data suffer from drawbacks such as inefficient path exploration, inability to exploit semantic characteristics and restriction on search within a distance neighbourhood. We highlight the drawbacks of the existing systems with different examples. We present a generic approach which adopts a pruned exploration mechanism, where component sub-graphs are formed using closely related nodes, pruned and joined using suitable hook nodes. The component sub-graphs are enlarged selectively depending upon the relative closeness of the keywords. Semantic property class/subClassOf of the RDF graph is our approach during different phases of the algorithm. A new indexing and ranking mechanism that exploits semantic relationships is also proposed. The working of the algorithm is illustrated using AIFB institute data represented as an RDF graph on keyword queries with different characteristics. The experimental results on other sets namely DBLP and LUBM and a comparative analysis with other approaches are presented. We find that with the help of techniques proposed in this paper, we can give accurate and relevant answer graphs for several keyword queries where existing techniques fail. This is mainly due to the fact that the proposed techniques has no restriction on mappings and also exploit the semantic relationships from the underlying RDF/RDFS dataset.
Full PDF Version: 


Solicited Reviews:
Click to Expand/Collapse
Review #1
By Christina Unger submitted on 14/Mar/2016
Major Revision
Review Comment:

The paper presents an algorithm for mapping keyword queries to RDF graphs, which includes a few improvements over existing approaches, especially over [14] and [15].

I'm not entirely sure about the claimed contributions though. The paper starts saying that existing approaches do not "exploit semantic characteristics of the RDF graph", although [14], for example, also exploits the types of edges (in particular type and subclass edges). Also indexing is mentioned as a challenge, but there are approaches (e.g. Le et al., cf. using Hexastore-like indices for efficiency. At another point it is mentioned that "pruning unwanted nodes [...] is the fundamental essence of the approach adopted", but pruning is a part of a lot of other approaches as well.

In addition, it should be stated in which respect the paper goes beyond earlier publications of the authors.

The approach in general seems valid and sound; however, the presentation does not yet meet the standards of the Semantic Web journal. A lot is unclear or hard to follow; I include some more detailed comments below. For example, on several occasions the presented approach is called a "hybrid" approach, but in which sense is it hybrid? And with respect to grammar and spelling, the paper is in pretty bad shape. Most striking is "Philip Camiano" occuring throughout the paper.

Regarding evaluation, the algorithm was tested on datasets with very small schemas, so I wonder how well it would scale to datasets like DBpedia, where the schema is considerably bigger (and noisier), giving rise to a lot of ambiguity. How well could the approach handle ambiguities and the resulting huge amounts of potential search queries?

In the following some more detailed comments:

Definition 1:

* Slightly longer names for nodes (e.g. Class Nodes, Entity Nodes, Literal Nodes) and edges would help. For example, the description of "Component Subgraph Creation" is really hard to read, as one has to keep in mind what EN, EA, IE, etc. stand for.
* And I wonder why you need different kinds of edges. Do you ever use this distinction for anything? It seems easier to me to just talk of edges between nodes, which would make the labeling function much easier as well as a lot of descriptions in the text.

Figure 1: The arrows for the property "year" should be the other way around, I guess?

Page 3:

* The following is not clear: "For the query{publication,1999}, [15] does not function as it deals with RDF graphs with only data nodes."

Page 4:

* It is not clear to me how the summarization technique loses "too much information". What is wrong with building a graph that captures the information need also when there are no answers?

Page 5:

* Can a relationship chain include inverse relations?
* In the definition of a class chain, shouldn't D-node and C-node be swapped? (CC_n is a C-node if RC_n is an IE-edge, and a D-node it it is an EA-edge.) Also the definition is missing how you actually get CC_n.
* SQ is a pair, not a triple.
* You define answer graphs as subgraphs s.t. every keyword is mapped to a node or edge. What is the reason for this? For example, if you have a query like "number of people living in Boston", it might be necessary to map "living in" to a property "population", while "people" does not map to anything.

4. Algorithm Description:

* The algorithms miss a caption and thus numbering, so it's not immediately obvious what Algorithm 1, 2 and so is.
* Please add a short prose explanation of the create-answer-graph procedure.
* A few parts are left undefined. For example, Algorithm 1 uses a compute-subgraph-sets-for-pruning and a compute-subgraph method, which is neither defined nor explained. Also, in Algorithm 3 you use a tilde+slash as relation between g and h -- from the text it's clear what is meant, but in the pseudo code it should either be defined or replaced by prose. Also, for the definition of ACNS you use notation that is not intuitive enough to not require any mention.
* In general, your abbreviations are not very mnemonic, e.g. ICNS, ACNS, FNCS, PNCS in Algorithm 3, which makes it harder to follow than necessary. Please consider something like CNodes_active, CNodes_pruned, etc. or similar.
* During Subgraph Enlargement you add half of the nodes to one graph and half of them to the other; why is that? Why not add all to one?
* In the hook-graphs method, you say CS_new = merge{g,h} and PRGS = merge{g,h} and PRGS = PRGS union CS_new, so PRGS = merge{g,h} union merge{g,h}? Also, I missed a definition of "lowest node cardinality" and "minimum relationship chain". (The latter probably just means the shortest one?)

Page 9:

* You say "where chnl = ... + 1", but the formula also contains chnl + 1. I guess one "+ 1" is too much.
* The score TR is only explained as "standard IR approach"; please add at least a reference, even better a brief definition or explanation.
* The node relevance is a fixed value?


* You always talk about RDF fragments used for the examples -- I guess this is only for presentation purposes? Or would the component subgraphs look differently when using the whole dataset?

The organization of section 8 is not optimal, in my eyes. For example 8.3 I would put into the architecture description, while the second paragraph of 8.4 seems to belong to implementation, not evaluation. I would even separate a section on design and implementation from a section on experiments and evaluation.

Page 16:

* In the query, the first filter regex should be ".*Philipp.*" not ".*XMedia.*".

Review #2
Anonymous submitted on 15/Mar/2016
Major Revision
Review Comment:

In the paper "semantic answer graphs for keyword queries on RDF/RDFS graphs" the authors present their approach to extract sub-answer graphs from given RDF data.
The authors should definitely check the spelling and punctuation in the paper, there are many totally unnecessary errors.
Also, some times the reference to a Figure/Graph is missing, for example on page 12 "For the query, the relationship chain as shown in ?? is obtained"

The overall presentation of the paper is bad. While the idea, which is presented, may be great, the authors fail to present it in an easily comprehensible way.
The paper would definitely benefit from a better introduction with at least a clear formal definition, as well as a practical example.
In the algorithm section, I would propose a different structure of the content. For example in the subsection "Component Subgraph Creation",
here it would be great if the authors first state the input and output and then describe the Algorithm while pointing to the exact lines in it.
E.g "If the elements is associated with a C-Node (see line xyz)...".
Please also reference to the different Algorithms in the text and add captions to each Algorithm.
The use of many listings decreases the reading flow of the paper.
In general, all chapters need a major revision.

Review #3
Anonymous submitted on 25/Mar/2016
Review Comment:

This paper addresses the problem of answering keyword queries over RDF data. The authors claim that the proposed approach is more effective than the state-of-the-art approaches. Unfortunately, it is very difficult to appreciate the contribution since the presentation is very poor. It is full of typos (see below for some of them) and of awkward sentences and it lacks formalization. I think that, in its current form, the paper does not meet the standards of SWJ.
Typos and (some) specific comments:
1. Abstract: “AbstractKeyword”
1. Abstract: “Semantic property class/subClassOf of the RDF graph is our approach during different phases of the algorithm”.
2. Abstract: “This is mainly due to the fact that the proposed techniques has no restriction” -> technique
3. Introduction: “In these class of applications” -> classes or this
4. Introduction: “Existing approaches for keyword queries on RDF data… does not exploit” -> do
5. Line 4 of Create-Answer-Graph: function Compute-Subgraph-Sets-For-Prunning invoked there does not exist.
6. Line 12 of Subgraphs: function compute-subgraph does not exist.
7. Which functions invokes Create-Component-Subgraph? I assume that it should be named Compute-Subgraph-Sets-For-Prunning.
8. Why do lines in function Subgraphs start with number 10?
9. The example of similar node is unclear: why do you first mention Techreport and then add Journal to the set?
10. The definitions of C-Node set and active C-Node set are unclear. Specifically, {CCi,CCj} denotes, on the basis of what said in the Pruning paragraph, a ‘pairwise component subgraph’, and not a (set of?) similar node.