A Survey on SPARQL Query Relaxation under the Lens of RDF Reification

Tracking #: 3410-4624

Authors: 
Ginwa Fakih
Patricia Serrano-Alvarado

Responsible editor: 
Marta Sabou

Submission type: 
Survey Article
Abstract: 
SPARQL query relaxation has been used to cope with the problem of queries that produce none or insufficient answers. The goal is to modify these queries to be able to produce alternative results close to those expected in the original query. Existing approaches generally relax the query constraints based on logical relaxations through RDFS entailment and RDFS ontologies. Techniques also exist that use the similarity of instances based on resource descriptions. These relaxation approaches defined for SPARQL queries over RDF triples have proved their efficiency. Nevertheless, significant challenges arise for query relaxation techniques in the presence of statement-level annotations, i.e., reification. In this survey, we overview query relaxation works with a particular focus on issues and challenges posed by representative reification models, namely, standard reification, named graphs, n-ary relations, singleton properties, and RDF-Star.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Daniel Hernandez submitted on 17/May/2023
Suggestion:
Major Revision
Review Comment:

This paper surveys query relaxation techniques on SPARQL, and their relation to reification techniques. While the first part seems to be a survey of the research on the area, the part related to reification techniques seems to be a contribution of this survey authors. Hence, this paper could be better presented as two papers, one surveying relaxation techniques that apply to SPARQL, and another about the limitations of current relaxation techniques regarding reification. The combination of these two parts does not convince me.

The main problem of this paper is that it does not elaborate a high-level description of the most relevant dimensions to be considered to understand the surveyed area, but summarizes some selected papers. Later, it describes the characteristics of the selected papers, but does not elaborate on the organization of these characteristics. For instance, Table 6 compares query relaxation approaches. The first five columns
determine when a query relaxes another query. The first three columns consider logical query containment, whereas column four considers statistical and similarity criteria. Columns six and seven, describe ways to reduce the search space. Finally, the last column determines what are the relaxed queries used to obtain the top-k answers. Therefore, this table shows several aspects of the studied approaches, but does not organize these aspects on the different dimensions of the problem: When a query relaxed another query? How do we generate the suitable relaxed queries efficiently? How do we measure the quality of the results? The attributes of this table seem to describe aspects of the studied approached regarding different questions, but are all put together without a high-level conceptualization of the problem that helps readers to understand it.

= Suitability as an introductory text =

A survey must provide a high-level description of the problems, methods, and evaluation techniques of the surveyed area. My main concern is that this survey does not provide this high-level description, but overwhelms the reader with too many low-level details. The only classification provided for the work in the area consists of two classes: methods based on logic entailment and methods that use similarity metrics. Section 4 is thus mostly an enumeration of the selected papers, instead of a high-level description of the different aspects of the work in the area. For each paper, some low-level details are described, as the number of triples, classes and properties in the dataset, and the range given to a parameter. This presentation does not help the reader to answer fundamental questions as, for example: Are the methods solving similar problems? Are the methods and evaluations comparable? What are the SPARQL fragments covered by current research? Instead of presenting the sizes of the evaluation dataset of each reviewed paper, the survey must introduce a section about dataset sizes, commenting how existing works fit on that dimension. It is more useful to understand the dimensions and then the papers that fit on these dimensions than just having a list of papers with some descriptions that can be used to understand these dimensions.

Several concepts are not defined and make this paper not suitable as an introductory text. Some examples are:

1. The paper mentions the SPARQL operator OPTIONAL (Section 1, Section 4.1.1, Section 4.2.1), disjunctions (Section 1 and Section 4.1.1), FILTER (Section 4.2.3, Listing 2). However, the semantics of these SPARQL operators is not described in the background section. The described SPARQL fragment is limited to conjunctive queries (Section 3.2.2).

2. On page 2, lines 28-35, it is introduced the notion of reification. Then some reification schemas are mentioned. I miss cites to help the reader to find information about these reification schemas.

3. The examples for the issues of reification in the introduction (page2, lines 38-51) are difficult to understand without the background that it is introduced latter. I would recommend focusing on only one issue. For instance, for the first issue (lines 38-41). A reader may not understand what is a query, how it is relaxed by replacing :Query_Language with :Programming_Language. It is also not clear what plays the score in this query, since no score is in the query, nor why one wants more answers without relaxing the scores. The rationale of the problem in this example is missing.

= Comprehensiveness =

There are few works presented, whereas there are more papers about query relaxation in the literature. In page 3, lines 33-34, the authors exclude papers [14-16] because they did not address SPARQL queries. However, paper [16] studies relaxation of relational algebra queries of the form σ_C(R) where C is a conjunction of inequalities. Mapping relation R to RDF, this query is a star basic graph pattern followed by FILTER clauses. Hence, paper 16 can also be considered as a paper that addresses SPARQL queries. The question is what we consider as papers that address SPARQL queries. I do not consider that a paper does not address SPARQL just because it explicitly does not use the SPARQL language. There is a large overlap between SPARQL and other query languages. Moreover, the background section of this paper introduces SPARQL queries as conjunctive queries over databases consisting of binary relations (page 6, lines 32-33). If the authors do so, they should consider all the works on relaxing conjunctive queries in the literature.

= Readability =

Some sections are read as an enumeration of summaries of papers that are hardly understood with all its details. The missing of a high-level description of the problems difficult the reading.

There are also several minor typographic and notation issues. The following are examples:

1. There is a typographic inconsistency regarding identification of resources, sometimes symbols are in italics whereas in others are in roman. For example, on page 4, line 40, b occurs in both, italics and roman.

2. Sometimes classes are identified with lower case letters, and instances with letters that are commonly used for variables (e.g., page 4, line 48), whereas in other cases classes are denoted with uppercase (e.g., page 8, line 2).

3. Section 4.1.4 introduces the concept of Minimal Failing Subqueries (MFS) and Section 4.1.5 denotes Minimal Failing Subqueries as MFSes. A reader my wonder if both concepts the same? The text does not clarify that an MFS is a set of triples, and an MFSes is a set of MFS.

= Relevance for the Semantic Web =

This is a relevant problem for the area because of the heterogeneity and openness of the data. However, the survey would be more relevant if it brings works done outside the Semantic Web community, and that are also relevant to the community. As I commented above, the criteria to choose papers seems so narrow.

= Conceptual Issues =

1. On page 23, line 33, it is stated that standard reification relies on identifying triples with blank nodes. This is not true, since also IRIs can identify triples. Indeed, Figure 17 in [34] shows a triple identified by an IRI.

2. On page 18, line 18, the queries considered in [13] are described as having a different query shape, called "federated query". However, the "federated" characteristic of queries in [13] is not regarding the query itself, but about its evaluation. Hence, this attribute should not be described in the Query Shape column group of Table 5.

Review #2
Anonymous submitted on 24/May/2023
Suggestion:
Major Revision
Review Comment:

# Summary

The reviewed paper is a survey on existing approaches for SPARQL query
relaxation, with a focus on the impact of RDF reification. SPARQL
query relaxation addresses the important problem of user queries that
return none or insufficient answers. This is a common problem when
SPARQL queries are composed manually because it is difficult for a
user to have a detailed knowledge of an RDF dataset and its schemas.

Reification is another important problem related to knowledge
representation in RDF, and hence to SPARQL querying. Because no
standard for reification has emerged yet, reification is generally
ignored in solutions to other problems, like query relaxation. The
most original part of the survey is precisely to analyse the impact of
reification when applying query relaxation approaches.

The survey is structured as follows:
- Section 2: methodology to select the papers covered by the survey, resulting in a selection of 13 approaches in 12 research papers
- Section 3: background that explains the main notions found through the different approaches
- Section 4: decription of the existing approaches in term of the notions explained in Section 3
- Section 5: multi-faceted comparison of existing approaches
- Section 6: focus on RDF reification, presenting the different reification models, comparing them, and finally analyzing their impact on query relaxation approaches
- Section 7: a short section on challenges and open issues related to reification

# (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The paper does a good job at explaining the key notions involved in
query relaxation. It should be accessible to anyone with a basic
knowledge of RDF, RDFS, and SPARQL.

It provides a unique resource to get a global and up-to-date overview
on SPARQL query relaxations approaches, along with their relative
strengths and weaknesses.

# (2) How comprehensive and how balanced is the presentation and coverage.

To the best of my knowledge, the set of presented approaches is
comprehensive, and those approaches are presented and compared in a
fair and accurate way.

In particular, two main categories of approaches are included, those
based on ontologies and entailment rules, and those based on
similarity of instances. It could have been tempting to consider only
the former, and considering the two makes the survey much more
interesting and valuable.

The survey is also comprehensive on the reification models, including
the most recent one, RDF-star. Although it is not yet a W3C
recommendation, it is the most promising method and its adoption is
rising fast.

# (3) Readability and clarity of the presentation.

The document is well written, and well illustrated with many concrete
examples. I particularly appreciated Section 3 (background), and
Section 5 (comparison). Section 2 (survey methodology) is also fine.

Section 4 has a lot of contents, an effort has been made to give an
accurate presentation of the different approaches without diving too
much in the details. However, it is organized as a long list of
approaches, and each approach is defined in a textual form that is
often difficult to follow. Many terms are put in italics to indicate
that they are technical terms coming from the surveyed papers but they
are often not defined. Here are examples of unclear contents:

- rules applied in reverse order
- matchings that are not matchings of other relaxed triple patterns of t'
- k-relevant approximate answers
- BR studies the problem of relaxing queries as a batch-based process... (paragraph)
- Section 4.2.2: not clear what is compared: the entities in the user query with other entities? the candidate approximate answers with the query constraints?

I would suggest to focus on the key intuitions that make each approach
interesting, to use formal notations when they are clearer than
convoluted sentences (e.g.: matchings that are not matchings of other
relaxed triple patterns of t'), and to use examples to pinpoint key
aspects of each approach (in contrast with other approaches).

Section 5 and its synthetic tables are really interesing. The
organization of comparison criteria make sense and are informative.

Section 6 is interesting but unsatisfactory in terms of organisation,
and contents at some points. First, the paper title says "under the
lens of RDF reification" but reification only comes at the end, when
most of the survey has been done. The section is mostly a micro-survey
on RDF reification, and then one page about its impact on query
relaxation. It would make more sense to have a subsection on
reification in Section 3 (background), and to add reification as a
comparison facet in Section 5 (comparison). The presentation of
reification could also be made shorter by merging comments on RDF
descriptions and SPARQL graph patterns, as they are the same.

Second, although it is true that query relaxation approaches have not
considered reification, 3 reification methods (out of 5) rely on
standard RDF so the approaches can be applied to such reified data. The
argumentation in Section 6.4 is a bit quick in saying that they don't
work or work badly.

- Query size: the increase is not that big, and some approaches have been shown to work on large queries.
- Syntax support: the extension of relaxation rules seems rather immediate for named graphs (from triples to quadruples), and for RDF-star (in <> q v, relax s, p, o, q, and v), even if such relaxations may not be optimal
- Relaxation over annotations: the argument focus on values, which are in my view an orthogonal concern. No existing approach seem to consider values as objects in triples. Here, two concerns are mixed: literal values and reification (metadata annotations).
For me, the problem is rather that reification methods rely on some shapes that can be broken by the relaxation rules.
- Metadata type. Same as above, a difference concern.
- Dataset size: I don't agree that ontology-based approaches are not impacted. They are not impacted in the relaxation process, but they are in the evaluation of relaxed queries, which is actually the bottleneck of those approaches in term of efficiency.

The impact of reification on query relaxation should be studied more
thoroughly. One should ask: what if I apply approach X on reification
model Y? Ideally, this should be studied experimentally. If this is
hardly possible, for lack of source code for instance, a more rigorous
reasoning should be conducted.

# (4) Importance of the covered material to the broader Semantic Web community.

As already discussed in the summary above, the survey cover two
important topics of general interest to the Semantic Web community:
SPARQL query relaxation and RDF reification.

# Minor comments

- Section 3.1: the rules in Table 2 are actually clearer than their
paraphrasing in the text. Maybe the text could be made more
informal, to convey the intuitive meaning of rules, and explain the
more technical aspects of Table 2.

- p6, line 27: avoid the line return for the 3rd case

- Section 3.2.2: the definition for query excludes UNION and
OPTIONAL. Is this intended? Are there not any approach dealing with
them ?

- Section 3.2.2: triple deletion does not appear as a possible
relaxation. Is this equivalent to replacing all terms in a triple by
a variable ? I don't think so. Some of the presented approaches do
perform triple deletion so maybe this should be presented in this
section as a possible relaxation.

- p7, line 48: Sim(C,C') is only when C' is a super-class of C, right?
This should be made explicit.

- p8, lines 4, 14: it seems to me that the conditions "all the
isntances belong to the superclass C'/superproperty P'" are not
required.

- Section 3.3.2: provide a justification for the definition of
Sim(tp,tp'), why an average and not a simple sum as the components
are information content ?

- Section 3.3.3: same, justify the definition of Sim(Q,Q'), why a
product here and not a sum

- p9, line 14: are builT from

- p14, line 29: the short DEPTH

- p17, 26: Section 5 --> Section 5.1

- Section 5.1: comparision --> comparison (several times)
hierarchie --> hierarchy

- p20, Lattice pruning: what about approaches that explore the lattice
from the more general to the more specific. Don't they perform some
significant pruning by stopping traversal as soon as a query with no
(new) answers is encountered?

- Section 6.1: to avoid introducing literals, which are not covered in
previous sections, I would recommend choosing another running
example where annotation values are actually URIs. An example could
be an actor playing in some movie under some role (and possibly
using some language). This would simplify the discourse, and avoid
the filters in the queries.

- Table 8: columns #Triple and # triple patterns can be merged, the
only unsignificant difference being the omission of the type in the
N-ary model (the same could be done for the standard reification
model). Then, c+n expressions should be given, rather than a fixed
value for 1 annotation. Then for named graphs, it should be "1+n
quads" to emphasize that there is no additional "triple" but an
additional term in each "triple". For RDF-star, I don't think it is
correct to say "1 triple". The genuine representation is

s p o {| q1 v1; q2 v2 |}.

which stands for 3 assertions

s p o.
<> q1 v1.
<> q2 v2.

So for me it's still "1+n assertions/triples"

About the overhead variables, I count 1 for N-ary, and 2 for named graphs.

- p27, line 36: rectagnular --> rectangular

- p28, line 44: singificant --> significant

- p29, line 5: meta-metadata --> metadata ?

Review #3
By Louise Parkin submitted on 03/Jul/2023
Suggestion:
Major Revision
Review Comment:

This paper presents a survey of relaxation methods for SPARQL queries, particularly focusing of reification. Regarding the survey evaluation criteria:
(1) The area of query relaxation is covered thoroughly. A lack of examples to explain relaxation methods makes the paper challenging as an introduction to the field. Only reification is presented as an open question, no other challenges for future work are detailed.
(2) The literature review is comprehensive and balanced.
(3) The paper is well written overall.
(4) The subject is relevant to the field of semantic web, and no survey has previously tackled the area of query relaxation. However, the contribution related to reification is not convincing.
Due to the insufficiently convincing contribution regarding the reification aspect, I recommend a major revision.

Detailed review
The interaction between reification and relaxation is presented as a major contribution of the paper. Yet the lens of reification is introduced too late and the paper reads like a survey of SPARQL query relaxation followed by a survey of reification techniques. The statement (in the conclusion) “Query relaxation approaches have been proposed to solve this problem but none of them is appropriate for metadata triple-patterns.” is insufficiently motivated. Indeed, in section 6.4 Syntax support, it is stated that “Queries for standard reification, n-ary relations and singleton properties can be relaxed with existing relaxation approaches.” which I understand as, the existing methods can be applied to reified data, and a relaxation will be performed. In that case, the possible issues are related to performance (query size, and dataset size sections) or to the quality of proposed relaxations (relaxation over annotations, metadata type sections). We are supposed to trust that performance would be significantly worse (this is not experimentally supported) or that relaxations would be nonsensical (no illustration is provided), to the extent of rendering existing methods unusable.
Suggested modifications :
- Show for each existing relaxation method and reification model, which combinations are technically applicable, or could be applicable with minimal modification to the algorithms.
- Implement at least one existing relaxation method for which source code is available on two datasets and queries, one reified and one not, in order to show the performance difference. Ideally, this experimental comparison would be performed with multiple query sizes and dataset sizes.
- Provide for an example reified query an adequate relaxation, and compare to the relaxations that would be proposed by the existing methods. This could show that standard relaxation techniques produce either nonsensical relaxations, or at least that they do not favor the most useful relaxations possible.
Finally, regarding reification, Dellal et al. compare standard reification and named graphs in the context of empty answers in uncertain knowledge bases. There, the only metadata provided is a trust score related to each triple. That work, which is an extension of [10] seems relevant to the discussion of relaxation and reification.

Examples should be added to improve the clarity of the presentation. In particular, examples illustrating the notions of similarity and what relaxation looks like for real queries would be useful, along with examples of how the presented approaches would handle reification.
For a survey paper, I expect further analysis of open questions in this area. Analysis of the papers surveyed shows that relaxation of non-conjunctive SPARQL queries is also missing from existing approaches. Wang et al. discuss the issue in the context of the missing answer problem, but to my knowledge, this has not been applied to empty answers. Furthermore, as the survey aims to provide an overview of relaxation techniques dealing with empty or insufficient answers, a comparison of which techniques deal with empty answers and which deal with insufficient answers is lacking. In fact, most papers cited deal with empty answers, and would require adaptation for insufficient answers. This challenge could also be discussed.

Minor remarks
3.2.1 RDF supports blank nodes for subject and object, why do you not include them in your formalization?
3.3.1 You state that the similarity between classes is undefined when all the instances belong to the superclass C', and none to C, i.e. Pr(C)=0 and thus -log Pr(C')=undefined. I think that this statement is incorrect, as the number of instances in the superclass C’ is not relevant. Rather, if no instances belong to C, then Pr(C)=0 and thus –log Pr(C) is undefined. Likewise for the similarity between properties.
4.1.4 “Intuitively, a relaxed query Q’ of Q fails if and only if at least one MFS has not been repaired in Q’”. This is not the case. It is true that a relaxed query Q’ fails if at least one MFS has not been repaired. However, the opposite is not true. Indeed, all MFS can be relaxed but the relaxed query can still produce no answers. In the following line, it is even stated that a repaired query can produce no results.
5.3 Why not reproduce the experiments for the three papers where source code exists? Could a comparison between these three works be performed? Has there been an attempt to contact authors of cited papers in order to access additional source code? Further experimental comparison would be a significant improvement to the paper. However, due to the paper already being long, I understand that this benchmarking be left for future work.

Possible spelling / grammar issues
4.1.4 Foukou et al. -> Fokou et al.
4.2 Upcoming works means future works, i.e. not yet published, I think this isn’t what you mean
4.2.3 this query may consists in -> this query may consist in
5.2 ontology hierarchie -> ontology hierarchy
5.2 Only the contribution proposed in [2, 3, 5] mention -> Only the contribution proposed in [2, 3, 5] mentions
5.2 Second column -> The second column (same for third, fourth, fifth, sixth)
5.2 This process is relevant because performance reasons. -> This process is relevant for performance reasons.
5.2 Lazy join algorithm -> The lazy join algorithm
6 the reification methods -> reification methods
6 Finally, Section 6.4 sheds the lights -> Finally, section 6.4 sheds light
6.1 To shed the lights on -> To shed light on
6.1.5 Rectagnular box -> rectangular
7 relaxing metadata triple-patterns has not the same goal -> relaxing metadata triple-patterns does not have the same goal

Wang, M., Liu, J., Wei, B., Yao, S., Zeng, H., & Shi, L. (2019). Answering why-not questions on SPARQL queries. Knowledge and Information Systems, 58, 169-208.
Dellal, I., Jean, S., Hadjali, A., Chardin, B., & Baron, M. (2019). Query answering over uncertain RDF knowledge bases: explain and obviate unsuccessful query results. Knowledge and Information Systems, 61(3), 1633-1665.