Semantic Object Mapping using SHACL Validated Resource Graphs

Tracking #: 2656-3870

Authors: 
Sarah Bensberg
Marius Politze

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Full Paper
Abstract: 
Metadata describing research data is a core instrument for compliance with the FAIR Guiding Principles. RDF is very suitable for this purpose because its schema-lessness provides flexibility for changes and different disciplines. In order to follow defined metadata schemas, RDF needs to be restricted by application profiles. SHACL allows validating whether an RDF based metadata matches a certain shape and meets minimum quality requirements. Described research data hence becomes resources in a validated knowledge graph. Searchability is achieved by SPARQL, which, however, requires considerable technical knowledge. The presented semantic object mapping of validated resource graphs into an JSON object is an approach which is less dependent on the users' background. The evaluation was performed based on precision, recall and response times using Elasticsearch as a search engine on the mapped object in comparison to generated SPARQL queries. The results show that with the transformation of RDF based (meta-)data into a search index using application profiles and inference rules, a solution was found that is equivalent in these terms. Through the integration of the developed mapping in the data management platform Coscine, a search of the research data is possible which at the same time promotes the subsequent use.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Ben De Meester submitted on 09/Feb/2021
Suggestion:
Reject
Review Comment:

My review is in two parts: (i) general, high-level remarks, and (ii) detailed remarks per section (if relevant)

Comments when reviewing the criteria:
- originality: after reviewing the related work, it is not clear what this work provides that is more or different, except for a different choice of technology (i.e., SHACL) to hint a test index
- significance of the results: as I do not feel the evaluation is adequate, it is hard to correctly estimate the significance of these results
- quality of writing: there are many, in my opinion, irrelevant pieces of text, and argumentation is often lacking

- The motivation is lacking. Much space is spent on (in my opinion) irrelevant motivations, but no reasoning on why a search index, why Elasticsearch, why structured search is given. I would suggest replacing your motivation with a stronger one
- There are multiple cases where, in my opinion, an irrelevant discussion is added (e.g., the argumentation of "why coscine", the discussion of "OWA/CWA" and "ontologies" in section 2). This makes the paper unnecessarily long and moves the attention away from the actual content. I would suggest killing your darlings so that the journal is much more to the point and thus brings a stronger message. Especially the Background section is too long for me.
- The evaluation is severely lacking, not only due to its irreproducibility. In the related work, there is no hint that a performance comparison to SPARQL is the right kind of evaluation. I would expect at least a functional comparison between the state of the art and your approach: in section 6 you claim certain functional differences between related work and your approach, but they are not proven in the paper.
- It is very much unclear what the purpose of the SHACL shapes is. The paper gives the feeling that it is for _both_ validation and search indexing, but I am not convinced all shapes can be used for both purposes. If not, this limits the solution as specific shapes need to be made for the indexes. That is not necessarily a dealbreaker, but it should be clarified.
- In general, I think this paper would be much better if the authors more clearly identified the potential weaknesses of their approach throughout the paper, and provided a more complete related work section and a more relevant evaluation

Feel free to contact me if you would have any questions or remarks on my feedback :).

ben (dot) demeester (at) ugent (dot) be

#### 1. Introduction

- I do not believe [11] is an acceptable citation to the claim "access to research data and related services is usually limited to one project or organization".
- "This results from the different research data infrastructures used by researchers and consisting different services" --> "consisting of" or "comprising"
- In general, the paragraph that starts with "An important aspect that is relevant for the implementation of appropriate solutions [...]" is hard to follow. It is not clear what the argument is, nor what the arguments are.
- I am unsure about the value of [12] for the claim "the added value of RDM is often indirect, so active integration into research processes is crucial". I have the feeling [12] rather argues that "Tailoring an IT infrastructure to the specific requirements of certain disciplines is crucial for their short-term success". This could probably be solved by rephrasing.
- Page 2 is argumentation for Coscine, not for the paper at hand. I don't see the added value of this page.
- "This meets the requirements of a structured search as well as advanced search types, e.g., range queries" --> It is not argued why this is important
- "The crucial question is how to transform the Coscine knowledge graph into a search index" --> The motivation section in its current form does not argue why a search index is needed.
- "This is especially important because it is to be benefited the from structure and inference rules, i.e., from the advantages of the RDF model." --> This sentence is unclear: are words missing?
- In general, I am missing the relation with non-semantic web applications: why does the author not discuss relevance with, e.g., SQL applications? How is it solved there? (especially since your approach does not seem to take advantage of the graph)
- "which is relies on" -> "which relies on"

#### 2. Background

- "The use of different metadata schemas is possible" --> you didn't clarify the connection between application profile and metadata schema: is an application profile a type of metadata schema?
- "Along with the Closed World Assumption (CWA) a system with complete access to all information about a subject exists" --> please rephrase
- concerning CWA vs OWA: it is not specified what you assume in your application, I think you assume CWA, as you use application profiles (apparently this comes back in the conclusion). I would have appreciated it if this was clarified in advance. Right now, this section is close to irrelevant.
- "Ontologies only allow extracting conclusions about existing data" --> Given you include this discussion (and example), I would have appreciated the mentioning of related work where ontologies _are_ used as restriction languages, such as [Motik2009] and [Tao2010]. And then of course a small argumentation why those are out of scope :). Or you don't discuss ontologies (also given they are further no longer discussed in the paper) and make the paper more succinct.
- "The result is a system of semi-structured data due to the fact that on the one hand there are certain (structured) fields and on the other hand there are different SHACL application profiles with different fields, vocabularies, and structures (unstructured)" --> I don't understand this argumentation. Especially the "structures (unstructured)" part needs explaining.
- I don't understand why the description of the automatically generated forms (so roughly page 7) is relevant for this paper.

#### 3. Resource Object Mapping

- It is unclear to me whether the shapes in your example are part of your validation shapes set or not. I have the feeling they are separate, no? Otherwise, the shape of example 8 would require that every person should have a label that is either its givenName, familyName, organizationName, or UnitName. Could you clarify whether, in this case, having your UnitName as your label counts as validated data?
- "As can be seen from the three examples, SHACL rules can be used to map literals and structures of any complexity." --> _any_ complexity? Can you be more specific on how you have measured this "any"?
- All the different examples belong in an appendix in my opinion. I would prefer the exact complexity description and a short discussion of what this complexity entails. And if you cannot give this exact description, I'd like to read why not (and that's not necessarily a problem).
- 3.1 and 3.2 are a description of SHACL rules. I do not see the added benefit of including these in this paper (yet). You should argue at the beginning of 3. why you discuss specifically these two types of rules.
- Can inverse relations be taken into account? What about blank nodes? Cyclic relations? Can a part of the URL be used as a value in the JSON object (e.g. http://example.com/city/Paris -> "Paris")? How can you distinguish between a URI mapped to a string, and the same string (and is that important)? Instead of discussing 5 examples at length, I'd prefer a table of which functionalities are supported, and which are not. Also, an important research question is: which RDF functionalities do you want to map? I'm afraid I didn't find an answer to that question (yet).
- What problems arise because of the mapping not being injective? Can you mitigate, or is it not that big of a deal? I would like to see a small discussion on the impact of this.
- "the same properties are defined in different SHACL shapes with the same data type or are described using instances of one class" --> if this is not the case, don't you have the problem then that your SHACL shapes are inconsistent and your validation will always fail (e.g., if two shapes describe an object from path `ex:path`, one says it's a datatype string, and the other a datatype boolean, the data will never be valid because at least one of the shapes will always fail). This strengthens my first point of this section: are the shapes in your example part of your validation shapes set or not?
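
The questions above about using part of a URL as a JSON value, and about the mapping not being injective, can be made concrete with a small sketch. It is an illustration only: `local_name` is a hypothetical helper, not part of the reviewed approach, and the example IRIs are assumptions.

```python
from urllib.parse import urlparse, unquote


def local_name(iri: str) -> str:
    """Return the fragment of an IRI, or else its last path segment."""
    parsed = urlparse(iri)
    if parsed.fragment:
        return unquote(parsed.fragment)
    # Ignore empty segments so ".../Paris/" still yields "Paris".
    segments = [s for s in parsed.path.split("/") if s]
    return unquote(segments[-1]) if segments else iri


# Two distinct (hypothetical) resources collapse to the same string,
# so a mapping from IRI to JSON value built this way is not injective.
city = local_name("http://example.com/city/Paris")
person = local_name("http://example.com/person/Paris")
```

Once both values are stored as plain strings in the index, a search for "Paris" can no longer tell the city from the person unless the originating property or IRI is kept alongside the value.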

#### 4. Evaluation

- "The following results are based on the assumptions made in the context of the paper, the presented test setup, implementation, and the generated sample data." --> I don't understand what you are trying to say with this sentence.
- How do you get the list of search intentions? What guarantee can you give us that this is representative?
- "Also, it is assumed that the user searches for the correct words, i.e. those that are present in the text and not for variants or synonyms of them." --> Why? One of your main motivations for having a search index was "the handling of grammatical differences or spelling" (section 3.2)
- "This is not shown in the example queries or data but would be a disadvantage" --> why?
- "The resource object mapping reaches the same values as generated SPARQL queries and is therefore not inferior." --> I don't see proof of that. Don't you then need the confusion matrices from the SPARQL queries as well?
- Please clarify what you mean by Guideline. I assume this is the baseline with the SPARQL queries? It is however not clarified.
- "Table 6 shows the factor between the respective values of the average times from the tables 4 and 5." --> Could you please discuss what this means? How is this benchmarking your method? It feels like benchmarking the SPARQL endpoint and ES, instead.
- "it is clearly visible that creating, updating and deleting is faster if the other metadata records are not affected" --> This is an interesting conclusion.
- I understand it is not necessarily needed to publish any source code, but I would have appreciated some hints at the reproducibility of your evaluation, i.e., at least the queries and results published in a (semi-)structured format. Right now, I would rate your reproducibility very low.

#### 5. Related Work

- "This is done with an ID describes the entity" --> A word is missing
- "However, the model serves only as a structure" --> What does this mean? How is this a disadvantage compared to your approach, why can't you reuse Delbru's prior work?
- "However, it is outdated, since ES no longer supports the use of different types in a search index [53]" -> Well, true, but your reference also mentions several alternatives that could be used. Why are those insufficient?
- "URIs would be indexed for which the users do not search" -> Is that so? How do you know?
- "In this approach the associated labels of the RDF annotation to URIs for indexing data in Solr are stored" --> This makes me wonder why -- in your related work -- there is no clear section on search engine platforms and how you came to the conclusion to use ES instead of Solr for this use case.
- "the possibility to use ES to benefit from advanced search types like value range queries" --> But you didn't showcase range queries, did you? Only queries 'bigger than/smaller than', or is that also deemed a range query? In any case, more clarification is needed on which ES functionalities you did and did not use, and how these compare to the alternatives.
- "In the context of Coscine, the domain is not known in advance or can change by further application profiles and application areas. Thus, such an approach is not suitable" -> But, in your case, SHACL shapes must be created for the search indexes. How is that effort-wise different from weighing the paths?
- The related work section gives me the feeling that a flexible reasoner to generate an index-graph + SPARQL endpoint would support all your use cases (since you don't compare fuzzy matching).
- Why is there no comparison with all these SPARQL endpoints that also provide search indices? E.g., GraphDB supports full-text search https://graphdb.ontotext.com/documentation/free/full-text-search.html: why is this not sufficient? I assume for the needed inferences?

#### 6. Conclusion and Future Work

- "This means that all knowledge or structures must be aware of and considered in advance" -> please rephrase
- "the provided search syntax of ES is very simple and intuitive and no further knowledge is required" --> I would argue that the SPARQL syntax is very simple (for people experienced in SQL), and that for both, knowledge on the data structure is required.
- In general, this section does provide a lot of answers to questions I had when reading the paper. I would appreciate if the general tone of clarifying the downsides of the proposed solution was used throughout the paper.
- I understand that you could not do a user evaluation (and personally, I don't think it's absolutely necessary to prove your point, you don't want to evaluate ES, you want to evaluate an approach), but your closing remarks make it clear that a functional comparison with the state of the art is very much needed.

[Motik2009]: Boris Motik, Ian Horrocks, and Ulrike Sattler. Bridging the gap between OWL and relational databases. Journal of Web Semantics, 7(2):74–89, April 2009. DOI:10.1016/j.websem.2009.02.001.
[Tao2010]: Jiao Tao, Evren Sirin, Jie Bao, and Deborah L. McGuinness. Integrity Constraints in OWL. In Maria Fox and David Poole, editors, Proceedings of the 24th AAAI Conference on Artificial Intelligence, pages 1443–1448, Menlo Park, California, July 2010. AAAI Press.

Review #2
By Ognjen Savkovic submitted on 11/Feb/2021
Suggestion:
Major Revision
Review Comment:

# Summary

In this work, the authors address the problem of search and retrieval of scientific data. In particular, they assume scientific meta information (information about data) to be in RDF format. They then validate and enrich this RDF data with SHACL rules and transform the data via mappings into JSON format. The JSON format allows for easier search using the existing Elasticsearch engine, which offers more intuitive search as well as ranking of the search results, as opposed to performing such a search with SPARQL. At the end, the authors evaluate their proposal by comparing it to SPARQL queries.

Searching and querying RDF data is probably a difficult problem for non-SPARQL experts, so overall the problem addressed in this work makes sense. Still, I believe that 1) the quality of writing and 2) the originality of the solution are not at an acceptable stage.

In the following, I will first discuss a general critique of the work and then pin-point smaller remarks on the language and terminology.

# Main Comments

M1. Quality of writing.
The presentation of the work is very difficult to follow; for me, it was only possible to understand the problem addressed in this paper by reading the conclusion (I believe the conclusion could be moved to the introduction). First, the abstract is already unclear. Terms are often introduced without being explained beforehand, e.g., "searchability", "resource", "semantic object mapping", etc. Many of these terms could be interpreted in several ways (see comments below). Then Sec. 1, and especially Sec. 1.1, introduce many issues that are non-crucial in my view, and at this point it is still not clear what problem the paper is addressing.

I believe the paper can start with some motivation and context (e.g., one longer paragraph like the one on page 3, lines 1-25), but then there should be a small illustrative example of the problem. Since the problem being solved is very concrete, this should not be that hard to do.

Immediately after, the paper should introduce the contributions (like the ones in the conclusion), e.g., one paragraph/bullet per contribution, so that it is clear what the paper is about.

Overall, the paper should be significantly rewritten and re-arranged to improve the presentation and to reduce non-important parts, as well as enriched with to-the-point examples.

M2. Originality and significance of the results
Scientific originality is rather simple from a theoretical point of view. It is more interesting from an engineering point of view, and this is probably the strongest contribution of the paper: doing the work of combining SHACL, mappings and the Elasticsearch engine. Still, all those components are standard tooling in the Semantic Web, so it is not clear whether this can be considered a sufficiently original scientific contribution. Lastly, the evaluation does not favor the proposed solution, at least the authors did manage to show a delta there.

Also, there are some other unclear issues that should be better discussed in the paper:

M3 Defining mappings may hide SPARQL complexity, but it introduces mapping complexity. In particular, somebody would need to maintain such mappings, which just hides the actual problem that a data engineer is constantly needed in the loop.

M4 It is not clear why we need SHACL inference rules (also, why is the data not labeled in the first place?). In particular, such data enrichment can be done via SPARQL as well, and in fact, the definition would be very similar and easier to execute (one engine less). The right question is: are there any advantages to doing such inference in real time, or can it be done offline?

M5 When comparing SPARQL and Elasticsearch queries, the Elasticsearch queries indeed look simpler, but they are very specific to the platform, and I think they look more ambiguous than their SPARQL counterparts. Also, I am not sure that your argument for using Elasticsearch matches the testing shown. I think what you want to show is that the most intuitive answers are ranked first, which simplifies search over many objects, instead of testing response time, precision and recall.

M6 Another point is that this kind of engine makes more sense if the data is text (at least longer text) with poor structure. Here it seems that the RDF data comes with solid structure. At least the argument for Elasticsearch should be better shown in the examples (in the text it was better explained).

M7 Lastly, transforming to JSON means that we create additional data that requires management and storage. What is the size of this additional data (I imagine we only transform metadata)?

# More Detailed Comments:

Page 1:
Line 18: What is "research data"? At this point it is not clear.
Line 19: What is searchability? Also not clear at this point.
Line 20: The sentence "The presented semantic object mapping..." is not fully clear. Presented where, which object? Is this about the contribution, or are we still describing the problem?

Second paragraph in the abstract is also not clear. The problem that the paper is addressing should be better described.

Line 51: "Semantic resources object mapping": it is not clear what exactly is mapped to what.

Page 2:

Line 3: the term "resource" used here is not clear (it is also used elsewhere in the text). Only later can one realize that it means "RDF resource", but then sometimes we talk about URIs.

Section 1.1 should be more direct and to the point. Many sentences are too generic and may not be crucial for the development of the paper. At this point it is still not clear what problem the paper is addressing, and this makes the paper less understandable. I would shorten 1.1, taking only the paragraph on page 3 from lines 1-36, and add some description of PID. In any case, prolonging the definition of the problem is not good for readability.

Page 3:

Line 22: Why is searchability so important? SHACL is in RDF format as well and can be queried with SPARQL, so it is not clear what other kind of search one would like to have. SPARQL, similar to SQL, is a query language; I believe that for any data scientist this is basic tooling (how much simpler can it be?). Again, it becomes clearer later (basically at the end).

Line 24: What is Elasticsearch search index?

Line 32: What is Coscine Knowledge Graph

Line 44-47: "The background of the presented paper is the elaboration and evaluation of a resource object mapping to build such a search considering the data structure and requirements of Coscine (see section 2.6)." Completely not clear. I am not sure if this is an English problem or the authors fail to define notations they use in the paper.

Page 3, second column

Lines 1-22: First, these observations seem not fully correct; second, why is this important for the problem (which we still do not know at this point)? Adding more text that is not crucial for the paper makes reading rather frustrating.

Line 35-36: So, a non-technical user who is also a data scientist (or who wants to use scientific data)? I am not convinced that you actually have a concretely defined scenario here. Can you be more precise?

Page 4

line 4: which is -> that. Also, how does SHACL help search here?

lines 1-29: Here we discuss the contributions, but again it is not clear what they are. This part would require significant rewriting. E.g., why is the data now translated to JSON, and how is SHACL applied to JSON?

Page 5

CWA and OWA do not require full explanations, as they are not relevant for this paper and are very much standard knowledge. This section can be dropped completely.

Page 6.

Maybe one can spend more space on the details of the SHACL language, e.g., what the syntax and semantics are and how the validation works, since this is less obvious (even the W3C document is not the most precise).
For instance, these two works provide abstract definitions for SHACL and inference rules, which can be taken as an example:

https://link.springer.com/chapter/10.1007/978-3-030-30793-6_31
https://link.springer.com/chapter/10.1007/978-3-030-00671-6_19

Page 6 and 7

Readability in many sections can be improved by adding a sentence or a paragraph describing what is going to be presented. E.g., Fig. 1 is again not that important for the story if the problem is not clearly defined. It still reads as motivation for the problem, although we are already on page 7.

Page 9

I did not understand what the additional rules are. What is "referenced data in the last step"? Maybe one can simplify these examples (not making them full-fledged) and focus on stating more concretely what the issue is.

Page 10

The meaning of Def 1 is not clear. Actually, what is the mapping? Could you give an example of how you define it?

Def 1 assumes O (the object) to be a literal. What if it is another object, and how do you deal with that? E.g., is <:John , :likes , :Mary> ignored, or does Mary become a literal? Also, are there any consequences of such flattening of an RDF graph into a JSON tree?
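
The flattening question can be illustrated with a minimal sketch. It assumes, purely for illustration, that an object IRI is replaced by its `rdfs:label` when one exists and is otherwise kept as a string; the function `flatten`, the prefixed names, and the sample triples are hypothetical and do not reproduce the paper's Definition 1.

```python
import json
from collections import defaultdict

RDFS_LABEL = "rdfs:label"


def flatten(triples, subject):
    """Flatten the outgoing edges of one subject into a JSON-ready dict."""
    # Collect labels so object IRIs can be replaced by readable strings.
    labels = {s: o for s, p, o in triples if p == RDFS_LABEL}
    doc = defaultdict(list)
    for s, p, o in triples:
        if s != subject:
            continue
        # An object with a known label becomes that label; anything
        # else (literal or unlabeled IRI) is kept unchanged.
        doc[p].append(labels.get(o, o))
    return dict(doc)


# Hypothetical sample graph with a cycle between :John and :Mary.
triples = [
    (":John", RDFS_LABEL, "John"),
    (":John", ":likes", ":Mary"),
    (":Mary", RDFS_LABEL, "Mary"),
    (":Mary", ":likes", ":John"),
]

john = flatten(triples, ":John")
print(json.dumps(john))
```

In John's document, `:Mary` survives only as the string "Mary": the node identity and the edge back from `:Mary` to `:John` cannot be recovered from that document alone, which is one consequence of turning a graph into a tree.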

Review #3
By Simon Steyskal submitted on 28/May/2021
Suggestion:
Major Revision
Review Comment:

## Abstract
---

```
[p.1, 17]: "an RDF based metadata" -> a metadata? I would either remove "an" or s/metadata/metadata record/
[p.1, 19]: "Searchability is achieved by SPARQL, which, however, requires considerable technical knowledge." -> but SHACL doesn't?
[p.1, 20]: "an JSON object" -> a JSON object
[p.1, 20]: "... is an approach which is less dependent on the users background." -> wrt? why? because there's no SPARQL involved?
[p.1, 22]: "The evaluation was performed based on.." -> the evaluation of what? what aspects of your approach are you going to evaluate?
[p.1, 23]: "a solution was found that is equivalent in these terms." -> equivalent to what? what terms?
[p.1, 24]: "Through the integration of the developed mapping in data management platform Coscine," -> "mapping in"? did you mean to say "within the data man..."?
[p.1, 25]: "which at the same time promotes the subsequent use." -> subseq use of what?

[general]: I would appreciate 1-2 sentences elaborating on the concept of "research data". (what it's used for, what information it usually contains, etc. pp)
```

## Introduction
---

```
[p.1, 41/48]: since you are citing other W3C standards too -> add citations for RDF and SHACL
[p.1, 43]: "A standardized presentation of the data makes the information machine-readable and thus a search possible" -> does it though? what if the standardized way of presenting certain (research) data is to make handwritten copies? rephrase!

[p.2, 45]: "research data should be documented ... and assigned to the researcher so that queries regarding the data can be made." -> assigned to the researcher? wouldn't linking it to the actual publication/study/etc. the data was a part of, make more sense? I could see myself looking for research data by traversing through a "linked" list of publications rather than asking for something along the lines of "give me all research data from 2020 that is associated with Sarah Bensberg"..
[p.2, 29]: "instead, every single researcher is under obligation" -> obligated to do what? curate data? rephrase!

[p.3, 34]: "because it is to be benefited the from structure and inference rules," -> what? who is to be benefited? the Coscine KG? clarify!

[p.4, 4]: "which is relies on SHACL" -> "which relies on SHACL" or "which is relying on SHACL"
[p.4, 18]: "The mapping applies inference rules ... to inject further human knowledge into the object." -> what? what human knowledge? are the inference rules inferring human knowledge which is subsequently injected into the object? is that even possible? ;)
```

## Background
---

```
[p.4, 7]: "conceived data model" -> "conceived data models"
[p.4, 16 and 39]: "Dublin Core [21]" vs "Dublin Core [23]" -> either add "terms" to [21] or merge citations

[p.5, 5-7]: "If a search is made for a specific piece of information within this system, the correct answer is found and only this answer exists" -> I get where you are coming from, i.e. broadly speaking: under CWA the absence of information means that said information doesn't exist as compared to OWA where absence of information doesn't mean it's not existing but just hasn't been "discovered" yet.. if that was also the message you wanted to convey, please rephrase!
[p.5, Example 2]: The inconsistencies shown in this example rely on the fact that "a person can only live in one country" which, however, is the only thing not formalized.. please add it to both parts of the example!
[p.5, 46-51]: "SHACL can be divided into the two languages SHACL Core language and SHACL-SPARQL. SHACL shapes thus provide the means to describe an application profile with technologies of the Semantic Web" -> SHACL shapes provide the means to describe app. profiles, because SHACL can be divided into SHACL core and SHACL-SPARQL? clarify and/or move the last sentence up (right after the validation part).

[p.6, 1]: either provide some context as to why you are providing an example of a SHACL shape or remove it
[p.6, 8-9]: "Using sh:property a property is defined which in the example corresponds to ex:author. " -> no, a shape of the property is defined not the property itself
[p.6, 11]: "class ex:author" -> "class ex:Author"
[p.6, 24]: "Coscine uses a project-based system" -> I would s/project-based/project-centered/ (or something along those lines)
[p.6, 31]: "which provides a resolution via the Handle system [34]" -> what's the Handle system?! provide some context!
[p.6, 42-44]: This reads like it's the fault of the RDF model's flexibility that you cannot define "specifications or req.".. but is it though? a counterexample: if I serialize my SHACL shapes graph as .ttl, I can express "spec. and req." while still using RDF (+ its flexibility).. right? So I would rather argue it was the lack of a suitable/dedicated language for validating RDF graphs against constraints and not primarily (just) the fault of RDF's flexibility.
[p.6, 51]: "allow inconsistencies and supposed contradictions" -> what are "supposed contradictions"?
[p.6, 28-29]: I would rephrase this to "a sh:NodeShape allows for validating that its focus nodes meet the shape's conditions, otherwise..."
[p.6, 39-51]: move the paragraph describing SHACL application profiles in front of the previous sentence/paragraph where you mention SHACL AP for the first time (or make a reference to it)

[p.7, 28]: remove "coscineengmeta:subject" (copy/paste error from [p.23, 11])
[p.7, 46-47]: how's that? why?
[p.7, 50]: "The illustrations shows" -> -s
[p.7, 2]: "Data types are also transformed" -> what data types? you mean sh:datatype constraints?
[p.7, 15]: "validate the metadata using the shape" -> what shape?
[p.7, 20]: s/Example B/Appendix B/
```

## Resource Object Mapping
---

```
[p.7, 48]: Within the resource graphs, predicates
[p.7, 51]: "is not interested in the URIs or does not even know it" -> doesnt even know what?
[p.8, 3]: "depending on the instance or class" -> what does this mean?
[p.8, 7-8]: "However, all used classes and bounding conditions are known" -> known by whom?
[p.8, 16]: "and thus make it explicit" -> make what explicit?
[p.8, 40]: s/\$this/$this/
[p.8, 48-49]: dfgCS:A409-02 a dfgCS:A409-02 -> what?

[p.9, 5]: dfgCS: a dfgCS: -> what?
[p.9, 12]: remove the weird linebreak symbol
[p.9, 34-37]: Why are human beings aware of further information and more importantly, why is this information withheld from a machine? who withholds it? Use the explanation of Example 11 here too.
[p.9, 37]: "The rules are used to slightly close this gap" -> why only slightly? why not attempt to close it completely?
[p.9, 14]: s/\$this/$this/
[p.9, 16]: ID == EPIC Persistent Identifier (PID) ? if not, what's the difference?
[p.9, 37-39]: just out of curiosity, is there any particular reason for storing version numbers as ints? I probably would have defined them as strings and thus also allowed for minor/major version increments like "1.1.7" (nvm, in Example 11 you get the newest version by sorting the entries by their version number..)

[p.10, 13]: this should actually be v2 rather than v1 right? => ` ex:isNewestVersion true .`
[p.10, Example 11]: what happens if there's a new version v3, which supersedes v2 and thus would generate following additional triple for v2 -> ` ex:isNewestVersion false .`? wouldn't this lead to the following -> ` ex:isNewestVersion false, true .`?
[p.10, 42-43]: "Also, the additional rules do not have to generate boolean objects." -> what? also why not?
[p.10, 20]: "by using the additional rules." -> THE additional rules? or any additional rules? if it's the former, what are those dedicated "additional rules"?
[p.10, 32]: what edges are you talking about? the edges of the graph? if yes, what are the "objects" of a graph? are objects == vertices/nodes? clarify!
[p.10, 35-37]: what's the difference between resources I and URIs U? why are only subjects \in I? What about blank nodes? are blank nodes == I == "resources"? assuming RDF as a basis, why go for URIs and not IRIs (see https://www.w3.org/TR/rdf11-concepts/#section-IRIs)? clarify!
[p.10, 40-45]: you aren't defining the actual mapping to JSON but trying to define the resulting JSON object.
what's the difference between \hat{G} and G'? how's \hat{G} actually defined?
[p.10, 43-44]: S', P', O' are missing their "i" subscript;
if \in G', and assuming G' is the "resulting mapped >RDF< graph of G' how can P'_i then be a string? or did you want to define a JSON object where RDF properties become "string" keys and RDF objects the respective values?

[p.11, 43-50]: examples 7 and 9 only construct rdfs:label triples, so where are the additional "creators" coming from?
[p.11, 3]: it's not clear that apparently you use (only?) rdfs:label values when "flatmapping" properties to their JSON values.. please emphasize this earlier in the paper..
[p.11, 38-40]: what? also is it: (same properties are defined in different SHACL shapes (with the same data type or are described using instances of one class)) or ((same properties are defined in different SHACL shapes with the same data type) or are described using instances of one class))?
[p.11, 42-43]: what about things like xsd:double, xsd:date, etc.? (in Table1 you mention "different data types" -> string.. in case "different datatypes" means in that context "all other datatypes not mentioned" -> clarify this here too! )

[p.12, Table 1]: what if a property is described only in one shape?
```
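
To make the concern about Example 11 concrete, here is a minimal Python sketch of what could happen if the rule is re-run after a new version appears. The resource names, the `ex:isNewestVersion` tuples, and the helper `infer_newest` are hypothetical and not taken from the paper; the sketch only assumes that inferred triples are added to the graph and never retracted.

```python
def infer_newest(versions):
    """Return the inferred triples of one rule run.

    `versions` maps a (hypothetical) resource name to its integer version.
    """
    newest = max(versions.values())
    return {
        (name, "ex:isNewestVersion", number == newest)
        for name, number in versions.items()
    }


# First run: only v1 and v2 exist.
graph = set()
graph |= infer_newest({"ex:v1": 1, "ex:v2": 2})

# Second run after v3 is added. Assumption: old inferences are kept.
graph |= infer_newest({"ex:v1": 1, "ex:v2": 2, "ex:v3": 3})

values_for_v2 = sorted(o for s, p, o in graph if s == "ex:v2")
print(values_for_v2)  # [False, True]

# Side note on integer vs. string versions: dotted strings compare
# lexicographically, so they need parsing before they can be sorted.
dotted = ["1.10.0", "1.9.0"]
newest_naive = max(dotted)
newest_parsed = max(dotted, key=lambda v: tuple(int(x) for x in v.split(".")))
print(newest_naive, newest_parsed)  # 1.9.0 1.10.0
```

If the implementation instead recomputes all inferred triples from scratch on every run, the conflict does not arise; the paper should state which of the two applies.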

## Evaluation
---

```
[p.12, 17-19]: "the context of Coscine was used ... in the context of searchability was evaluated" -> what? is the first "context" you mention actually referring to a JSON-LD context?
[p.12, 19-22;50-51]: "For the considered tests described in this section...All of them are based on ... [43]" vs "The evaluation is based on data record [40]" => [43] == [40]?
[p.12, 27-28]: what assumptions are you referring to? your hypothesis? clarify!
[p.12, 39-11]: it would be nice if you could visually indicate the "specifics" about each query.. e.g. "Data records about \textbf{political left}".
[p.12, 18]: s/on IT Center/at the IT Center/
[p.12, 28]: s/in the version/in version/
[p.12, 30-32]: why did you change those parameters? what do they do?
[p.12, 50]: a dataset with just 35 records is imo not only super small but also insufficient for drawing any conclusions wrt. performance overhead etc. (btw. up to this point the reader doesn't really know that there's also another (larger) dataset you did use for evaluation later on..) why not just use the "bigger" dataset?

[p.13, 28-29]: the ranking is based on what?
[p.13, 30-31]: what are stop words?
[p.13, 51-8]: "This semantic problem also exists when using SPARQL queries" -> would it though? with SPARQL, I, e.g., could just use a more sophisticated query and/or utilize the structure of the RDF data?
[p.13, 18-27]: I appreciate the reference to Nielsen [46].. I wasn't aware of this work, thx! :)
[p.13, 33]: what? rephrase!
[p.13, 38]: what visibility restrictions?
[p.13, 44-45]: "An execution time of 221 ms was measured." -> on average? how many runs? ..?
[p.13, 50-51]: "was executed for each query 100 times in a row" -> did you cater for any "caching" that may occur between consecutive runs? hot/cold loads?

[p.14, Table 4/5]: any idea why the avg mapping time stayed the same/decreased for the larger record?
[p.14, Table 5]: any idea what could've caused the extreme outlier of +- 129 for the "non Comput. Human Consci." intention?
[p.14, 31-35]: again, why not use the 10k record dataset as your "primary" dataset?

[p.15, Table 7]: Reindexing takes +- 1min is this a normal/expected range? if not, any ideas as to why?
[p.15, 29]: "The average value for 100 consecutive executions is 3217 ms." -> is this somehow represented also in Table 7, where you report 837+-143 and 3055+-233 as times for adding records.. ?
```
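
The questions about run counts and caching (p.13, 44-45 and 50-51) could be answered by reporting the first (cold) run separately from the remaining (warm) runs, together with mean and standard deviation. A minimal sketch of such a harness, where `run_query` is a hypothetical stand-in for whichever SPARQL or Elasticsearch request is being measured:

```python
import statistics
import time


def benchmark(run_query, runs=100):
    """Time `run_query` and separate the first (cold) run from the rest.

    Needs at least three runs so that a standard deviation over the
    warm runs can be computed.
    """
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    cold, warm = timings_ms[0], timings_ms[1:]
    return {
        "runs": runs,
        "cold_ms": cold,
        "warm_mean_ms": statistics.mean(warm),
        "warm_stdev_ms": statistics.stdev(warm),
    }


def run_query():
    # Hypothetical placeholder workload; replace with the real request.
    sum(range(10_000))


result = benchmark(run_query, runs=100)
print(result)
```

This only separates in-process warm-up effects; server-side caches of the triple store or of Elasticsearch would still have to be cleared or restarted between measurement series to obtain true cold-start numbers.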

## Related Work
---

```
[p.16, 3-5]: what? what can be found in other properties?
[p.16, 27-28]: "the domain is not known in advance or can change by further " -> "and can change" right?
[p.16, 33-35]: move "is interesting" to the end of the sentence
[p.16, 41-43]: why is it possible to store "human knowledge" using inference rules? what's exactly "human knowledge" in that context?
```

## Concluding Remarks
---

```
[p.16, 5-6]: " when prepared to generated SPARQL queries." -> what?
[p.16, 16]: which are ...?
[p.16, 17;22]: "Two disadvantages of the mapping...do not bring any disadvantage for them" -> so is it now a disadvantage or not? rephrase!
[p.16, 24-26]: what? extension of what? great influence == ? elaborate!
[p.16, 27-32]: because ...?

[p.17, 1-2]: how does this relate to [p.16, 43-44] where you say "This means that all knowledge or structures must be aware of and considered in adv." ?
[p.17, 10]: "SHACL validated SHACL rules" -> " SHACL validated; SHACL rules" ?
```

## References & Appendix
---

```
[20,24,35,39,...]: Please cite W3C standards (and/or URLs in general) properly!
[general]: consider ordering the references alphabetically making it easier to spot any duplicates

[p.23, 11]: shouldn't it be "sh:class dfg:faecher" ?
```

Review #4
By Vladimir Alexiev submitted on 29/Jul/2021
Suggestion:
Reject
Review Comment:

# Intro

The paper presents a practical implementation of search for a research metadata repository based on Elasticsearch.

In addition to this submission I have:
- Browsed Sarah Bensberg's MS thesis (An efficient semantic search engine for research data in an RDF-based knowledge graph)
- Browsed through the software artefacts: https://git.rwth-aachen.de/coscine/research/semanticsearch
- Read about the overarching project COSCINE (How to Manage IT Resources in Research Projects- Towards a Collaborative Scientific Integration Environment (EUNIS 2020); Introducing coordinated research data management at RWTH Aachen University- a brief project report)
So I think I have a good understanding of the work.

# Major Defects

## Lack of Originality

The importance of FTS doesn't need to be explained, and the benefits of using established FTS engines are clear. Many semantic repositories already use this approach, eg:

- Rdf4j has FTS with the Lucene SAIL (https://rdf4j.org/documentation/programming/lucene/). There are implementations org.eclipse.rdf4j.sail.* for lucene, elasticsearch (https://rdf4j.org/javadoc/latest/index.html?org/eclipse/rdf4j/sail/elast...) and solr
- Jena has FTS via Lucene (https://jena.apache.org/documentation/query/text-query.html). Solr and Elasticsearch were offered in earlier versions but not in the latest version. Indexing is configured declaratively with RDF (https://jena.apache.org/documentation/query/text-query.html#configuration)
- Stardog uses Lucene (https://docs.stardog.com/query-stardog/full-text-search)
- Neptune uses Elasticsearch (https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html)
- Halyard uses Elasticsearch (https://merck.github.io/Halyard/usage.html#cooperation-with-elasticsearch)
- Ontotext GraphDB has connectors for Lucene, Solr, Elasticsearch (and Mongo) (https://graphdb.ontotext.com/documentation/enterprise/connectors.html). They are configured with declarative JSON and allow flexible filtering and mapping (https://graphdb.ontotext.com/documentation/enterprise/elasticsearch-grap...). They work incrementally.

Section "Related Work" doesn't comment on these core functionalities and instead discusses some more peripheral applications of FTS.

## Key Problems of RDF FTS Indexing

IMHO the important problems/features to be considered in RDF FTS Indexing are:

- Using RDF inference (in particular incremental inference) to simplify properties passed to indexing
  - All examples in 3.1. Literal Rules can be implemented with RDFS+ reasoning (`rdfs:subPropertyOf, owl:inverseOf, owl:propertyChainAxiom, owl:TransitiveProperty`), so there's no need to use SHACL rules. While incremental inferencing can be implemented efficiently (eg as in GraphDB), that's much harder or impossible for SHACL SPARQL rules.
  - Some SPARQL inferences are used to index more complex situations (eg `is_newest_version` marking the latest version of a resource or `isLastStep` marking the last step of a workflow). These cannot be implemented with GraphDB incremental rules (which don't support negation due to the monotonic nature of OWA). However:
    - Rather than comparing version numbers, these should be mapped to explicit RDF relations `prevVersion, nextVersion`
    - Then they can easily be checked at query time (eg `NOT nextVersion:*`, `NOT predecessorOf:*`)
    - They could be mapped to a Boolean at indexing time (eg such remapping is allowed by GraphDB's connectors)
- Automatic and incremental FTS indexing (synchronization to Elasticsearch), which requires setting notifications for certain triple patterns, and computing which FTS entities (Elasticsearch documents) need to be resubmitted for indexing.
  - The paper doesn't address that: SHACL queries need to be rerun periodically, or through some external mechanism on document creation or update.
  - What happens on name change of a high-level organization (Example 8) or subject (Example 9)? How will you know which documents to reindex?
- Providing high scalability eg through Elasticsearch cluster, and the interplay of an eventual RDF database cluster with that Elasticsearch cluster.
  - The paper puts forth "ES should be used as a search engine because it is scalable" as motivation, but does not explore this question
- Elasticsearch supports nested objects and relations between objects.
  - These are important for more complex indexes (consisting of several entity types) and more complex searches, eg:
    - "find papers on certain subjects, and return the organizations of the authors"
    - "find cross-disciplinary papers, i.e. papers written by authors who specialize in different fields"
    - "find resources where two observation values are in a certain comparison relation, also taking the units into account"
  - They are also important for implementing facets (see further) and for navigating from a document to its subject(s) to related documents.
  - In contrast, the paper maps any RDF entity to a flat Elasticsearch document
  - Also, the identities of referenced objects (eg Authors, Subjects) are stripped (see "the mapping of Definition 1 is not injective")
- A true semantic search must incorporate Conceptual search that works in conjunction with autocompleting selection from thesauri using multilingual word resolution, followed by hierarchical expansion (eg along skos:broader).
  - Translating the user need "Data records about political left" to the mere word "left" is inaccurate because the word is highly ambiguous, and would miss eg a German article talking about the same concept.
  - To use conceptual search effectively, the records/documents must be indexed using Concept Extraction (Semantic Text Analysis). There are many tools, such as Babelfy, Ontotext CES/TAG, DBpedia Spotlight, etc.
  - The paper only mentions NLP in brief: "Based on the state of research on search in Natural Language Processing (NLP) oriented approaches from general Information Retrieval..."
  - The paper doesn't consider semantic annotation or semantic search and claims "For a computer, this semantic difference is not recognizable", blithely obliterating a whole burgeoning area of research
- Fields should be mapped to appropriate indexing regimes (eg phrase with Language Analysis for text, but keyword for identifiers like DOI or ORCID ID). Appropriate Analyzers for the particular languages of the labels should be used for stemming, stop word removal, etc.
  - The authors don't seem to address that in their indexing, and just mention in Concluding Remarks "ES is additionally able to map synonyms and find results by using stemming".
- Faceting is just as important as FTS: I don't know any Digital Library without some sort of faceted search. FTS engines implement faceting in a great way, and it's folly to use SPARQL COUNT to implement facets. To use Elasticsearch you need to preserve the identity of each facet value, specify how to do faceting, and use faceting features in the query.
  - The paper doesn't consider faceting at all.
  - The MSc thesis considers faceting but doesn't seem to use Elasticsearch for faceting, doesn't rank it highly, and stumbles upon the problem of stripped identities of facet values: "a workaround must be developed so that the fields can be clearly distinguished and offered to the user as search fields. This problem exists in the faceted search... because there the labels and not the URIs are used for display"
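
To illustrate several of the points above together (keyword vs. analyzed text fields, nested objects that keep the identity of a facet value, a query-time check for the newest version, and faceting via aggregations), here is a minimal sketch in Python. The field names (`doi`, `title_en`, `title_de`, `subject`, `nextVersion`) and the helper `search_body` are hypothetical and not taken from the paper or from Coscine; the dictionaries only use standard Elasticsearch mapping and query DSL constructs and would be sent to a cluster by whatever client the authors prefer.

```python
import json

# Hypothetical index mapping: identifiers are exact-match keywords,
# labels are analyzed text with a language-specific analyzer.
mapping = {
    "mappings": {
        "properties": {
            "doi": {"type": "keyword"},
            "title_en": {"type": "text", "analyzer": "english"},
            "title_de": {"type": "text", "analyzer": "german"},
            # Nested objects keep the identity (IRI) next to the label,
            # so facet values stay distinguishable.
            "subject": {
                "type": "nested",
                "properties": {
                    "id": {"type": "keyword"},
                    "label": {"type": "text"},
                },
            },
            # Assumed to hold the IRI of the successor version, if any.
            "nextVersion": {"type": "keyword"},
        }
    }
}


def search_body(text):
    """Build a search request: full text, newest versions only, subject facet."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title_en": text}}],
                # "Newest version" = no successor exists (checked at query time).
                "must_not": [{"exists": {"field": "nextVersion"}}],
            }
        },
        "aggs": {
            "subjects": {
                "nested": {"path": "subject"},
                "aggs": {"by_id": {"terms": {"field": "subject.id"}}},
            }
        },
    }


body = search_body("political left")
print(json.dumps(body, indent=2))
```

With such a layout the facet buckets are keyed by the subject IRI rather than by its label, which would avoid the "stripped identities" problem quoted from the thesis.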

# Lesser Objections

- IMHO the title is misleading. It's not about validation but about using SHACL to extract text for indexing. It's unclear why is that better than using
- As other reviewers mentioned, there's too much unnecessary introductory material. Sections 2.2 to 2.6 should be shortened significantly
- The Evaluation approach uses unrealistically small data: 10k records. Consider that Microsoft Academic has 290M records (papers and patents), PubMed has 30M records, etc: 10k is a very small number.

## Decision

Decision: reject. This work is good enough for a MSc thesis, but not for SWJ publication.