Quantifiable Integrity for Linked Data on the Web

Tracking #: 3185-4399

Christoph Braun
Tobias Kaefer

Responsible editor: 
Elena Demidova

Submission type: 
Full Paper
We present an approach to publish Linked Data on the Web with quantifiable integrity using Web technologies, and in which rational agents are incentivised to contribute to the integrity of the link network. To this end, we introduce self-verifying resource representations, that include Linked Data Signatures whose signature value is used as a suffix in the resource's URI. Links among such representations, typically managed as web documents, contribute therefore to preserving the integrity of the resulting document graphs. To quantify how well a document's integrity can be relied on, we introduce the notion of trust scores and present an interpretation based on hubs and authorities. In addition, we present how specific agent behaviour may be induced by the choice of trust score regarding their optimisation, \eg, in general but also using a heuristic strategy called Additional Reach Strategy (ARS). We discuss our approach in a three-fold evaluation: First, we evaluate the effect of different graph metrics as trust scores on induced agent behaviour and resulting evolution of the document graph. We show that trust scores based on hubs and authorities induce agent behaviour that contributes to integrity preservation in the document graph. Next, we evaluate different heuristics for agents to optimise trust scores when general optimisation strategies are not applicable. We show that ARS outperforms other potential optimisation strategies. Last, we evaluate the whole approach by examining the resilience of integrity preservation in a document graph when resources are deleted. To this end, we propose a simulation system based on the Watts-Strogatz model for simulating a social network. We show that our approach produces a document graph that can recover from such attacks or failures in the document graph.
Full PDF Version: 

Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 25/Oct/2022
Minor Revision
Review Comment:

This paper studies and evaluates how to measure integrity for linked data by exploring linked data signatures and signed uris. The topic is new and the research relevant, which makes the article a useful piece of work. The paper is well organized and written. The authors provide background with the definition of relevant concepts.
The authors proposed the concept of trust scores to express document`s integrity. Although several technical aspects were studied around this notion (e.g. optimization), and strategies including heuristics, the study lacks a discussion around the human-centered aspect of the solution. How burden to users is a web of self-verifying resource representation because of all updates necessary as more interconnected resources are? This is a fundamental topic which is not discussed at all by the authors. How hard and viable would be to people update original resources to make your proposal ready to goal in real-world settings?
The study lacks further analyses and discussions regarding related work. Authors should better provide highlights of this study regarding its originality with respect to literature. A synthetic table comparing key related investigations would add value in this matter in section 3.
Although the paper provides several experimental evaluations, it lacks a discussion section to present overall analyses over the findings, discuss limitations and treats to validity.

Review #2
By Marvin Hofer submitted on 01/Nov/2022
Minor Revision
Review Comment:

=== General aspects of the paper ===
The paper "Quantifiable Integrity for Linked Data on the Web." describes in concise and clear terms how to establish trustable Linked Data (LD) networks by introducing the concept of LD integrity and presenting a decentralized technical solution for tamper-evident resources with traceable authorship.
Further, the authors discuss the notion of trust scores and strategies to incentivize agents to contribute to the integrity of an interlinked network.
Their technical solution combines Web technologies and the Solid Protocol to conduct self-verifying resource representations.

The work is sufficiently motivated and fits the journal's research track as it addresses several relevant issues related to the Semantic Web.
The authors highlight current and relevant issues of Linked Data integrity and trustworthiness.

=== Specific aspects ===
(1) Clarity and completeness of the descriptions.
Concerning the paper's writing, the overall language is very sound and comprehensible. The preliminaries section is very detailed and supports the reader by refreshing or understanding the necessary theoretical and technical principles. In chapter 4, the authors introduce a running example, presenting a theoretical architecture and scenario.

However, the intent to visualize the accessed problem for the reader can be beneficial, but why is it called a "running example" if it never occurs with this detail in the rest of the work?

Chapter 3 is excellent and covers an adequate amount of related work, showing current approaches and problems.

(2) Approach
The papers' foundation is supported by previous works (ref. 5, 11, 46). However, the contributions in this paper focus on new aspects or are extensions to earlier work. The authors separate their contributions into five sections, where sections 5-8 describe theories, concepts, and proposed approaches, while chapter 9 presents an experimental evaluation of the proposed solutions.

In chapter 5, "Self-Verifying Resource Representation" (SVRR), the authors propose their idea of self-verifying resource representation by introducing their idea of Singed URIs in combination with LDS.
Although the text describes the idea sufficiently, a final listing that includes the usage of Singend URIs as the subject or object of the wrapped tiples would be helpful. Furthermore, the listings should relate to the running example presented in Section 4.

Chapter 6, "A Web of Self-Verifying Resource Representations", investigates the idea of having SVRRs build a document graph. It describes how deletions must delegate to maintain the integrity of the graph. It also shows that integrity violations are detectable based on links set by other documents.
An unaddressed aspect regarding versioning aspects remains concerning creating or updating new documents. As the authors mention that they use RDF-Start to avoid the issue of accessing the state of triples directly, it still would be interesting to discuss the combination version strategies and the idea of SVRRs. For example, this is a common requirement for data, e.g., Wikipedia article revisions, leaving the opportunity to reference an older resource version instead of updating all transitively associated documents to preserve their integrity.

Chapter 7 is very sound. It utilizes the ideas proposed earlier to describe the definition of integrity preservation, integrity contribution, and different types of trust scores to provide a foundation of quantifiable web resource integrity. Additionally, the authors present the idea of using Hub- and Authority-based trust scores.

Chapter 8, "Score Optimisation", discusses the theoretical aspects of score optimization of newly created resources (documents) to increase their initial trust scores. (mentioned because of (3) Evaluation)

(3) Evaluation
A significant effort of the work focuses on evaluating the proposed ideas for trust score, score optimization, and possible integrity recovery in attacked or intentionally disconnected graphs (deletion of documents).

In 9.1, the authors present a formal analysis of three trust-score algorithms. They discuss their fitness as a trust score regarding the incentives for agents to link other agents' documents using an analysis based on game theory.
The authors provide proof of the insufficiency of the page-rank-based trust score in Appendix A, which shows that it punishes additional linkage.

In section 9.2, the authors evaluate their proposed heuristic to find the "next best link" (ARS) by comparing its correctness with other heuristics and an optimal link set.
Is there an explanation for the behavior shown in Figures 6 and 9 regarding the edge probabilities 0.1 and 0.05?
Additionally, the paper could mention the computational complexity of the compared heuristics and the optimal algorithm.

The last evaluation section, "Evaluation Cases Overview", shows an experimental evaluation of the recovery of document authority score in a corrupted graph (disconnecting the graph by deleting documents with a high betweenness centrality).
The evaluation executes a simulation of 4 phases and uses the Watts-Strogatz model for graph generation and the Monte Carlo method for randomness.
Although the setup description is highly sufficient, some possible interesting parameters are hidden from the user.
For example, the authors state, "The deletion of documents based on betweenness centrality is dependent on the created graph." which makes it unclear how many edges were deleted on average or what the threshold was.

Both experimental setups of 9.2 and 9.3 are missing a link to their implementation's source code, making it impossible to replicate their outcome.
In conclusion, each chapter between 5-8 was sound, but continuing with section 9 felt that later introduced or specified concepts would fit better in earlier sections. By reflecting on the structure and order of sections, the following formal suggestions arise 1) Explain the degree-based, PageRank-based, and HuA-based approaches in Section 7 (move). 2) Explain ARS and other heuristics in Section 8 (move).

=== Strengths ===
* a concise and clear writing style
* a comprehensive description of the current issues
* a sound methodology for the design of the technical solutions
* sufficient and broadly covering related work
* an overall good scientific evaluation concerning the proposed approaches
* extensive paper supplements through appendices A and B to show proof of made assertions

=== Weaknesses ===
* [minor] mismatch in figure 1, listing 1, and text (see "nonce" and "verification method.")
* [minor] mention of Emily in figure 3 but not in the text
* [minor] parts of chapter 4 as a running example are unconducive regarding the rest of the paper, as
* never mentioned again
* [minor] missing aspects regarding versioning strategies in 6. (not necessary to require delegation for each change)
* [minor] missing mention of computational complexity (or performance) of heuristics compared to the optimal algorithm.
* [minor] missing link or reference to experimental source code
* [minor] few IMO interesting numbers are missing in the evaluation (e.g., 9.3 deletion threshold)

=== Further formal comments ===
* Stating some legal issues that, if solved, would benefit the paper's appearance and readability.
renaming Chapter 4 into "Exemplary Architecture and Terminologie."
* An example of a final resource based on Listing 2 and the usage of Signed URIS would benefit clearness.
* I suggest using example triples related to the running example.
* 9.1.3 highlight the Degree-based, PageRank-based, Hub- and Authority-based terms different than Formal Analysis and Conclusion (maybe bold)
* Some parts of the paper (paragraphs in sections 5-8 and 9) are unintuitively structured concerning the specification of the two proposed algorithms (ARS and HITS) and other available methods.

In conclusion, the paper should be accepted, but minor revisions are required based on the comments.

Review #3
Anonymous submitted on 03/Nov/2022
Minor Revision
Review Comment:

This paper introduces an approach to publish linked data with quantifiable integrity, authors introduces different strategies for computing trust scores for quantifying the integrity of a web document.
- The paper contains original contributions that could be applied to increase the reliability of linked data published on the Web.
- The discussion and the proposed approach are sound in general, and provides meaningful insights.
- The paper is in general easy to follow.

- There are a lot of redundancies in the textual content of the paper and makes the paper unnecessarily long. I suggest the authors to reorganize the content throughout, maybe even merge some sections.
- Self-containment of the paper could be improved, e.g. to explain the not very well-know definitions or algorithms in the paper.

More comments:
- There is no formal definition or description of “representation” in Sec 2.1.1, although previous work of authors are cited, it would be better to define it here to make the paper self-contained.
- Many discussions in Sec2,3,4 are repeated in later sections with very similar statements. One example: the motivation and description of the scores are repeated in Sec 9.1.1. There are many more such cases throughout the paper, especially in Sec 9.
- Figure 3 is not showing the example as explained in the text, i.e. Charlie’s post linked from Alice’s), it would be good if it’s consistent with the text description.
- It would be good to have a brief description of the algorithms referred in Section 5, e.g. Hogan’s algorithm, SHA-256.
- Please double-check the quotation marks, e.g. ”upstream” should be “upstream”
- I’m not sure what is “c.p.” short for (P17, line 7)
- Maybe Sec8 can be accompanied by some visualizations.
- The ‘evaluation’ section is more of case study, discussion and proof of the algorithms, not sure if it suffices as evaluation.