A reproducible framework to assess the quality of linked open data in GLAM datasets

Tracking #: 3796-5010

Authors: 
Gustavo Candela
Meltem Dişli
Milena Dobreva
Sally Chambers

Responsible editor: 
Guest Editors 2025 OD+CH

Submission type: 
Full Paper
Abstract: 
Over the last decade, GLAM (Galleries, Libraries, Archives and Museums) organizations have been exploring new ways to make their content available using the Semantic Web and Linked Open Data (LOD). The growing number of GLAM institutions converting their collections to LOD, coupled with the increasing demand for high-quality data, has made the assessment of LOD quality a critical concern. In addition, there has been a significant increase in global interest among researchers in reproducible research, a cornerstone of Open Science, which requires code to generate the experimental results. This study aims to present a reproducible framework to assess LOD quality within the GLAM sector. Based on the literature, a set of data quality criteria was established, comprising 4 dimensions and 18 criteria. Subsequently, four LOD repositories were assessed according to these criteria. The assessments revealed that the LOD repositories performed well on accessibility, while they did not yield satisfactory results for other criteria, such as contextual information. These results can serve as a benchmark for other LOD repositories. Additionally, the study provides a detailed analysis that could be beneficial for other organizations and researchers interested in making their digital collections accessible and reusable as LOD. The study concludes by identifying further research based on the implementation in real practice for data spaces, the Cultural Heritage Cloud (ECCCH) and FAIR advancement in GLAMs.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 31/Mar/2025
Suggestion:
Major Revision
Review Comment:

The paper introduces a reproducible framework for assessing linked data quality within the GLAM (Galleries, Libraries, Archives, and Museums) domain. The proposed methodology applies an established quality assessment framework, structured into four dimensions and 18 criteria, to evaluate four datasets. The study highlights transparency, reproducibility, and practical applicability to real-world GLAM datasets.

Strengths
- Relevance to the Semantic Web – The paper aligns well with linked data quality assessment, particularly in the GLAM domain.
- Reproducibility & Transparency – The authors ensure reproducibility by providing publicly accessible Jupyter Notebooks for assessing the quality of Linked Open Data repositories.
- Comprehensive & Detailed Dimension Analysis – The breakdown of the 18 quality criteria offers valuable insights into various aspects of data quality.
- Scientific Rigor – The evaluation is based on a well-established quality assessment framework, derived from the literature.

Limitations & Areas for Improvement
- Limited Contribution for a Journal Article – While the release of a reproducible framework is valuable, the contribution lacks novelty. The paper primarily applies an existing framework rather than proposing new methodologies, insights, or enhancements in linked data quality assessment. A stronger methodological contribution would significantly improve the paper’s impact.
- Clarity in Dimension Categorization & Justification – Although the quality dimensions are extensively detailed, the paper does not explicitly distinguish which are tailored for GLAM and which are domain-independent. Furthermore, it is unclear whether these dimensions are solely inherited from the literature or have been specialized by the authors—clarifying this would enhance readability and transparency.
- Comparison with Open-Source Tools – Given the importance of reproducibility, the Related Work section should compare existing open-source tools used for linked data quality assessment. Additionally, KGHeartBeat should be properly cited using the ISWC 2024 resource paper instead of just linking to it.
- Dataset Selection Criteria – The process for selecting datasets is unclear. Explicitly stating the selection criteria would enhance transparency and reproducibility.
- License Configuration – Although the license is mentioned in the GitHub README, the project should be properly configured to include it.

Minor Comments
- Figure Readability – The readability of Figure 2 can be improved for better clarity and comprehension.
- Terminology Consistency (Datasets vs. Repositories) – The paper refers to four resources, but "datasets" is a more appropriate term than "repositories." Ensuring consistent terminology throughout the paper would improve clarity.

Review #2
By Marilena Daquino submitted on 02/Jun/2025
Suggestion:
Reject
Review Comment:


Subjectivity of the quality assessment and its motivation.
I agree on the importance of the topic at hand, but I would have appreciated it more if the task of quality assessment had been framed in the context of real-world tasks. At the end of the day, what to evaluate is a very subjective matter. One could argue that how we evaluate it should not be; in any case, it is very dependent on the objective.
- I would like the authors to elaborate a little more on the motivation, considering that the evaluation is a means and not the overall goal. The authors should therefore be clear on the goal of the assessment, possibly describing use cases that are less generic than the ones given now. That is, after an evaluation, what are the affordances of positively/negatively judged datasets?

State of the art
Can you elaborate on the rationale for collecting the datasets in Table 1? Are these only examples? And what is the purpose? Are you showing the diversity of ontologies used?
In 3.1 you briefly gloss over the criteria used to build your framework, saying only that you relied on a comprehensive literature review. However, the state-of-the-art section only mentions frameworks, without diving into the details of their choices and of why you selected your criteria out of them. The state-of-the-art section, or a methodology section, must give an account of the measures and explain the authors’ selection method.
Moreover, again at the beginning of Section 3.1, the authors claim the selection is based on what they could represent in RDF and therefore test using SPARQL. This limitation must be justified, and the criteria excluded because of it should be accounted for.
Lastly, the selection of ontology terms to represent information quality aspects is not addressed in the state of the art with respect to criteria that the authors deem important for their framework.

Measures
Bearing in mind that I still struggle to understand why the authors present this specific selection of measures, and why and to what it is relevant, I need some clarification on specific aspects.
- Why would interlinking be a measure to positively/negatively judge a dataset? Does it take into account the (potential) novelty of the dataset?
- Why are syntactic validity and the following criteria described in terms of examples? Why can the user choose the sample? Is the user supposed to write their own SPARQL queries to run the test? In that case, the framework is not really reproducible, and the overall approach should be better explained, since from the introduction my impression was that a user can reuse your framework as-is and the results MUST be the same every time one runs the code.
- Overall, the measures are far too many and their descriptions far too superficial to justify a novel contribution on this topic. The selection of ontology properties to represent such aspects is likewise presented as a factual piece of information, with no demonstration of their adequacy or of their being the best fit.
- It is unclear after reading the section how the measures are actually assessed. E.g., the authors mention “It can be measured by identifying machine-readable information (e.g., dcterms:modified)” (p. 11, l. 28). Is the statement including the mentioned property expected to already be in the data source, or is that a statement built by the validators? Please clarify. If the former, please also reflect on the feasibility of your framework, which would require a huge amount of annotations in the original sources that are not easy to obtain (see the sketch below for the kind of check this seems to imply).
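
To make this concrete, the following is the kind of check the quoted sentence seems to imply, as I understand it: a minimal sketch (my own reconstruction, not the authors' notebook code), assuming a SPARQLWrapper-based test and a placeholder endpoint. It can only return true if the dcterms:modified statement is already exposed by the source.

    # Minimal sketch: an ASK query that only succeeds if the data source itself
    # already exposes dcterms:modified. The endpoint URL is a placeholder.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint

    QUERY = """
    PREFIX dcterms: <http://purl.org/dc/terms/>
    ASK { ?dataset dcterms:modified ?date . }
    """

    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()

    # True only if the annotation is already present in the source;
    # the validator cannot create it.
    print("dcterms:modified present:", result["boolean"])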

Results and discussion
Why do you mention the Zeri dataset at all as one of the sources for validation and then not include its results in Table 5? Only later, in the discussion, do we understand that this is because the server was down. I suggest the authors remove it as a case study.
The discussion does not highlight any particular success or interesting result obtained, other than promoting reproducibility with a Jupyter notebook. I believe this is not sufficient to justify a journal article, and the authors should consider offering a more substantial added value. I firmly believe the work has potential and is very useful to the community, although it is currently a little too naive to be presented as a finalised, ready-to-use piece of work.

Please deposit your code on Zenodo/Figshare to get a DOI and include it in the article.

Review #3
Anonymous submitted on 09/Jun/2025
Suggestion:
Major Revision
Review Comment:

An interesting paper that is well suited to the special issue. The main contribution is a systematic approach to operationalizing quality criteria for linked open data in the GLAM domain. The motivation for the paper is well described, the proposal is well documented, and the paper presents testing on three datasets that serves as a proof of concept. There is, however, room for improvement in this paper in structure, content, and terminology.

The main contribution of the paper is described as a reproducible framework and the provision of reproducible code and examples. The framework is the most prominent contribution of this work, whereas the queries and any scripts are a convenient byproduct for the community with less scientific value in themselves.

The use of the term “code” is not optimal, as it is very generic and rarely used by itself. If the intent is to capture both Python code and SPARQL queries, a term such as “scripts” or “instructions” may work better.

The paper mentions Tim Berners-Lee as the sole inventor of the Semantic Web, ignoring the other authors of the cited paper.

The background starts by presenting numerous linked open data publication initiatives within the GLAM community. This is interesting as background and motivation, but the presentation falls between an example-based introduction and a systematic overview. Presenting a table of LOD repositories makes it appear to be the systematic result of some study, but it is not documented how this list was compiled. The term “data modelling” is described as playing a “vital role”, but the authors mainly discuss the vocabularies that have been the outcome of (not always very) elaborate and systematic modelling.

The background section on assessing data quality in LOD is extensive and covers existing work and projects well, but it could benefit from a clearer structure that better highlights the various topics explored in the background study.

The data quality criteria section goes through a variety of criteria organized according to a set of dimensions. Each criterion is defined, and references are given. This brief style of presentation has the benefit of being clear and consistent, but the drawback is that it lacks a justification and contextualization of each criterion. A single table showing which criterion has been explored or presented in which paper would have made this a more systematic overview. As it stands, the selection of criteria appears somewhat pragmatic rather than guided by a systematic approach. The formal definitions are somewhat overdone and could have been presented in a much more readable way: all false outcomes are 0, and there is no reason to repeat this for every criterion. Most SPARQL queries are trivial and would be better presented as text only, with the queries in an appendix. Those who are familiar with SPARQL do not need to see these queries; those who are unfamiliar will be better off with a precise textual description.
The criteria are discussed in an example-based style which sometimes lacks proper justification, and the chosen realization is not always an indication of quality. E.g., trustworthiness is exemplified by the use of unknown and empty values. Stating that you do not know something is relevant, but the real problem in trustworthiness is evaluating the truth of what is said. Other criteria, such as using labels and vocabulary documentation as a measure of readability, are more straightforward. In general, each criterion could benefit from a more systematic discussion.

The section on Data modelling should be renamed. It is about the use of formal and agreed-upon models and vocabularies as a criterion for reusability and interoperability.

The framework is evaluated by applying it to a selection of collections. No data is provided for the Zeri collection since it was unavailable, and I do not see the purpose of mentioning this collection at all in the paper; alternatively, the authors could have found a different one. The experiments demonstrate the use of the criteria but mainly validate them and do not verify them in terms of purpose and usefulness. This is perhaps one of the biggest issues when evaluating the quality of linked open data. Although this is probably out of scope for this paper, I would have liked to see more discussion of the use of each criterion.

All in all, an interesting paper, and its main strength is that it summarizes and gives an overview of quality criteria for GLAM linked open data, and exemplifies and contextualizes them. However, my general impression is that this is work in progress with many areas for improvement, but surely worthy of publication in the end.

Review #4
Anonymous submitted on 14/Jul/2025
Suggestion:
Major Revision
Review Comment:

The paper presents a framework and implementation to evaluate the quality of LOD repositories published by cultural heritage institutions (GLAM). It introduces a set of 18 quality criteria across 4 dimensions and provides a reproducible infrastructure using Jupyter Notebooks and semantic vocabularies.
While the paper addresses an important and underexplored area in the Semantic Web, the reproducibility and quality of the proposed approach present several major issues, reducing the impact and maturity of the contributions. These include inaccuracies in the related work, insufficient comparison with the state of the art, and limited methodological validation.

This is a relevant and timely contribution to the Semantic Web Journal's scope, particularly with its emphasis on LOD, metadata quality, and reproducible workflows.

Positive Points
1. Reproducibility: The paper provides openly available Jupyter Notebooks and RDF outputs, aligned with Open Science practices.
2. Use of Standard Vocabularies: It models the assessment results using well-known Semantic Web vocabularies (e.g., DCAT, DQV, VoID), promoting interoperability (see the sketch after this list).
3. Focus on GLAM LOD: It addresses a relevant and underexplored niche in the Semantic Web: data quality assessment for LOD in cultural heritage.
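
As an illustration of what point 2 refers to, an assessment result could be encoded roughly as follows. This is a hedged sketch using rdflib with made-up resource and metric names, not the authors' exact modelling:

    # Rough sketch of encoding one assessment result with DQV and DCAT.
    # Resource names and the metric URI are illustrative only.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    DQV = Namespace("http://www.w3.org/ns/dqv#")
    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    EX = Namespace("https://example.org/")  # hypothetical namespace

    g = Graph()
    g.bind("dqv", DQV)
    g.bind("dcat", DCAT)
    g.bind("ex", EX)

    dataset = EX.dataset           # the assessed dataset (illustrative)
    measurement = EX.measurement1  # one measurement result (illustrative)
    metric = EX.licensingMetric    # the metric being measured (illustrative)

    g.add((dataset, RDF.type, DCAT.Dataset))
    g.add((measurement, RDF.type, DQV.QualityMeasurement))
    g.add((measurement, DQV.computedOn, dataset))
    g.add((measurement, DQV.isMeasurementOf, metric))
    g.add((measurement, DQV.value, Literal(1, datatype=XSD.integer)))

    print(g.serialize(format="turtle"))

Publishing such triples alongside the dataset description is what makes the results interoperable.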

However, the submission in its current form is not yet ready for publication due to the following major concerns.

Negative Points
1. Inaccurate Related Work: The description of existing tools (e.g., KGHeartBeat) is misleading, and many key approaches (e.g., Luzzu, RDFUnit) are omitted.
2. Oversimplified Metrics: Most quality criteria are evaluated using binary values (0/1), which do not adequately capture the nuances of aspects like completeness or trustworthiness.
3. Limited Evaluation: The framework is tested on only three similar repositories, without comparisons to existing tools or exploration of diverse LOD sources.

In the following more details are provided.

1. Related Work and Positioning
The description of existing tools is incomplete and in some cases misleading. For example, KGHeartBeat is described as merely a web interface producing tables and charts, whereas it is in fact accompanied by openly available code, metrics, and documentation (consider the resource paper: KGHeartBeat: An Open Source Tool for Periodically Evaluating the Quality of Knowledge Graphs. ISWC (3) 2024).

In addition, relevant frameworks such as Luzzu, RDFUnit, TripleChecker, and LOD Laundromat are not sufficiently discussed. These tools have also tackled the problem of assessing LOD quality using SPARQL-based metrics, ontologies like DQV, and reproducible methods. The paper needs a clear comparative positioning with these approaches, ideally via a structured table mapping dimensions, extensibility, reproducibility, and implementation support.

2. Metric Design and Interpretation
The paper proposes 18 criteria. Why those 18 metrics and not others? The selection choice is not explained. Why are most of these metrics modeled as binary indicators (1/0), which oversimplifies the semantics of metrics such as completeness, interlinking, trustworthiness, or timeliness?
Several metrics are insufficiently justified:
• Trustworthiness is equated to the use of special values like wd:Q4233718, which is not a standard practice and lacks broader justification.
• Timeliness only checks for the presence of a date but does not evaluate its recency or accuracy.
• Conciseness is inferred from duplicate owl:sameAs links, which may not be indicative of redundancy in many real-world modeling practices.
The authors should provide clearer definitions and justification for the metric choices, and possibly consider fuzzy or graded scoring for metrics that are not well captured by binary values (a sketch of what graded scoring could look like follows below).
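
To illustrate the graded-scoring suggestion with the timeliness example above, a check could scale the score by the recency of a dcterms:modified date rather than its mere presence. This is purely illustrative; the thresholds are arbitrary and not a proposal from the paper:

    # Illustrative graded timeliness score: instead of the binary
    # "a dcterms:modified date exists", scale the score by how recent it is.
    # Thresholds are arbitrary examples.
    from datetime import date
    from typing import Optional


    def timeliness_score(modified: date, today: Optional[date] = None) -> float:
        """1.0 for data updated within a year, decreasing linearly to 0.0
        for data not updated in five years or more."""
        today = today or date.today()
        age_years = (today - modified).days / 365.25
        if age_years <= 1:
            return 1.0
        if age_years >= 5:
            return 0.0
        return round(1.0 - (age_years - 1) / 4, 3)


    print(timeliness_score(date(2024, 6, 1), today=date(2025, 6, 1)))  # 1.0 (within one year)
    print(timeliness_score(date(2022, 6, 1), today=date(2025, 6, 1)))  # 0.5 (three years old)
    print(timeliness_score(date(2010, 1, 1), today=date(2025, 6, 1)))  # 0.0 (older than five years)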

3. Evaluation Scope and Generalizability
The evaluation includes only three LOD repositories (BNE, BnF, LDFI), all in similar European library contexts. This limited scope weakens the generalizability of the conclusions.
Moreover, the exclusion of the Zeri dataset due to server issues highlights the fragility of the proposed infrastructure in practical settings. A more robust setup (e.g., via local triple stores or Docker-based containers) should be considered.
No comparisons are provided against existing tools (e.g., KGHeartBeat), which would be necessary to demonstrate added value or novelty.
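
On the robustness point above, one possible mitigation, sketched here under the assumption that a dump of the affected dataset is available (the file name is a placeholder), is to run the same SPARQL checks against a locally loaded copy instead of a live endpoint:

    # Sketch: load a local dump and run the same ASK-style check against it,
    # so an unavailable server does not block the assessment.
    # The file name is a placeholder, not a real dump shipped with the paper.
    from rdflib import Graph

    g = Graph()
    g.parse("local_dump.ttl", format="turtle")  # hypothetical local copy

    ASK_LICENSE = """
    PREFIX dcterms: <http://purl.org/dc/terms/>
    ASK { ?s dcterms:license ?license . }
    """

    result = g.query(ASK_LICENSE)
    print("license information present:", result.askAnswer)

A Docker-based triple store would serve the same purpose for larger dumps.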

4. Reproducibility and Implementation
The use of Jupyter Notebooks and semantic vocabularies (e.g., DCAT, DQV) for encoding the results is an excellent step toward reproducible assessment. However, there is no discussion of:
• How non-technical users (e.g., curators) can use or extend this framework.
• How the framework scales with large datasets or complex ontologies.
• Whether the notebook includes user-friendly abstractions or only raw SPARQL code (a sketch of what such an abstraction could look like is given below).
A user scenario or workflow diagram would help convey the applicability and usability of the system beyond technical audiences.
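
As an example of the kind of user-friendly abstraction the third bullet refers to, a curator could call named checks instead of writing SPARQL. This is my own sketch with a placeholder endpoint and placeholder queries, not the authors' API:

    # Sketch of a "user-friendly abstraction": named quality checks wrap the
    # raw SPARQL, so a curator only supplies an endpoint and a check name.
    # Endpoint URL and queries are placeholders.
    from SPARQLWrapper import SPARQLWrapper, JSON

    CHECKS = {
        "has_license": """
            PREFIX dcterms: <http://purl.org/dc/terms/>
            ASK { ?s dcterms:license ?o . }
        """,
        "has_labels": """
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            ASK { ?s rdfs:label ?o . }
        """,
    }


    def run_check(endpoint: str, name: str) -> bool:
        """Run a named quality check against a SPARQL endpoint."""
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(CHECKS[name])
        sparql.setReturnFormat(JSON)
        return sparql.query().convert()["boolean"]


    print(run_check("https://example.org/sparql", "has_license"))  # hypothetical endpoint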

5. Presentation and Clarity
The paper is well structured, but the long exposition of metrics could be significantly improved with summarized tables and examples. Figure 1 (data modeling) is useful, but would benefit from a concrete walkthrough.
The GitHub project is a welcome addition, but screenshots or a UI overview would help assess the practical value.