Review Comment:
The paper presents a framework and implementation to evaluate the quality of LOD repositories published by cultural heritage institutions (GLAM). It introduces a set of 18 quality criteria across 4 dimensions and provides a reproducible infrastructure using Jupyter Notebooks and semantic vocabularies.
While the paper addresses an important and underexplored area of the Semantic Web, the proposed approach suffers from several major issues regarding reproducibility and quality, which reduce the impact and maturity of the contributions. These include inaccuracies in the related work, insufficient comparison with the state of the art, and limited methodological validation.
This is a relevant and timely contribution to the Semantic Web Journal's scope, particularly with its emphasis on LOD, metadata quality, and reproducible workflows.
Positive Points
1. Reproducibility: The paper provides openly available Jupyter Notebooks and RDF outputs, aligned with Open Science practices.
2. Use of Standard Vocabularies: It models the assessment results using well-known Semantic Web vocabularies (e.g., DCAT, DQV, VoID), promoting interoperability.
3. Focus on GLAM LOD: It addresses a relevant and underexplored niche in the Semantic Web: data quality assessment for LOD in cultural heritage.
However, the submission in its current form is not yet ready for publication due to the following major concerns.
Negative Points
1. Inaccurate Related Work: The description of existing tools (e.g., KGHeartBeat) is misleading, and many key approaches (e.g., Luzzu, RDFUnit) are omitted.
2. Oversimplified Metrics: Most quality criteria are evaluated using binary values (0/1), which do not adequately capture the nuances of aspects like completeness or trustworthiness.
3. Limited Evaluation: The framework is tested on only three similar repositories, without comparisons to existing tools or exploration of diverse LOD sources.
More detailed comments are provided below.
1. Related Work and Positioning
The description of existing tools is incomplete and in some cases misleading. For example, KGHeartBeat is described as merely a web interface producing tables and charts, whereas it is in fact accompanied by openly available code, metrics, and documentation (see the resource paper "KGHeartBeat: An Open Source Tool for Periodically Evaluating the Quality of Knowledge Graphs", ISWC (3) 2024).
In addition, relevant frameworks such as Luzzu, RDFUnit, TripleChecker, and LOD Laundromat are not sufficiently discussed. These tools have also tackled the problem of assessing LOD quality using SPARQL-based metrics, ontologies like DQV, and reproducible methods. The paper needs a clear comparative positioning with these approaches, ideally via a structured table mapping dimensions, extensibility, reproducibility, and implementation support.
2. Metric Design and Interpretation
The paper proposes 18 criteria, but the selection is not justified: why these 18 metrics and not others? In addition, most of the metrics are modeled as binary indicators (1/0), which oversimplifies the semantics of criteria such as completeness, interlinking, trustworthiness, or timeliness; this design choice is not motivated either.
Several metrics are insufficiently justified:
• Trustworthiness is equated to the use of special values like wd:Q4233718, which is not a standard practice and lacks broader justification.
• Timeliness only checks for the presence of a date but does not evaluate its recency or accuracy.
• Conciseness is inferred from duplicate owl:sameAs links, which may not be indicative of redundancy in many real-world modeling practices.
The authors should provide clearer definitions, justify the metric choices, and possibly consider fuzzy or graded scoring for metrics that are not well captured by binary values.
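As an illustration of what graded scoring could look like, timeliness could weight the age of the last modification date rather than only checking its presence. The following is a minimal sketch in Python (the function name, the exponential-decay model, and the default half-life are my own assumptions, not the authors' implementation):

```python
from datetime import datetime, timedelta, timezone

def timeliness_score(modified_iso: str, half_life_days: float = 365.0) -> float:
    """Graded timeliness: 1.0 for data modified today, decaying towards 0
    as the last modification date ages (exponential decay with the given
    half-life). Returns a value in (0, 1]."""
    modified = datetime.fromisoformat(modified_iso).replace(tzinfo=timezone.utc)
    age_days = max((datetime.now(timezone.utc) - modified).days, 0)
    return 0.5 ** (age_days / half_life_days)

# A dataset last updated two half-lives ago scores about 0.25.
two_years_ago = (datetime.now(timezone.utc) - timedelta(days=730)).date().isoformat()
print(round(timeliness_score(two_years_ago), 2))
```

A similar graded treatment could be applied to completeness (share of populated properties) or interlinking (proportion of resources with external links).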
3. Evaluation Scope and Generalizability
The evaluation includes only three LOD repositories (BNE, BnF, LDFI), all in similar European library contexts. This limited scope weakens the generalizability of the conclusions.
Moreover, the exclusion of the Zeri dataset due to server issues highlights the fragility of the proposed infrastructure in practical settings. A more robust setup (e.g., via local triple stores or Docker-based containers) should be considered.
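To make concrete the kind of robustness meant here, the assessment could fall back on a locally loaded dump when a remote endpoint is unavailable. A minimal sketch with rdflib follows (the file path and query are placeholders, not part of the reviewed framework):

```python
from rdflib import Graph

# Load a locally downloaded dump so the assessment does not depend on
# the availability of the remote SPARQL endpoint.
g = Graph()
g.parse("zeri_dump.ttl", format="turtle")  # placeholder path

# The same SPARQL-based metrics can then be run against the local graph.
query = """
SELECT (COUNT(DISTINCT ?s) AS ?subjects)
WHERE { ?s ?p ?o }
"""
for row in g.query(query):
    print(f"Distinct subjects: {row.subjects}")
```

A Docker-based local triple store (e.g., a containerized Fuseki or Virtuoso instance loaded with the dumps) would serve the same purpose for larger datasets.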
No comparisons are provided against existing tools (e.g., KGHeartBeat), which would be necessary to demonstrate added value or novelty.
4. Reproducibility and Implementation
The use of Jupyter Notebooks and semantic vocabularies (e.g., DCAT, DQV) for encoding the results is an excellent step toward reproducible assessment. However, there is no discussion of:
• How non-technical users (e.g., curators) can use or extend this framework.
• How the framework scales with large datasets or complex ontologies.
• Whether the notebook includes user-friendly abstractions or only raw SPARQL code (a possible abstraction is sketched after this section).
A user scenario or workflow diagram would help convey the applicability and usability of the system beyond technical audiences.
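For instance, the notebooks could wrap raw SPARQL behind small, documented functions so that a curator only selects an endpoint and a criterion. The following hypothetical sketch illustrates the idea (function name, query, and endpoint URL are illustrative and not taken from the paper):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def has_license(endpoint_url: str) -> int:
    """Illustrative binary criterion: returns 1 if any dataset description
    at the endpoint declares a dcterms:license, else 0."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dcterms: <http://purl.org/dc/terms/>
        ASK { ?dataset dcterms:license ?license }
    """)
    return int(sparql.query().convert()["boolean"])

# A curator would only need to call, e.g.:
# has_license("https://data.bnf.fr/sparql")  # endpoint URL is illustrative
```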
5. Presentation and Clarity
The paper is well structured, but the long exposition of metrics could be significantly improved with summarized tables and examples. Figure 1 (data modeling) is useful, but would benefit from a concrete walkthrough.
The GitHub project is a welcome addition, but screenshots or a UI overview would help assess the practical value.