Review Comment:
The paper contributes an empirical study of the descriptive metadata of LOD datasets, as found in the LOD Cloud and Annohub repositories, focusing on linguistics datasets. In fact, the study is largely based on an enriched repository, which combines and enriches the information found in the two source metadata repositories.
The empirical study is certainly inspired by similar efforts concerned with the whole LOD cloud, and for this reason the study can be considered incremental research; nonetheless, the focus on the Linguistics Linked Open Data sub-cloud improves the originality of the research and makes the findings significant to the community around the LLOD cloud.
The paper cites numerous papers relevant to the proposed empirical study; however, the authors should put some effort into better explaining how the paper is positioned with respect to related work, e.g. whether the paper confirms the findings of related work, falsifies them, or adds new (perhaps complementary) results.
From a methodological viewpoint, the paper should clarify right from the beginning of Section 4 the role of the enriched metadata. Indeed, in the subsequent sections it becomes clear that the authors compute some statistics on the original metadata and others on the enriched metadata. A related concern is whether some inconsistencies (e.g. different names for the same language, linguistic vs linguistics) were fixed in the “original” metadata or just in the “enriched” metadata. Overall, I would like a better explanation of the methodology, possibly accompanied by a figure depicting it.
The paper should include concrete examples of metadata for all repositories, in order to make the discourse more grounded. Additionally, the alignment of the two metadata repositories should be given in a more explicit manner (e.g. a table).
It is unclear to me how languages have been represented, e.g. by name, by a language code in some standard, or by a language URI (minted in some dataset, such as Lexvo). The authors should clarify (they could look at this message on the OntoLex mailing list for interesting insights: https://lists.w3.org/Archives/Public/public-ontolex/2020Apr/0012.html).
Furthermore, it seems that accessibility has been reported as a comment, which may not be considered a self-describing, semantically clear approach.
I am unsure whether the enriched metadata repository is valuable as a standalone catalog or is just an artifact instrumental to the quantitative study. Furthermore, the paper does not provide a link to access, download or otherwise evaluate this repository. Moreover, I do not understand whether the authors have a plan to maintain/update the enriched metadata as the source repositories are updated.
The paper does not discuss the automatic enrichment process in sufficient detail, as explained later in the review.
The following major concerns, organized per section, should also be addressed by the authors.
* Introduction *
I am concerned about the absence of any citation to the FAIR principles:
(web site, the most up to date)
https://www.go-fair.org/fair-principles/
(scientific paper)
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., ... & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific data, 3(1), 1-9.
* Background *
As the authors are addressing metadata about language resources published as Linked Data, I think that it would be appropriate to include approaches for the description of linked datasets, such as VoID, VOAF, DCAT, DataID, HCLS, LIME. The latter is a module of OntoLex-Lemon, a release-candidate version of which has been described in a research paper:
Fiorelli M., Stellato A., McCrae J.P., Cimiano P., Pazienza M.T. (2015) LIME: The Metadata Module for OntoLex. In: Gandon F., Sabou M., Sack H., d’Amato C., Cudré-Mauroux P., Zimmermann A. (eds) The Semantic Web. Latest Advances and New Domains. ESWC 2015. Lecture Notes in Computer Science, vol 9088. Springer, Cham. https://doi.org/10.1007/978-3-319-18818-8_20
The authors also forgot to mention the metadata vocabularies adopted in the repositories they are going to use. In fact, these repositories use metadata profiles combining several metadata vocabularies. For example, Annohub reuses Dublin Core and DCAT, among others.
The authors should look at and cite the following work, which describes DCAT, DCTERMS and META-SHARE OWL, and explicitly take its findings into consideration:
Cimiano P., Chiarcos C., McCrae J.P., Gracia J. (2020) Modelling Metadata of Language Resources. In: Linguistic Linked Data. Springer, Cham. https://doi.org/10.1007/978-3-030-30225-2_7
The authors should differentiate between the original XML-based schema and the new META-SHARE ontology. The following work describes the effort for developing the META-SHARE ontology:
McCrae J.P., Labropoulou P., Gracia J., Villegas M., Rodríguez-Doncel V., Cimiano P. (2015) One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web. In: Gandon F., Guéret C., Villata S., Breslin J., Faron-Zucker C., Zimmermann A. (eds) The Semantic Web: ESWC 2015 Satellite Events. ESWC 2015. Lecture Notes in Computer Science, vol 9341. Springer, Cham. https://doi.org/10.1007/978-3-319-25639-9_42
The authors should also consider whether it is appropriate to consider and cite the following references:
Cimiano P., Chiarcos C., McCrae J.P., Gracia J. (2020) Discovery of Language Resources. In: Linguistic Linked Data. Springer, Cham. https://doi.org/10.1007/978-3-030-30225-2_14
Chapman, A., Simperl, E., Koesten, L. et al. Dataset search: a survey. The VLDB Journal 29, 251–272 (2020). https://doi.org/10.1007/s00778-019-00564-x
Ben Ellefi, M., Bellahsene, Z., Breslin, J. G., Demidova, E., Dietze, S., Szymański, J., & Todorov, K. (2018). RDF dataset profiling – a survey of features, methods, vocabularies and applications. Semantic Web, 9(5), 677-705. https://content.iospress.com/articles/semantic-web/sw294
Jonquet, C., Toulet, A., Dutta, B. et al. Harnessing the Power of Unified Metadata in an Ontology Repository: The Case of AgroPortal. J Data Semant 7, 191–221 (2018). https://doi.org/10.1007/s13740-018-0091-5
Vandenbussche, P. Y., Umbrich, J., Matteis, L., Hogan, A., & Buil-Aranda, C. (2017). SPARQLES: Monitoring public SPARQL endpoints. Semantic Web, 8(6), 1049-1065. https://content.iospress.com/articles/semantic-web/sw254
Ermilov I., Lehmann J., Martin M., Auer S. (2016) LODStats: The Data Web Census Dataset. In: Groth P. et al. (eds) The Semantic Web – ISWC 2016. ISWC 2016. Lecture Notes in Computer Science, vol 9982. Springer, Cham. https://doi.org/10.1007/978-3-319-46547-0_5
* Repositories *
From a reproducibility point of view, I would like to know when the repositories were downloaded and to have (if available from the repositories or republished by the authors) a persistent link to the dumped data.
I would like to see example metadata about datasets in each repository.
* Methodology *
As anticipated, I would like to see a diagram/picture summarizing the methodology.
* Metadata Alignment/Mapping *
As anticipated, I miss a table or other structured means to indicate the mapping of the original metadata repositories to this mediated schema. That would also help understand the coverage of the original metadata repositories.
* Metadata Enrichment *
As anticipated, I am concerned about the automatic extraction process, which is described in a very implicit manner. Regarding the software implementation, I would like to know the following:
- license (is it open source?)
- availability (is it downloadable, reusable, free-of-charge?)
- architectural/implementation details
- more details on the supported extraction techniques
I do not like the extensive use of the modal "could", since it is unclear whether the authors have really implemented the software or are describing extraction strategies that they could implement (in the future).
Another big concern is the enriched metadata resource, which is not publicly available at the time of review, and which consequently has been impossible to review carefully.
* Metadata Overview *
It is unclear to me how the number of distinct datasets can be 1908 if the LOD Cloud and Annohub contain 1440 and 530 datasets, respectively, and 69 datasets are shared between them.
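For clarity, this is the arithmetic behind my expectation (by inclusion-exclusion, counting the 69 shared datasets only once):

$|\text{LOD Cloud} \cup \text{Annohub}| = 1440 + 530 - 69 = 1901 \neq 1908$

If the shared count of 69 refers to something else, or deduplication was done differently, the authors should explain where the additional datasets come from.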
* LCRSubclass *
The authors seem not to use the members of LCRSubclass defined by META-SHARE, but instead they use new categories borrowed from the LLOD cloud diagram.
Why does Table 5 only report datasets in the enriched repository? Perhaps because these categories are not present in the original repositories.
The enriched model seems to lose the information on the annotation model found in Annohub.
* Resource Accessibility *
The authors claim that Annohub datasets are accessible because their accessibility was checked in Spring 2019. In my opinion, this is too long ago. Moreover, they did not say whether this check was done by them (for this paper). Concerning the LOD Cloud, the authors did not say when they checked the availability of the listed datasets.
The authors say that “There exist two ways to consume LLOD data”: in fact, there is another one, namely HTTP resolution of the resource URI (remember that the second Linked Data rule requires the use of HTTP(S) identifiers for resources).
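To make this third consumption mode concrete, here is a minimal sketch (assuming Python with the requests library; the resource URI is a hypothetical placeholder, not taken from the paper):

```python
import requests

# Hypothetical resource URI, used only for illustration.
resource_uri = "http://example.org/llod/lexicalEntry_42"

# The second Linked Data rule requires HTTP(S) URIs as identifiers,
# so the identifier itself can be dereferenced. Content negotiation
# asks the server for an RDF serialization of the description.
response = requests.get(
    resource_uri,
    headers={"Accept": "text/turtle"},
    allow_redirects=True,
)

print(response.status_code)                  # 200 if the URI resolves
print(response.headers.get("Content-Type"))  # ideally an RDF media type
print(response.text[:300])                   # beginning of the returned description
```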
Table 7 does not list Annohub. Indeed, Annohub at least provides links to the downloadable dumps. Additionally, the authors should say whether HTTP resolution works.
I am concerned about how the numbers at the end of the section (e.g. “Only 21% of linguistics…”) have been derived from the available tables. Please clarify.
Minor concerns:
P1, 26: “On the other hand, …” The use of this conjunctive adverb seems incorrect to me, since this sentence somehow confirms what is stated in the previous one.
P1, 30: “… alignment to a descriptive schema …” The authors could say explicitly that they used the META-SHARE ontology
P1, 41-42: “to develop standards and metadata suitable to…” the authors should phrase this better, as it is unclear whether they are talking about standards for representation and for metadata, or about standards (for something) and actual metadata about concrete resources. In the rest of the paper, the term “metadata” is often used to mean “metadata vocabularies/properties”, while many would interpret it as “description of an actual resource”.
P2, 1: “Semantic Web [3]” The authors should expand the citations: i) add the canonical citation about the Semantic Web [Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34-43], and ii) cite OntoLex-Lemon using the already cited paper and the URI of its specification [https://www.w3.org/2016/05/ontolex/]. Maybe spend a few words saying that it is the result of a community group [https://www.w3.org/community/ontolex/] that sought consensus and agreement on a shared model.
P2, 3: The right citation for Linked Data is this: https://www.w3.org/DesignIssues/LinkedData.html
Additionally, the “principles” of LD are usually termed “rules”.
P2, 9: I think that it could be appropriate to also cite: Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
P2, 26-27: “provide an analysis of the current status of linguistics resources” the authors should perhaps proofread their work and see whether linguistic or linguistics should be used (consistently) in this and other contexts.
P3, 21-22: “…with the aim to produce a web application to make such data querable..” I do not think that the goal of Chiarcos et al. is to produce a web application.
P3, 27-29: “(i.e., the variations of language encoding standards and the lack of common metadata schemas for LD),”, Looking at the cited paper, I think that more than the lack of common metadata schemas, Annohub attempts to precisely describe the language and the language annotations (e.g. tagsets used in a resource, etc.). The authors should revise.
P3, 37-38: “An attempt to model linguistic LD datasets is the one by Bosque et al.”, I think that more than "an attempt to model linguistic LD datasets", the cited paper is a survey of models.
P4, 45: “The LOD Cloud is a diagram that offers…” I am unsure whether it is appropriate to call the LOD Cloud a diagram, since the authors have referred to it as a metadata repository.
P5, 31-33: “(e.g., thesauri from tourism or life sciences, such as EARTh – the Environmental Applications Reference Thesaurus [21]);” speaking of thesauri, I suggest that the authors look at “The AGROVOC Linked Dataset”: Caracciolo, C., Stellato, A., Morshed, A., Johannsen, G., Rajbhandari, S., Jaques, Y., & Keizer, J. (2013). The AGROVOC linked dataset. Semantic Web, 4(3), 341-348.
Indeed, AGROVOC is part of the LLOD cloud and provides machine-readable metadata in the form of a VoID description (http://aims.fao.org/aos/agrovoc/void.ttl#Agrovoc), which also contains LIME metadata allowing for a richer linguistic description. Following the recipes contained in the VoID specification, metadata can be found easily, as each resource (e.g. http://aims.fao.org/aos/agrovoc/c_1071) is linked (through the property void:inDataset) to the dataset description. Tools may benefit from this mechanism to automatically find a dataset description once any of its resources has been reached (see the sketch after the VocBench references below). One such tool is VocBench 3:
(scientific paper)
Stellato, A., Fiorelli, M., Turbati, A., Lorenzetti, T., van Gemert, W., Dechandon, D., ... & Keizer, J. (2020). VocBench 3: A collaborative Semantic Web editor for ontologies, thesauri and lexicons. Semantic Web, 11(5), 855-881. https://doi.org/10.3233/SW-200370
(web site)
http://vocbench.uniroma2.it/
(relevant documentation)
http://vocbench.uniroma2.it/doc/user/mdr.jsf
http://vocbench.uniroma2.it/doc/user/data_view.jsf#status_button
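As anticipated above, here is a minimal sketch of the void:inDataset lookup mechanism (assuming Python with rdflib and network access; the URIs are the AGROVOC ones mentioned above):

```python
from rdflib import Graph, Namespace, URIRef

VOID = Namespace("http://rdfs.org/ns/void#")

# An arbitrary resource of the dataset, e.g. the AGROVOC concept cited above.
resource = URIRef("http://aims.fao.org/aos/agrovoc/c_1071")

# Dereference the resource URI; rdflib negotiates an RDF serialization.
resource_graph = Graph()
resource_graph.parse(str(resource))

# Follow void:inDataset from the resource to the dataset description,
# then dereference that description to obtain the VoID (and LIME) metadata.
for dataset in resource_graph.objects(resource, VOID.inDataset):
    metadata = Graph()
    metadata.parse(str(dataset))
    print(dataset, "-", len(metadata), "metadata triples retrieved")
```

This is only meant to illustrate that, following the recipe above, a tool can reach the dataset metadata starting from any of its resources.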
P5, 26-31: “The main reason for choosing these repositories relies in the type of information they encompass. The LOD Cloud [..] about annotated linguistic resources from reliable sources.” This paragraph at the end of section 3.2 (dedicated to Annohub) should be moved elsewhere, perhaps at the beginning of section 3.
P7, 34-35: “description to report a short free-text account”: missing bullet
P7, 45-46: “AccessLocation A URL to the SPARQL endopoint;” Looking at the documentation (http://www.meta-share.org/ontologies/meta-share/meta-share-ontology.owl/...), it seems to me that this bullet describes the specific use that the authors make of this property.
P8, 47: “from Annohub (Original)” I think that the attribute “original” is not necessary here
P8, Table 1: the caption only cites the LOD metadata (again, “original” may be unnecessary) and forgets Annohub. Furthermore, the second column should be named LOD Cloud rather than LOD.
P9, 1 “BioPortal”, add a citation to a scientific paper. Try searching for a reference citation on the project web site.
P9, 5-6 “federal or local governments”, I am fairly sure that this expression has been borrowed from some publication about data initiatives in the USA (which is a federal state).
P9, 11 “Prominent datasets…” add links and references to the mentioned datasets
P10, Table 2, The second column should be named “LOD Cloud”
P10, 43-45, “Apart from Swedish …. more than 100 linguistics datasets” The authors may consider adding a table or a chart representing the statistics/aggregations presented in this paragraph.
P13, 34 “Jena”, Jena is in fact a framework, providing, among other things, the TDB triple store and the Fuseki SPARQL server. I would add a reference to the other important RDF framework for Java, Eclipse RDF4J (https://rdf4j.org/), formerly OpenRDF Sesame (https://link.springer.com/chapter/10.1007/3-540-48005-6_7).
P14, 21, “In the metadata of the original LOD”, this should be rephrased to state more clearly that the authors are talking about the metadata provided by the LOD Cloud (if I am right).