Methodologies for publishing linked open government data on the web: a systematic mapping and a unified process model

Tracking #: 2695-3909

Authors: 
Bruno Elias Penteado
José Carlos Maldonado
Seiji Isotani

Responsible editor: 
Jens Lehmann

Submission type: 
Survey Article
Abstract: 
Since the beginning of the release of open data by many countries, different methodologies for publishing linked data have been proposed. However, they seem to not be adopted by early studies exploring linked data, for different reasons. In this work, we conducted a systematic mapping in the literature to synthesize the different approaches around the following topics: common steps, associated tools and practices, quality assessment validations, and evaluation of the methodology. The findings show a core set of activities, based on the linked data principles, but with very important additional steps for practical use in scale. Although a fair amount of quality issues are reported in the literature, very few of these methodologies embed validation steps in their process. We describe an integrated overview of the different activities and how they can be executed with appropriate tools. We also present research challenges that need to be addressed in future works in this area.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Erwin Folmer submitted on 09/Apr/2021
Suggestion:
Minor Revision
Review Comment:

This paper discusses the definition of an approach to the publication of linked open government data, focusing on the (business) adoption of linked data in practice; a topic that has been largely neglected in the academic literature, where the emphasis has been on the technical development of linked data rather than on the adoption process in practice. The paper is original and well written and can contribute to the understanding of the adoption process of linked data and, by introducing a methodology, might also contribute to improved adoption in the future. The structured literature review is an appropriate research methodology and is properly described.

However, I do see two main areas of improvement:

Firstly, I miss references to papers that I would have expected to appear. In a way, this is strange because the methodology described here is sound. I have tried to repeat the approach but I still cannot see an explanation as to why certain studies have not been referenced. In particular, I think references relating to Spatial Linked Data are missing. Spatial Data is, in many cases, government data and is often open data. Indeed, a lot of spatial data has been published as linked data and is also available in the LOD cloud (see: https://www.w3.org/TR/sdw-bp/). None, or at least very few, of the references deal with this kind of government linked open data. This should be addressed, and so I provide the following:

A yearly conference is held on this topic; see the report from the 2020 edition of this conference:
Bucher, Bénédicte, et al. "Spatial Linked Data in Europe: Report from Spatial Linked Data Session at Knowledge Graph in Action." (2021).

These papers contain an overview of spatial (Linked Open Government) Data implementations across Europe:
Bucher, Bénédicte, et al. "Conciliating perspectives from mapping agencies and web of data on successful European SDIs: Toward a European geographic knowledge graph." ISPRS international journal of geo-information 9.2 (2020): 62.

Ronzhin, Stanislav, et al. "Next generation of spatial data infrastructure: lessons from linked data implementations across Europe." International journal of spatial data infrastructures research 14 (2019): 83-107.

Other existing papers include case studies in Ireland (Debruyne):
Debruyne C. et al. (2017) Ireland's Authoritative Geospatial Linked Data. In: d'Amato C. et al. (eds) The Semantic Web – ISWC 2017. ISWC 2017. Lecture Notes in Computer Science, vol 10588. Springer, Cham. https://doi.org/10.1007/978-3-319-68204-4_6

Belgium (Buyle):
Raf Buyle, Ziggy Vanlishout, Serena Coetzee, Dieter De Paepe, Mathias Van Compernolle, Geert Thijs, Bert Van Nuffelen, Laurens De Vocht, Peter Mechant, Björn De Vidts, Erik Mannens. Raising interoperability among base registries: The evolution of the Linked Base Registry for addresses in Flanders. Journal of Web Semantics, Volume 55, 2019, Pages 86-101, ISSN 1570-8268, https://doi.org/10.1016/j.websem.2018.10.003.

and my own organization in the Netherlands:
E. Folmer, S. Ronzhin, J. Van Hillegersberg, W. Beek and R. Lemmens, "Business Rationale for Linked Data at Governments: A Case Study at the Netherlands’ Kadaster Data Platform," in IEEE Access, vol. 8, pp. 70822-70835, 2020, doi: 10.1109/ACCESS.2020.2984691.

Ronzhin, S.; Folmer, E.; Maria, P.; Brattinga, M.; Beek, W.; Lemmens, R.; van’t Veer, R. Kadaster Knowledge Graph: Beyond the Fifth Star of Open Data. Information 2019, 10, 310. https://doi.org/10.3390/info10100310

These are but a small selection of papers in the spatial linked (open government) data domain, and I have the feeling that, if these are missing, you may have missed some in other domains too (such as health care, for example).

Secondly, I think your paper may miss a critical reflection on:

1. The results:
The results are sometimes very surprising. For example, quite commonly used triplestores (such as GraphDB and Stardog) are not mentioned at all (Figure 4). What does this mean in practice, then? You might conclude that these stores are not used in government linked open data projects; however, this seems unlikely. So I am assuming that there might be something strange about your sample, which perhaps is also a result of missing references.

As such, I would have liked to have seen some more statistics about the projects in the papers. How are they geographically distributed? How many of them are academic/pilot projects? How many are maintained as production data? How many triples (size of datasets)? And so on. This would also help me get a better understanding of the cases that are included in your sample.

2. The approach:
A key addition should be a critical reflection on how the approach could have influenced your results and whether another approach could have led to different results. For example, you mostly selected conference papers or book chapters. Would the results have been the same if you had only selected journal papers?

Other more detailed issues/reflections that could improve the overall content of the paper:

- It is not clear to me why there is a focus on “open” government data. Why would this approach not be valid for other types of government data?
- English/writing style: In particular, the abstract, discussion and conclusions need improvement. (For example: abstract line 18 “seem to not be adopted”, line 21 “very important”; the first sentence of the discussion section; the first sentence of the conclusions section.)
- P1 – line 45: “with this increase”. I have not read about an increase; what are you referring to here?
- P2 – line 7: “almost 200” worldwide linked government datasets? I would argue that this is almost nothing, so do you think the LOD cloud is complete? In the Netherlands alone, we have about 10,000 open government datasets (in the metadata registry).
- P2 – Figure 1: this figure with 2018 as latest year is quite outdated.
- P3: the 4 (5) star classification is used in the text before it is described on this page.
- Table 2: for readability, I would have preferred to read the whole title of each paper and maybe some more metadata, such as domain, dataset name…production.
- P5 – line 16: why computer science? Management (or social) sciences would also be very appropriate.
- Fig 4: a very informative table; however, it also gives me the impression that many papers deal with academic experiments.
- Fig 5: the column on the right-hand side is vague.
- Based on the text I would have expected validation to be part of the methodology.
- First sentence of the conclusion: “Publishing LOGD is a very complex task”. This is out of context and creates more questions. On what basis do you draw this conclusion? Did you study this? Is this a problem in the field? And, based on your work, is it now simplified?

Review #2
By Fathoni A. Musyaffa submitted on 05/May/2021
Suggestion:
Accept
Review Comment:

Decision: Accept/Minor revision (?)

This paper presents a survey regarding a) steps typically involved in Linked Open Government Data (LOGD), b) different tools developed to support activities within the LOGD life cycle, and c) methodological evaluations of LOGD. It also proposes and elaborates the unified process model for publishing LOGD.

The paper is suitable as an introduction to the linked open data topic, especially for public institutions and early researchers in the domain. Despite its governmental coverage, the proposed process model can potentially be adapted for publishing linked open data in other domains (e.g., life science). It is, however, strictly focused on the surveyed research papers. As these surveyed papers do not discuss detailed topics regarding, e.g., ontologies within different LOGD domains, this paper does not cover that topic either.

In case it is finally accepted, before publishing, there are several things to consider:

1. Some sentences are rather intricate and difficult to understand, sometimes due to their logical structure and sometimes due to their lengthy compound formulation. The following are some examples of the affected sentence formulations:
"However, none of the papers, since 2016, cited a different reason for not existing a large scale production of LOGD."
"In [24] it is indicated, based on interaction with practitioners, that literature on publishing Linked Open Government Data (LOGD) has dealt with less complex, non-operational datasets and needs an engineering point of view, the identification of practical challenges and consider the organizational limitations."
"As an effect, although there exist many guidelines for publishing linked data on the Web, many producers do not have sufficient knowledge of these practices, having few studies detailing the whole process, leaving out the methods, tools, and procedures used [23], and proposing ad-hoc methods to produce linked open data, usually based only on the 4 principles with different interpretations on how to implement them."
"A look into the data.gov portal (from the US, with different national levels), shows that there is around 2.5% of datasets in RDF format 1, not explicit if they are in the 4th or 5th level, according to Berners-Lee’s classification[6]."
"For the design of URIs no tools were used, but guidelines, especially the Cool URI guideline13, which recommends practices on how to model instances using HTTP URIs."
"These non-functional requirements were diverse, comprising data quality aspects, compliance, accessibility, internationalization, caching management, security; however, none of the works described in detail how to deal with them using a systematic approach."
"We must emphasize that in this work we focus on the actual validation, as explicit in the papers."
"W21 employed two validations in their methodology: in the data clean-up phase, to check for RDF, accessibility, vocabulary, and data types mistakes or errors; and in the final of the linking phase, in which domain experts should revise the automatic links created with tools like SILK or LIMES."
"The task that requires most effort is arguably modeling the data, either by carefully selecting existing and validated vocabularies or by creating new ones, for each of the datasets and their distributions along time."

2. Grammatical/spelling errors:
“As different data sources my expose the same information in different representations, there is a need for a consensus on how to represent this data.” -- "may" instead of "my".
"...usually based only on the 4 principles...", "identified 3 recurrent problems by surveying LOD papers" - - generally small numbers are written as a textual numeral ("four" instead of "4").
"datasets in RDF format 1," -- superfluous space between the text and the footnote notation.
"the existence of inadequate2, links in the published dataset" -- superfluous comma.
"tools and guidelines for the definition of non-functional." -- a noun is needed, perhaps "requirements"?
"Adapted from [85]" -- missing period.
"to the steps previously identified11" -- missing period.
"(numbers taken from http://lod-cloud.net.)" -- missing period.
"in public administration each" -- missing comma.
"Also, verify that the complementary presentations are in place and working as intended." -- this looks more like an imperative sentence instead of a continuation from the previous sentence.
"and DataGraft 21" -- superfluous space.
"SEMIC) pilots 24" - superfluous space.

3. Minor improvement:
The paragraph "The great advantage of linked data is to reuse data from external data sources, to discover additional..." does not explicitly quote "link to other sources". This is in contrast to other paragraphs, which mention/quote other steps formally emphasized in italics.
The abbreviation of DWBP is used but never introduced in the manuscript.

4. (Important!) Presentation of Figures 3 and 4:
These figures are actually tables but are referred to as figures. The current layout can be kept as it is, but consider using text-based PDF tables instead of attaching these as PDF images, for two reasons: 1) searchability in PDF readers and 2) maintaining the print quality. Figure 4 is especially blurry when it is printed. If LaTeX is used, a PDF export from spreadsheet programs can be imported into LaTeX so that the table keeps its textual form instead of an image-based format.
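To illustrate what I mean, a minimal LaTeX sketch of both options (assuming the authors use LaTeX; the file name, caption, package choices, and table rows are only placeholders, not taken from the paper):

    % Option A: include a text-preserving PDF export of the spreadsheet table
    % (requires \usepackage{graphicx}; the text stays selectable if the exported PDF is vector-based)
    \begin{figure}[t]
      \centering
      \includegraphics[width=\linewidth]{figure4-steps-table.pdf}
      \caption{Steps identified per selected study.}
    \end{figure}

    % Option B: recreate it as a native LaTeX table (searchable and crisp in print;
    % requires \usepackage{booktabs}; the rows below are placeholders)
    \begin{table}[t]
      \caption{Steps identified per selected study.}
      \centering
      \begin{tabular}{lcc}
        \toprule
        Study & Specification & RDF conversion \\
        \midrule
        W1 & yes & yes \\
        W2 &     & yes \\
        \bottomrule
      \end{tabular}
    \end{table}

Either option keeps the content searchable and avoids the blurriness of a rasterized image.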

5. Questions:
"Even though Semantic Web technologies based on this idea have flourished, only a small portion of the information on the World Wide Web is presented in a machine-comprehensible way (CSV, XLS, and XML files, in most cases)." -- is there a definition of machine-comprehensible/understandable data and could you recheck the difference between machine-comprehensible with machine-readable data?
Earlier versions of OpenRefine also enable the connection to relational databases, ..." -- does it mean that later versions do not support these features? this left the readers hanging regarding these features’ availability on later versionsOpenRefine.

6. Online appendix:
The survey spreadsheet is available in Google Sheets via the PURL link. As Google Sheets is not discoverable through, e.g., search engines, other alternatives such as Figshare/Zenodo would be better. Please also consider exporting the sheets to PDF, uploading those documents there, and referring to them in the paper.

Good luck!

Review #3
By Luis-Daniel Ibáñez submitted on 10/May/2021
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

I was also asked to read the current re-submission "as a new one", as I was not involved in the review of the first submission.

Introduction:

The work is partly motivated by the importance of the government sector for Linked Open Data, with which I agree. However, in my view the following paragraphs do not do very much to make the case for why we should study something as specific as methodologies for Linked Open Government Data and how they differ from a Linked Open Data publishing methodology (without government).
One paragraph is about how few RDF datasets have been published, providing a citation on how open data publishing is evaluated from a legal perspective rather than on how useful it is. This begs the question: if the problem is how the evaluation is done from a government perspective, why should we care about the methodologies?
The next paragraph describes some problems identified with Linked Open Government Data methodologies. They are all fair and relevant problems, but how does this survey/mapping help to solve them? Does this drive the research questions in some way?
The next paragraph is about the quality of Linked Data based on an overall LOD study. It is again unclear how this relates to publishing methodologies and to this particular study. Is it that we want to know whether current methodologies leave out that step?

Finally, in terms of suitability as an introductory text, I am missing a sentence establishing for whom this survey is targeted. What will this target reader learn about the topic? How is it going to be useful for them?

Some of these questions find an answer in a later section, but for me they should be clear from the introduction, from a clarity perspective.

Background: Overall fair; it is clear that OGD creates data dis-integration and that Linked Data seems to be a good way to tackle its integration. Something that could be improved is a precise definition of what the authors consider a "publishing methodology".

Related works: Other systematic reviews are mentioned. At the end of this section there is an argument about the autonomy of public bodies to process and publish under their own norms. It looks like this was included to support the case for an LOGD methodology, but publishers' autonomy also occurs in a general Linked Open Data setting; government is not a special case in this respect. It is in this section that we learn that one of the contributions is a general model based on the systematic mapping. This is relevant research, but it is missing from the introduction, as is an explanation of how it helps the target audience of a survey paper.

Methodology: Standard systematic mapping, no problem here. I do have a remark on one of the exclusion criteria: "The study focuses on the application of LD in a specific domain". It is mentioned in the introduction that "adopters claim (LOGD methodologies) are too generic for use". One solution to this would be to develop domain-specific methodologies, so aren't we missing relevant papers with that exclusion criterion? Smart Cities data (or at least a good subset of it) is managed by the public sector and seems to be excluded. I don't understand why genericity is highlighted as a problem in the introduction, but then some domain-specific works within a government context are excluded.

It is in this section that we get an answer to one of my questions about the introduction: why quality is particularly highlighted.

Results:

The mapping work leading to the matrix in Figure 3 is very useful; however, the summary of each step lacks detail on how the descriptions of each step by the different papers were unified, or how they differ from each other. Most descriptions seem to cite only one of the selected papers, and in some cases none. A particularly confusing example of the latter is the RDF cleaning step, which is "sometimes regarded as a separate step" when, according to Fig. 3, only one paper does so. We are also unsure how many of the 24 papers that mention RDF conversion as a step include cleaning as a subtask, and what they mean by cleaning. Without this, we cannot really answer whether the steps are really in common or not.

RQ2 on tools does have a more comprehensive answer and a comparison of which tools are mentioned by which paper. My only remark in this subsection is that it is claimed that most tools are discontinued. It would have been good to know how many of them are discontinued, and when they were discontinued.

For RQ3, what is presented is fair. It seems that no methodology has been evaluated as a case study, and there is no evaluation with end-users (though this latter point should be made explicit).

For RQ4, a problem I find is how the terms "step" and "task" are used. There is already some confusion from previous subsections, but it is here that it is felt the most. At the beginning of this section, it is said that "studies divided the tasks of publishing into phases, and in turn, in more atomic steps with clear outputs", hinting that a step is part of a task. The description of Fig. 3 in the text talks about "explicit tasks identified", but the caption and column name say "step". Some descriptions in the answer to RQ1 are referred to as "steps" and others as tasks (linking is referred to as a set of tasks, hinting at tasks as parts of steps, the opposite of the beginning of the section; cleaning is also referred to as a task). In the RQ4 subsection, things are explained in terms of phases and tasks. Assuming tasks are the finest granularity, I am missing a structured account of the quality control tasks that were found.

Unified publishing model:

This is per se a valid contribution, but I am not quite sure that it is consistent with the purpose of a survey paper in this journal. Furthermore, if this is seen as a new "methodology" or as a "roadmap for LOGD initiatives", it has exactly the same problems as the works described in the mapping: it is quite generic (and we were told adopters don't like that), and it has not been evaluated beyond a logical argument. In this section it is said that "It can be used as a roadmap for LOGD initiatives and resource initiatives", hinting that this is targeted at practitioners. It is also said that managers may decide "the level of formalism according to their context". If this is the case, what is the minimum level of formalism? How were mandatory/optional steps decided? Based on how they are labeled in the literature, or following your own logical argument? There was a spotlight on quality tasks, and a "validation step" is added after each phase, but this is not reflected in Fig. 5.

Discussion: In general I agree with it; however, there are still some unclear points about the proposed unified model. For example, "Some steps may be too expensive... in order to be implemented, a lean model is required. However, in our model we provide steps that should be considered in a formal initiative". I don't see the connection between step cost and the proposed model providing mandatory steps, especially as there is no estimation of the cost of the proposed steps.

Research directions: Mostly agree, especially with the part on longitudinal studies.

Conclusion: It is said "deriving a unified methodology", which is, again, inconsistent with the goal of a survey paper.

Overall my assessment is:

Pros:
* Methodologically correct systematic mapping (with a caveat on one exclusion criterion)
* Valuable contribution in highlighting the fact that methodologies have not yet been properly evaluated.
* Valuable contribution on tools proposed for LOGD.

That covers the "How comprehensive and how balanced is the presentation and coverage" criterion. I also think it is important for the Semantic Web community, with the caveat that it is not sufficiently motivated why we need to consider publishing methodologies for LOGD instead of only LOD.

Cons:

* Unclear suitability as introductory text; it is not explicit for whom this survey is intended.
* Interchange of step and task concepts hurts clarity.
* The goals of the paper vary across sections: the intro starts with a systematic mapping, while the conclusion talks about a "unified methodology".
* The answer to RQ1 is not satisfactory due to a lack of detail on how the different definitions in each paper were harmonised, or, alternatively, a lack of an account of the different definitions.

Recommendation: Major Revision