InteractOA: Showcasing the representation of knowledge from scientific literature in Wikidata

Tracking #: 3369-4583

Authors: 
Muhammad Elhossary
Konrad Foerstner

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Tool/System Report
Abstract: 
Knowledge generated during the scientific process is still mostly stored in the form of scholarly articles. This lack of machine-readability hampers efforts to find, query, and reuse such findings efficiently and contributes to today's information overload. While attempts have been made to semantify journal articles, widespread adoption of such approaches is still a long way off. One way to demonstrate the usefulness of such approaches to the scientific community is by showcasing the use of freely available, open-access knowledge graphs such as Wikidata as sustainable storage and representation solutions. Here we present an example from the life sciences in which knowledge items from scholarly literature are represented in Wikidata, linked to their exact position in open-access articles. In this way, they become part of a rich knowledge graph while maintaining clear ties to their origins. As example entities, we chose small regulatory RNAs (sRNAs) that play an important role in bacterial and archaeal gene regulation. These post-transcriptional regulators can influence the activities of multiple genes in various manners, forming complex interaction networks. We stored the information on sRNA molecule interaction taken from open-access articles in Wikidata and built an intuitive web interface called InteractOA, which makes it easy to visualize, edit, and query information. The tool also links information on small RNAs to their reference articles from PubMed Central on the statement level. InteractOA encourages researchers to contribute, save, and curate their own similar findings. InteractOA is hosted at https://tools.wmflabs.org/interactoa and its code is available under a permissive open source licence. In principle, the approach presented here can be applied to any other field of research.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Andra Waagmeester submitted on 04/Mar/2023
Suggestion:
Major Revision
Review Comment:

InteractOA is an intuitive GUI for regulatory networks in Wikidata. It is a welcome addition to the expanding family of GUIs on Wikidata. As a generic and scopeless knowledge graph, Wikidata can be challenging to navigate. Its content, as the authors point out, is accessible through its SPARQL endpoint (WDQS). However, SPARQL comes with a steep learning curve, and each SPARQL query can in principle be seen as the input for a distinct UI. Opening up the rich contents of Wikidata therefore requires many GUIs such as InteractOA.

As such, I welcome this paper, which is well written. I have visited the website and run various queries, all of which returned results in an acceptable time frame. I also like how the authors provide examples that help with the first steps of exploring the tool.

However, there are two issues I would like to see addressed before the paper can be accepted.

1. On page 8, line 19, the authors argue "yet the actual insights and knowledge derived from this research are often only accessible in an unstructured format within the confines of the article text". I am wondering whether Wikidata can change this. PubMed currently contains 35 million citations [1], and Google Scholar is said to contain 100 million documents, an estimated 88% of the available scientific literature [2]. Wikidata currently hosts just over 100 million items [3]. Also, the estimate of 100 million for the size of Google Scholar dates from 2014. So there are currently more scientific articles than there are Wikidata items. Storing structured data on the actual insights and knowledge derived from research would require a much larger body of data than Wikidata currently stores on all topics combined. The current size limitation of Wikidata also limits what can be covered by tools such as the one described in this paper. The authors argue: "Having demonstrated the general usefulness of this approach, we now intend to develop the application further to tap into its significant and, as yet, untapped potential." The paper could use some lines on how the authors plan to deal with the size limitation imposed by Wikidata.

2. Regarding: "1.2. Insufficient management of data, information, and knowledge in the life sciences"
This section suggests that most information remains stored in an unstructured manner. This is a bit unfair to related resources such as UniProt [4] (which, coincidentally, is stored in an RDF graph substantially bigger than that of Wikidata), IntAct [5] and a few more [6].

[1] https://pubmed.ncbi.nlm.nih.gov/about/ (website visited 2022-03-04)
[2] https://www.science.org/content/article/just-how-big-google-scholar-ummm (website visited 2022-03-04)
[3] https://www.wikidata.org/wiki/Wikidata:Statistics#:~:text=How%20big%20is....
[4] https://sparql.uniprot.org/
[5] https://www.ebi.ac.uk/intact/home
[6] https://yummydata.org/endpoint

Review #2
Anonymous submitted on 29/Mar/2023
Suggestion:
Major Revision
Review Comment:

Summary

This article presents a use case of representing scholarly knowledge in Wikidata. The knowledge has been manually extracted from various sources and ingested into Wikidata. Furthermore, the authors present a web interface called InteractOA. The interface can be used to explore the ingested data and provides customized visualizations tailored to the use case. The authors present a use case from the life sciences, specifically the modelling of small RNAs.

General remarks

The article is relevant to the Special Issue on Wikidata, as it demonstrates how Wikidata can be used to represent scholarly knowledge. The article is well written and mostly easy to follow. The paper would benefit from describing the novelty of the approach in more detail, as others have also used Wikidata to represent scholarly knowledge. Furthermore, the structure of the article can be improved. The current sections (introduction, results and discussion) do not cover the content well enough. The first subsection (1.1) provides background information rather than an introduction to the article. Only subsection 1.5 gives an introduction; the sections in between mostly present related work and background information. While this is a matter of personal preference, I think the article would benefit from separating this material into an introduction and a related work section. Similarly, the results section seems to be more of an "approach and implementation" section.

The authors do not justify why Wikidata is the best graph in which to ingest the extracted scholarly knowledge. Furthermore, as the authors mention in the discussion, Wikidata only provides high-level properties to describe knowledge, which seems to be a severe limitation for this approach. Requesting new properties is not only time-consuming but probably also not feasible at a large scale. If other types of scholarly knowledge have to be described, the Wikidata ontology might be too limited to do so. A more detailed discussion is necessary here.

Other comments

* The abstract states that "InteractOA encourages researchers to contribute, save, and curate their own similar findings", but it does not become apparent in the remainder of the article how researchers are encouraged to do so.
* From section 1.2: "Although scholarly articles are now available in digital formats (e.g. in HTML, PDF and XML)". I would weaken this sentence. HTML and XML formats are becoming more popular, but unfortunately many articles are still only available in PDF format (and these formats differ in machine-readability).
* A citation would be helpful in the last paragraph of section 1.2 (regarding the citations on article level and not on sentence level).
* It is mentioned several times in the article that Wikidata prioritizes data preservation, but no source or reference is provided. How do they accomplish this? Compared to data repositories such as Zenodo, most likely Wikidata is less suited for data preservation?
* It is mentioned that InteractOA is a "user-friendly graphical interface", but no evaluation has been presented to assess the user-friendliness. I would either remove the words "user-friendly", or better, provide a user evaluation of the interface.
* Data was (presumably manually?) obtained from "numerous research articles, RegulonDB [38], and other sources". A more in-depth description of the data collection process is missing. Also, how much data is extracted? Some additional statistics would be helpful.
* A more detailed description of the future pipeline of information extraction would be helpful to determine whether the approach can be applied in real-life environment. Who are the human curators of the data? What is their incentive for data curation? Do they need to be familiar with knowledge graphs? Are the curators researchers themselves, if not, how to ensure they have the required domain knowledge?
* Some technical details regarding the highlighted sentence are missing (e.g. mentioning that the browser "text fragment" functionality is used would help readers understand how the highlighting is accomplished). Also, it does not seem to work correctly in Firefox. When using Firefox, the user is informed that the browser is not compatible with "some functionalities". In this case it is probably better to be specific and explain to users exactly what is not working.
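
To illustrate the mechanism referred to in the last comment, the following is a minimal sketch of how a link that highlights one sentence can be built with a browser text fragment (`#:~:text=`). It is not taken from the InteractOA code base. The function name `text_fragment_url` and the example URL are hypothetical, and the sketch assumes the article URL carries no fragment of its own.

```python
from urllib.parse import quote


def text_fragment_url(article_url: str, sentence: str) -> str:
    """Append a text fragment directive that targets `sentence`.

    Assumes `article_url` has no existing '#fragment' part.
    """
    # Percent-encode every reserved character. Inside a text directive,
    # '-', ',' and '&' act as syntax, so they must be encoded as well.
    # quote() never encodes '-', hence the explicit replacement.
    encoded = quote(sentence, safe="").replace("-", "%2D")
    return f"{article_url}#:~:text={encoded}"


# Hypothetical article URL and sentence, for illustration only.
print(text_fragment_url("https://example.org/article", "sRNA binds, then"))
```

Browsers without text fragment support typically ignore the directive and load the page without highlighting, which is why it helps to tell users exactly which functionality is affected.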

Minor and styling

* The acronym "OA" in the system's name "InteractOA" is never spelled out.
* Missing space after citation at the end of the first paragraph "various domains and research areas[4]".
* Inconsistency: some figure captions end with a period, others do not.
* Figure 2 is not referenced in the text.
* Would it be possible to use the actual citation of the article in the caption of figure 3?
* Typo in section 2.1: "they are available on GitHub and at …". I assume "and" is redundant here, as the listed links are already GitHub links.
* Figure 4 is mentioned before Figure 3, probably it makes sense to swap them.

Review #3
Anonymous submitted on 06/Apr/2023
Suggestion:
Reject
Review Comment:

This paper presents the tool InteractOA, which provides a visualization of information about small regulatory RNAs, their interactions with genes and the papers that provide the references for the interactions. This information is represented using Wikidata resources. The source code is available on GitHub and Zenodo. The idea behind this work could be a good step towards improving the life sciences data available on the web, but I have some major concerns that need to be addressed before this work is published.

First of all, regarding the writing, motivation and organization of the paper, there is plenty of room for improvement. The paper would improve greatly with a different organization rather than plainly Introduction, Results and Discussion. The Introduction is too long, and the motivation gets lost in this immense section. There are many useful references, descriptions and examples that would fit better in a “Related Work” section, which the paper is missing. Adding similar approaches would also be nice: are there no similar previous works worth mentioning?

Regarding the Results section, are these indeed results? There is no evaluation whatsoever, and what is described is the tool. In addition, the description is disorganized, and by the end of the section it is not clear how the data has been transformed or what exactly the input and output are. Section 2.1 tries to describe the data modelling but mixes it with the data transformation, which are two completely different things. The input data is not described at all (what exactly are GFF files and what do they look like? Only domain experts know this kind of format), and neither is its provenance (the last line of page 4 only mentions “numerous research articles, RegulonDB [38], and other resources”). Which other resources? How are these very different resources integrated? This section needs a reorganization and could use a general workflow diagram to help the reader understand each step.

There are also plenty of statements without supporting evidence, for instance “and shortens the time needed to consult previous research” or “Having demonstrated the general usefulness of this approach”. On what evidence are these statements based? Again, no evaluation, testing or validation is presented, so how can the authors be sure this is true? There is also no mention of the impact of the tool according to the criteria for “Reports on tools and systems” papers: “demonstrable uptake of your work by the research community, industry, governments, or the general public”.

A final minor remark about the writing: the references to the figures are too far from the actual positions of the figures, and Figure 2 is not even referenced. The links to the tools and GitHub repositories should appear as footnotes the moment they are mentioned; one part of the text mentions two different tools that are not even distinguished by name, and their corresponding links appear at the end of the section.

Regarding the tool, the graph visualization and the table with the references work and look nice. The graph visualization currently contains few interactions, so the information is easy to see. How do the authors plan to manage this when more data is added? There are millions of RNA-gene interactions. Another thing that concerns me is that, despite the paper's emphasis on representing the information in Wikidata, I see nothing related to it in this visualization; I am referring specifically to the Wikidata identifiers. What is the point of using Wikidata if the output provided in the end does not reflect it? In what sense does the tool take advantage of Wikidata, and how does it improve the original data?

When clicking on the blue concepts, no information appears, and there are relationships like “instance of 2”. What is 2? The graph visualization does not link to the provenance of the information, that is, to the information stored in the table shown when clicking on “View citations”. In this table, again, I see little reference to the Wikidata modelling that the authors have performed. And how has this information been extracted, modelled and transformed? Moreover, the remaining visualizations (bar plots, etc.) add no information; they are unclear or poorly presented.

Additionally, two things would be useful for the webpage: (1) making the transformed data, modelled according to Wikidata, available for download (like the Wikidata dumps), and (2) a visualization of the transformed data presented as Wikidata pages, which is also convenient for user exploration.

The original idea of the paper is fine, but in my opinion the presented work is not mature enough to be published. It needs major rework of both the system itself and the organization and writing of the paper.