A Systematic Literature Review on RDF Triple Generation from Natural Language Text

Tracking #: 3650-4864

Authors: 
Andre Regino
Anderson Rossanez
Ricardo da Silva Torres
Julio Cesar dos Reis

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Survey Article
Abstract: 
We live in a big data era of unstructured data expressed as natural language (NL) texts. As the volume of text-based information grows, effective methods for encoding and extracting meaningful knowledge from this corpus are of paramount relevance. A challenging task concerns transforming NL texts into structured and semantically rich data. Semantic web technologies have revolutionized the way we represent and access structured knowledge. Resource Description Framework (RDF) triples serve as a fundamental building block for this purpose, allowing the integration of diverse data sources. This survey examines methods for RDF triple generation and Knowledge Graph (KG) enhancement from natural language texts. This study area presents wide-ranging applications encompassing knowledge representation, data integration, natural language understanding, and information retrieval. Our systematic literature review addresses the understanding, characterization, and identification of challenges and limitations in existing approaches to RDF triple generation from NL texts and their inclusion into an existing KG. We retrieved, categorized, and analyzed 150 articles from several scientific databases. We provide a comprehensive overview of the field, identify research gaps, and provide directions for future research. We found the most commonly available study categories, especially considering the domain, the targeted language, the public availability of datasets, and real-world applications. Our results reveal a growing trend in this field in the last few years related to the use of transformer-based machine learning methods for triple generation. Our study also drives innovation by highlighting open research questions and providing a roadmap for future investigations.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 18/May/2024
Suggestion:
Major Revision
Review Comment:

# Paper Summary
The paper reviews the literature on RDF Triple Generation.
It starts with an introduction detailing the motivation for the paper. A gap for the paper, the combination of RDF triple generation and KG enhancement, was identified.
Then, it formally defines the triple generation problem and talks about problems accompanying this task.
Afterward, the survey methodology is specified, which is split into three phases: preparation, execution, and reporting.
This included identifying the aforementioned gap in the literature and defining research questions, databases, database queries, and inclusion/exclusion criteria.
Papers were retrieved according to the queries, read and either included or excluded. Furthermore, each paper was categorized. This led to 150 papers.
Then, the authors reviewed all 150 identified papers and deemed 15 fitting.
In the next step, the results of the literature analysis were reported, first quantitatively based on metadata, then based on the categorization.
Lastly, the findings, answers to research questions and open research challenges were presented.

# Summary of Strengths
A survey investigating this field of research is indeed needed and important as the consideration of ontological information is usually ignored by both survey and method papers alike. Therefore, the paper is justified and especially the focus on RDF makes it a fit for the Semantic Web Journal.
The methodology section describes all the steps taken to accomplish the literature review well.
Apart from the missing mention of entity linking, I think it gives a good introduction to the topic.

# Summary of Weaknesses
In my opinion, there are major flaws in regard to multiple aspects:
First, the inclusion/exclusion criteria do not mention RDF at all. This probably led to the large number of papers excluded after the authors reviewed the papers identified by the master and Ph.D. students. A criterion mentioning RDF should therefore be added. EC-06 is also missing in Table 4.
If most of the 150 papers are filtered out, why are they still considered in section 4? “We confirmed that not all the surveyed articles conformed to the specific focus of our study, which centered on the generation of triples from NL texts, and their insertion into an existing KG, adhering to predefined ontology specifications.” sounds like they were not in the scope of the study. Please clarify that.
Could you please elaborate more on what Figure 8 depicts? It is not fully clear to me. Especially as the answer to RQ-02 relates to this Figure. I do not see how the Figure is connected to patterns.

I do not agree with the way the categories are defined. Why are transformers the only architecture listed? Are no others, such as RNNs or GNNs, used? Those are still commonly used in relation extraction. If there are no such RDF triple generation methods, this should be stated more clearly at some point, especially as Technical Methodology is a group of categories containing only a single category.

Furthermore, I do not understand the categorization of the remaining 15 papers. Only six papers are mentioned in regard to language specificity (section 5.1). What about the rest? Are they using no language at all?
The same problem applies to the other categories.

The subquestion “What are the most accurate one?” of RQ-03 was not answered. Yes, transformer-based methods are the most popular ones, but are they the most suitable as well? Either remove this subquestion or answer it.

Some of the open research challenges (section 6.3) do not follow from the rest of the paper. Some connection to the actual survey would be great. For example, the paragraphs on “Evaluation Metrics for Quality and Completeness” and the one on “Handling Ambiguity and Polysemy” need some evidence.

What is the importance of Table 3? In my opinion, this is already shown by the given query and the Table therefore just takes up space.

The survey category is not shown in Table 6.

Why is entity linking not mentioned once? Yes, NER is important, but if one wants to include new triples in a KG, identifying whether an encountered entity already exists in it is important as well.

# Questions
How was RQ-05 investigated? Just based on the papers or also in a different way?

# Summary
While I agree that the paper is a fit for the Semantic Web Journal, it has major flaws in regard to the stated inclusion criteria (which makes it hard to understand what papers are actually allowed to be included), the stated categories, the categorisation of the identified papers and the answers to the stated research questions. Furthermore, there are several inconsistencies in regard to categories showing up in one place but not the other. The presentation of the paper suffers from that.
These problems make a major revision necessary.

Review #2
Anonymous submitted on 18/Nov/2024
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

Review #3
Anonymous submitted on 25/Nov/2024
Suggestion:
Major Revision
Review Comment:

The manuscript “A Systematic Literature Review on RDF Triple Generation from Natural Language Text” presents a relevant subject for the Semantic Web community: RDF data generation from text. Below the authors can find some comments.

(1) Suitability as an introductory text, targeted at researchers, PhD students, or practitioners to get started on the covered topic.

Across the first two sections (1-Introduction and 2-The Triple Generation Problem), the authors introduce the reader to the topic and problem of RDF data generation from text sources and highlight the relevance of its treatment.
The problem formulation and the typical pipeline for triple generation and KG enhancement based on NL texts provide the appropriate context to understand the topic, and through the "Motivating Example" (section 2.2) the reader can clearly understand the complexity and challenges of the task.
------------------------------
------------------------------
(2) How comprehensive and how balanced is the presentation and coverage?
The review covers a broad and representative range of existing approaches for generating RDF triples from text. The methodology used for the study was based on the work of Budgen and Brereton, a well-known guide to conducting literature reviews.

The search was made using the most popular databases in the Computer Science area, and the authors clearly explained what the search strategy was, and what inclusion and exclusion criteria they used to select the articles related to the interest of the survey.

The analysis of the literature was carried out in two phases. In the first phase, some bar charts are generated to summarize certain characteristics of the 150 selected papers, such as articles by year, category, domain, database, type of evaluation, and others. Then, in the second phase, a more in-depth qualitative analysis of the 15 articles that are most closely related to the purpose of the review is carried out.

Regarding the methodology and the results obtained, several aspects need to be reviewed/improved.

- Table 3: Expression 1 explains which terms were chosen to carry out the search; therefore, what is the need to detail all combinations of terms in Table 3? This could instead be explained in the text.
- Figures 3-7: It is suggested to use more appropriate colors for the figures, such as neutral colors.
- On page 12 (Reporting phase), it is mentioned that the paper describes "correlations" in the literature. No evidence of this has been found in the paper. The only relationship between categories is the one found in Figure 8 (Venn Diagram). However, the intersections found in the categories of the 15 papers analyzed could not be considered as correlations. Regarding Fig. 8, it might be more useful to use a bubble diagram to better understand how many papers fit into each category or share categories. Another more appropriate option might be to use a table to describe the 15 papers and the value they have in each of the categories analyzed (language specificity, ontology and KG enhancement, domain, technical methodology, etc.)

- Figure 2 presents the 15 steps of the methodology that the authors have followed. Section 3 explains how each step was carried out; however, the results of the review (Reporting Phase) are described in sections 4 to 6 and their names do not correspond to the steps indicated in that figure. That is, section 4 should be called Statistical Analysis, section 5 should be Research Questions Analysis, and section 6 should be called Open Challenges. In this way, each of the steps of the methodology would be followed (described in Fig. 2).

- In the description of step 10 of the methodology (page 11), the authors refer to Figure 7. Should Table 7 be referenced instead? Regarding Table 7, it lists metadata that are not later used to summarize the characteristics of the articles, for example, "Country -> nationality of the authors" and Methodology.

------------------------------
------------------------------
(3) Readability and clarity of the presentation.

Unfortunately, although the manuscript addresses a relevant topic, the organization and presentation of the results should be significantly improved to maximize their impact and clarity. Some points related to the methodology and results have already been commented on in the previous point. Additionally:

- Section 5 describes each of the 15 articles, through different subsections, but this does not allow us to have a broader or complete view of the advantages and disadvantages of each technical method used to generate RDF graphs from text sources.

- Section 6 briefly answers the competency questions, but in several cases, it does not provide specific information that could be important to identify the most valuable works, according to some criteria. For example, in RQ-05 it is stated that "we observed that a portion of the studies in our survey are not currently employed in real-world applications." By "portion", do you mean 20, 50, or 90% of the papers analyzed? Could you add the citations to identify them?

In general, the language used in the manuscript is appropriate, but some points require revision. For example, <“Reporting Phase.”.>.

------------------------------
------------------------------

(4) Importance of the covered material to the broader Semantic Web community.

Through sections 5 and 6, the authors identify the limitations of the analyzed proposals, highlight relevant gaps in the literature, propose future directions that can guide the development of the field, and identify the research challenges. In addition, the authors could expand the discussion, referring to the practical applicability (implementation) of the methods in real environments and the concrete applications in which KGs created from text could be useful, thus reinforcing the importance of this process.