On General and Biomedical Text-to-Graph Large Language Models

Tracking #: 3642-4856

Authors: 
Lorenzo Bertolini
Roel Hulsman
Sergio Consoli
Antonio Puertas Gallardo
Mario Ceresa

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper

Abstract: 
Knowledge graphs and ontologies represent symbolic and factual information that can offer structured and interpretable knowledge. Extracting and manipulating this type of information is a crucial step in complex processes such as human reasoning. While Large Language Models (LLMs) are known to be useful for extracting and enriching knowledge graphs and ontologies, previous work has largely focused on comparing architecture-specific models (e.g. encoder-decoder only) across benchmarks from similar domains. In this work, we provide a large-scale comparison of the performance of certain LLM features (e.g. model architecture and size) and task learning methods (fine-tuning vs. in-context learning (iCL)) on text-to-graph benchmarks in two domains, namely the general and biomedical ones. Experiments suggest that, in the general domain, small fine-tuned encoder-decoder models and mid-sized decoder-only models used with iCL reach overall comparable performance with high entity and relation recognition and moderate yet encouraging graph completion. Our results further tentatively suggest that, independent of other factors, biomedical knowledge graphs are notably harder to learn and better modelled by small fine-tuned encoder-decoder architectures. Pertaining to iCL, we analyse hallucinating behaviour related to sub-optimal prompt design, suggesting an efficient alternative to prompt engineering and prompt tuning for tasks with structured model output.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Fidel Jiomekong submitted on 20/May/2024
Suggestion:
Minor Revision
Review Comment:

(1) Originality

Acquiring structured knowledge from text for the purpose of ontology and/or KG construction is a challenging task. In this paper, the authors highlight the fact that related work on extracting and manipulating symbolic knowledge has mainly focused on comparing architecture-specific models. Thus, they compare the performance of LLMs on text-to-graph benchmarks, considering other dimensions. The results obtained by the authors make this work original. In fact, the authors found that small fine-tuned models and mid-sized models used with in-context learning may reach good performance. This finding is important, as large models are difficult to scale, which may limit their adoption in contexts where resources for fine-tuning large models are not available. In addition, the authors use existing open-source LLMs, which makes this work reproducible by fellow researchers.

(2) Significance of the results
The authors ran their experiments on two challenging datasets and considered four open-source LLMs of different sizes. The results obtained by the authors are significant.

However, it would be good if the authors could compare these results to existing results (presented in the state-of-the-art section). Even if the authors' results are not higher than those of some existing work, the fact that the authors show that small LLMs can achieve good results is considerable.

(3) Quality of writing
In the related work section, the authors are very vague in their presentation of related works. It would be good if the authors compared these works along several dimensions already mentioned in the paper (e.g., a short description of the model, model architecture, knowledge extracted, benchmark considered, limitations of the proposed models, etc.). In addition, a table comparing related work along these dimensions would add value to the paper. They could provide one table for the datasets and one for the models. The claims “the literature does not offer a systematic experimental comparison of the efficiency of contributing factors in an end-to-end text-to-graph task” and “To the best of our knowledge, this work proposes a first quantitative investigation of the abilities of both encoder-decoder and decoder-only models, without the aid of any prompt-construction resource.” are difficult to assess without such a table. Concerning the latter claim, the authors should also add a column for their own work.

In Section 3, the authors present the experimental environment they are using and provide tables describing the datasets, which is good. It would be good to also present such a table describing and comparing the models used, as presented in the text.

Minor errors:
- Page 6, last line: “by Hugging Face.12” -> “by Hugging Face1,2.”
- Page 8: “T5 The T5 family” -> “The T5 family”
- “BART We make use of the BART model (facebook/bart-large3) introduced in [46].” -> “BART model … was introduced by Lewis et al. [46].”
- “Llama-2 Upon release, the Llama-2 …” -> “Upon release, the Llama-2 …”.
- “Mistral-v0.1 A more recent introduction, the Mistral-v0.1” -> “A more recent introduction, the Mistral-v0.1”
- Page 9: “is given in Figure 1 below” -> “is given in Figure 1”
- Page 10: “Note that for Web NLG we compute” -> “Note that for Web NLG, we compute”
- Page 12, Figure 3 caption: “At 406M and 7000 M learnable” -> “At 406M and 7000M learnable”

Questions
The authors claim: “There are a vast amount of differences with the T5 family, including the pre-training corpus, tokenization …” - This is very vague for people who are discovering the two domains. What are these differences? The authors should highlight them.
Are there no baselines to which the authors can compare their results? I propose that the authors compare their results, in a table, with the results presented in the related work section.

Review #2
By Finn Årup Nielsen submitted on 28/May/2024
Suggestion:
Minor Revision
Review Comment:

The manuscript deals with information extraction of relations from
texts. The paper works with two datasets, Web NLG and Bio Event, and
tests several language models on the information extraction task.

It is generally a sound and well-written manuscript with some
interesting results, and I see no reason for a major revision.

A better manuscript would have applied more datasets and used larger
language models with thorough prompt engineering. The best models at
present are the OpenAI GPT-4(o) models and similar, and these have not
been applied. The authors have restricted themselves to what could run
on their local RTX 8000 GPU hardware. Even thorough experimentation
with prompt engineering has not been done. That being said, I still see
that the manuscript has merit within the restrictions the authors have selected.

I have a few minor issues:

Here and there in the manuscript the authors claim that some of the
results follow a power-law scaling (e.g., p. 15, line 4). To support a
power-law scaling I would say that the data should show the law across
several orders of magnitude. In their case it spans only 60M to
738M, with no statistical test to support the claim. I suggest
that the authors remove the claim(s), or considerably moderate the
claim(s) found throughout the manuscript. Removing the
claim has no influence on the overall results and conclusions of the
paper. An example where the claim appears is p. 2, line 45. Also, when
examined across architectures/models there is no power law; see Fig. 3.
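
As a rough illustration of the kind of check being asked for here, the
sketch below fits a power law y = a * x^b by linear regression in
log-log space and reports the fit statistics; the parameter counts and
scores are invented placeholders, not values from the manuscript.

    # Minimal power-law check: linear regression in log-log space.
    # The model sizes and metric values below are hypothetical.
    import numpy as np
    from scipy import stats

    params = np.array([60e6, 220e6, 406e6, 738e6])   # hypothetical model sizes
    scores = np.array([0.41, 0.47, 0.50, 0.53])      # hypothetical metric values

    fit = stats.linregress(np.log10(params), np.log10(scores))
    print(f"exponent b = {fit.slope:.3f}, R^2 = {fit.rvalue**2:.3f}, p = {fit.pvalue:.3g}")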

The abstract claims that "extracting and manipulating" data in
knowledge graphs and ontologies "is a crucial step in complex
processes such as human reasoning". The authors cannot claim that human
reasoning works on the basis of knowledge graphs (I do not think we
have a knowledge graph and ontology in our heads). The authors could
remove the part of the sentence "such as human reasoning".

Typo in p.4 line 12: GhatGPT

The authors observe problems/errors in the dataset they use, and this
is interesting. The authors should provide an estimate of how serious
this is for the datasets. For instance, is it a negligible part, well
below 1 percent, or is it a double-digit percentage seriously affecting
the dataset? I am not requesting a deep analysis, just a very rough
estimate based on a few samples, with a small note.
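
One rough way to obtain such an estimate would be to label a small
random sample of items by hand and report the proportion of problematic
ones with a confidence interval, as in the sketch below; the counts are
invented for illustration, not an actual audit of either dataset.

    # Rough-estimate sketch: proportion of problematic items in a small
    # manually labelled sample, with a Wilson confidence interval.
    # The counts below are hypothetical.
    from statsmodels.stats.proportion import proportion_confint

    n_sampled, n_problematic = 50, 3   # hypothetical manual audit
    low, high = proportion_confint(n_problematic, n_sampled, alpha=0.05, method="wilson")
    print(f"error rate ~ {n_problematic / n_sampled:.1%} (95% CI {low:.1%}-{high:.1%})")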

For the RTX 8000, provide the full manufacturer and model name: I suppose it
is an NVIDIA Quadro RTX 8000? (p. 9, line 38)

P. 11, line 37: T5 is referred to as a fine-tuned decoder-only
model, while in the table the T5 model is grouped as an
encoder-decoder model.

For the panels in Fig. 3, I would not add a line between the 4th and 5th points in the
plot, because they belong to separate families (T5 and BART).

Using ROUGE as a metric for evaluating the performance of
relationship extraction is surprising to me. It is often used in
summarization system evaluations, but I do not recall it being used for
relationship extraction. Could the authors provide previous example(s)
of its use for this task? It is not clear how precisely it is computed, and
a scoring of a short relationship extraction example would benefit me
as a reader. For instance, are ROUGE-2 and ROUGE-L dependent on the
ordering of the extracted triplets? If that is the case, there would
need to be a note about that in the manuscript.
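
To make the ordering question concrete, the sketch below scores a pair
of linearized triplet strings with the rouge-score package; the
triplets and the "|"/";" linearization are invented for illustration
and need not match the one used in the manuscript. Because ROUGE-2
counts bigram overlap irrespective of position while ROUGE-L rewards
the longest common subsequence, reordering the triplets mostly lowers
ROUGE-L.

    # Minimal sketch: ROUGE on linearized triplets (invented example).
    from rouge_score import rouge_scorer

    gold = "Alan_Bean | occupation | Test_pilot ; Alan_Bean | nationality | United_States"
    pred = "Alan_Bean | nationality | United_States ; Alan_Bean | occupation | Test_pilot"

    scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=False)
    scores = scorer.score(gold, pred)  # score(target, prediction)
    print(scores["rouge2"].fmeasure, scores["rougeL"].fmeasure)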

Review #3
Anonymous submitted on 16/Aug/2024
Suggestion:
Minor Revision
Review Comment:

The paper presents an analysis of various large language models (LLMs) and optimization techniques for the text-to-graph task. The authors evaluated a range of modern LLMs on two benchmarks—Web NLG (generic) and Bio Event (biomedical). The study includes both encoder-decoder and decoder-only architectures and compares fine-tuning with in-context learning approaches.

Overall, the paper is well-written, though certain sections require further elaboration. For example, Section 3.2 would benefit from additional clarification (see specific comments below).

The analysis is robust and the results are insightful. Code and data are made available in a public repository. The discussions on in-context learning (Section 4.2) and hallucination (Section 4.2.1) are well-executed. The findings, though not unexpected, are novel and should be valuable to the research community working in this domain. I consider this a solid submission that the journal should accept after revisions and improvements.

Below, I provide specific comments and questions.

Section 1.
I recommend adding more details about the benchmarks and the specific tasks being performed. This will help clarify how the models are being tested. For instance, Insight #3 mentions "both benchmarks," but the benchmarks have not yet been introduced at that point.

Section 2.
I suggest including references to recent efforts that use text-to-graph methods to build knowledge graphs in the scientific domain. Examples include ORKG [1], CS-KG [2], LLMs4OL [3], and others.

[1] Kabongo et al. 2024. ORKG-Leaderboards: a systematic workflow for mining leaderboards as a knowledge graph. International Journal on Digital Libraries, 25(1), pp.41-54.
[2] Dessí et al. 2022. SCICERO: A deep learning and NLP approach for generating scientific knowledge graphs in the computer science domain. Knowledge-Based Systems, 258, p.109945.
[3] Babaei Giglou et al. 2023. LLMs4OL: Large language models for ontology learning. In International Semantic Web Conference (pp. 408-427). Cham: Springer Nature Switzerland.

Section 3.
“However, additional comments are required.” It is unclear why this is necessary. I assume that comments are mapped to triples to find pairs of text/triples, but this needs to be stated explicitly.

The mention of "78 knowledge graphs" in the BioEvent dataset for the graph-to-text task seems excessive. I assume these are not complete knowledge graphs but rather sets of triples extracted from a large text. This should be clarified to avoid confusion.

It would also be helpful to include an assessment of a sample of items in the BioEvent dataset to better understand the frequency of the type of problematic situations discussed on page 6.

Section 3.3.
In addition to ROUGE scores, it would be beneficial to report the exact match rate, assuming a sufficient number of matches exist.
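
A simple set-level exact match could be computed as in the sketch
below; the helper name and the toy triplets are illustrative only.

    # Minimal sketch: fraction of examples whose predicted triplet set
    # exactly equals the gold triplet set. Names and data are illustrative.
    def exact_match_rate(predictions, references):
        hits = sum(set(p) == set(g) for p, g in zip(predictions, references))
        return hits / len(references)

    preds = [[("A", "relA", "B")], [("A", "relA", "B"), ("B", "relB", "C")]]
    golds = [[("A", "relA", "B")], [("A", "relA", "C")]]
    print(exact_match_rate(preds, golds))  # 0.5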

Section 4.
It is unclear why only encoder-decoder models were fine-tuned, excluding decoder-only models like Mistral and LLaMA. Was this due to computational limitations? Please clarify.

Table 2.
Please highlight the best results in bold for clarity.

Review #4
Anonymous submitted on 18/Aug/2024
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The main contribution of the work is a systematic experimental comparison of various model architectures, sizes, and task learning methods. It focuses on developing an end-to-end Large Language Model (LLM)-based knowledge graph extraction system from textual sources. The study offers empirical evidence comparing fine-tuning and in-context learning (iCL) methods, demonstrating that fine-tuning generally yields better performance, especially in complex domains like biomedicine. The key aspects of this contribution are: (1) comparative analysis, (2) insights on performance, (3) recommendations for practitioners, and (4) a heuristic for hallucination control.

This work's potential impact on knowledge graph extraction and the broader landscape of natural language processing (NLP) is significant.

Some negative aspects and difficulties that may arise when attempting to apply the methodology, along with considerations for addressing these challenges:
1. The absence of a well-defined, step-by-step methodology can make it difficult for practitioners to replicate the approach or adapt it to their specific needs.
- For this reason, I suggest developing a more structured methodological framework that outlines the processes involved in applying the proposed techniques. This could include detailed prompt patterns and combinations of the iCL settings (zero- vs. two-shot learning) with and without the inclusion of other tasks.
2. The research discusses various model architectures, sizes, and task-learning methods, which can create confusion for practitioners who may not have extensive experience in model selection or configuration.
- Providing flowcharts that guide users through the model selection process based on their specific requirements (e.g., computational resources, domain specificity) could simplify the decision-making process. A running example, beyond those described for the scenarios, should be used throughout the paper.
3. While the methodology may be generalizable, the specific challenges associated with different domains (e.g., biomedical, legal, etc.) can complicate the application of the proposed techniques. Each domain may have unique data characteristics, terminology, and requirements that need to be addressed.
4. The research assumes access to specific computational resources (e.g., a single RTX 8000 GPU), which may not be feasible for all practitioners. This limitation can hinder the ability to implement the proposed methodologies effectively.
- A possible consideration could include strategies for optimizing performance on less powerful hardware or utilizing cloud-based solutions.
5. The findings and recommendations may require further validation across a broader range of datasets and tasks to establish their robustness and reliability. Without extensive testing, practitioners may be hesitant to adopt the methodologies.

The paper outlines several constraints and assumptions that may limit the applicability of its findings and methodologies for knowledge graph extraction using LLMs:
- The research assumes access to specific computational resources, particularly a single RTX 8000 GPU. This constraint may limit the applicability of the proposed methods for practitioners with less powerful hardware, as the performance and scalability of the models can be significantly affected by the available computational power.
- The study considers specific model architectures (encoder-decoder and decoder-only) and sizes (small to mid-sized models). The findings may not generalize well to other architectures or larger models, which could behave differently in terms of performance and efficiency.
- The results may not apply to other learning paradigms or hybrid approaches that combine elements of fine-tuning and iCL. Additionally, the performance of iCL is noted to be weaker than that of fine-tuning, which may discourage its use in scenarios where fine-tuning is not feasible.
- The assumption that high-quality, representative datasets are available may not hold in all cases, particularly in niche domains where data is scarce or poorly structured.

Task-Specific Metrics: Creating task-specific evaluation metrics that align with the goals of knowledge graph extraction could provide more meaningful insights into model effectiveness. They can be designed to account for the nuances of "complex" relationships, such as hierarchical or temporal dependencies, which are crucial for accurately representing knowledge.