Review Comment:
The paper presents an analysis of various large language models (LLMs) and optimization techniques for the text-to-graph task. The authors evaluated a range of modern LLMs on two benchmarks—WebNLG (generic) and BioEvent (biomedical). The study includes both encoder-decoder and decoder-only architectures and compares fine-tuning with in-context learning approaches.
Overall, the paper is well-written, though certain sections require further elaboration. For example, Section 3.2 would benefit from additional clarification (see specific comments below).
The analysis is robust and the results are insightful. Code and data are made available in a public repository. The discussions on in-context learning (Section 4.2) and hallucination (Section 4.2.1) are well executed. The findings, though not unexpected, provide novel empirical evidence and should be valuable to the research community working in this domain. I consider this a solid submission that the journal should accept after the suggested revisions.
Below, I provide specific comments and questions.
Section 1.
I recommend adding more details about the benchmarks and the specific tasks being performed. This will help clarify how the models are being tested. For instance, Insight #3 mentions "both benchmarks," but the benchmarks have not yet been introduced at that point.
Section 2.
I suggest including references to recent efforts that use text-to-graph methods to build knowledge graphs in the scientific domain. Examples include ORKG [1], CS-KG [2], LLMs4OL [3], and others.
[1] Kabongo et al. 2024. ORKG-Leaderboards: a systematic workflow for mining leaderboards as a knowledge graph. International Journal on Digital Libraries, 25(1), pp.41-54.
[2] Dessì et al. 2022. SCICERO: A deep learning and NLP approach for generating scientific knowledge graphs in the computer science domain. Knowledge-Based Systems, 258, p.109945.
[3] Babaei Giglou et al. 2023. LLMs4OL: Large language models for ontology learning. In International Semantic Web Conference (pp. 408-427). Cham: Springer Nature Switzerland.
Section 3.
“However, additional comments are required.” It is unclear why this step is necessary. I assume the comments are mapped to triples to identify text/triple pairs, but this should be stated explicitly.
Describing the BioEvent data for the graph-to-text task as "78 knowledge graphs" seems like an overstatement. I assume these are not complete knowledge graphs but rather sets of triples extracted from longer texts; this should be clarified to avoid confusion.
It would also be helpful to assess a sample of items from the BioEvent dataset to better understand how frequently the problematic situations discussed on page 6 occur.
Section 3.3.
In addition to ROUGE scores, it would be beneficial to report the exact match rate, assuming a sufficient number of matches exist.
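To make this suggestion concrete, below is a minimal sketch of the exact-match computation I have in mind; the (subject, relation, object) tuple format and the normalization are my own assumptions, not something taken from the paper:

def normalize(triple):
    # Assumed normalization: strip whitespace and lowercase each element.
    return tuple(part.strip().lower() for part in triple)

def exact_match_rate(predictions, references):
    # predictions/references: lists of triple lists, one entry per example.
    # An example counts as an exact match when its normalized triple set
    # equals the gold triple set (order-insensitive).
    matches = sum(
        {normalize(t) for t in pred} == {normalize(t) for t in ref}
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)

A set-based comparison seems preferable to comparing serialized graphs as strings, since the order in which triples are generated should not affect correctness.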
Section 4.
It is unclear why only the encoder-decoder models were fine-tuned, excluding decoder-only models such as Mistral and LLaMA. Was this due to computational limitations? Please clarify.
Table 2.
Please highlight the best results in bold for clarity.