Large Language Models for Creation, Enrichment and Evaluation of Taxonomic Graphs

Tracking #: 3751-4965

Authors: 
Irina Nikishina
Viktor Moskvoretskii
Alina Lobanova
Ekaterina Neminova
Alexander Panchenko
Chris Biemann

Responsible editor: 
Guest Editors KG Construction 2024

Submission type: 
Full Paper
Abstract: 
Taxonomies play a crucial role in organizing knowledge for various natural language processing tasks. Recent advancements in large language models (LLMs) have opened new avenues for automating taxonomy-related tasks with greater accuracy. In this paper, we explore the potential of contemporary LLMs in learning, evaluating and predicting taxonomic relations across multiple lexical semantic tasks. We propose a novel method for taxonomy-based instruction dataset creation, encompassing multiple graph relations. Using this dataset, we build TaxoLLaMA, a unified model fine-tuned on datasets exclusively based on English WordNet 3.0, designed to handle a wide range of taxonomy-related tasks such as Taxonomy Construction, Hypernym Discovery, Taxonomy Enrichment, and Lexical Entailment. The experimental results demonstrate that TaxoLLaMA achieves state-of-the-art performance on 11 out of 16 tasks and ranks second on 4 others. We also explore the ability of LLMs to refine constructed taxonomy graphs and present a comprehensive ablation study and a thorough error analysis supported by both manual and automated techniques.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By John McCrae submitted on 04/Nov/2024
Suggestion:
Minor Revision
Review Comment:

This paper presents a methodology for extracting taxonomies using large language models. The methodology itself is broadly prompt engineering and fine-tuning, but the extensive evaluations in this paper make it an interesting analysis of an important task. The detailed analysis and the state-of-the-art results compared to many baselines mean this paper will be of interest to a wide range of researchers working on knowledge graphs.

I have a couple of concerns about how the task is set up. Firstly, the authors claim that "animal" is an incorrect hypernym of "Maltese" on p8. In fact, it is an indirect hypernym in WordNet, and as such it seems odd to penalize the system for this. I am not sure how it biases the results, but it would be important to see results with indirect hypernymy considered as well for most of the experiments, especially when comparing to other systems. This is particularly important as the authors have not presented the F&M analysis of the original TexEval work that includes indirect taxonomies. Further, in Fig 5, it is reported that 75% of errors are due to overly broad hypernyms. If these are all indirect hypernyms, then it is likely that this is a large misestimation of the errors in the system.
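For reference, indirect hypernymy can be checked against WordNet's transitive hypernym closure, e.g. with NLTK (a minimal sketch; the exact synset identifiers are my assumption):

from nltk.corpus import wordnet as wn

# Walk the full (transitive) hypernym closure of the hyponym synset
# and test whether the candidate appears anywhere in it.
maltese = wn.synset('Maltese_dog.n.01')   # assumed synset id for "Maltese"
animal = wn.synset('animal.n.01')
is_indirect_hypernym = animal in set(maltese.closure(lambda s: s.hypernyms()))
print(is_indirect_hypernym)  # expected: True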

Secondly, the authors define several tasks based on hyponym prediction. This task is essentially impossible as it is underspecified; as such, the authors report very poor results, but it is likely that the systems are suggesting valid hyponyms that are simply not in WordNet. A simple fix would be to discount any predictions of terms not in WordNet, but a proper manual evaluation is really required here to see if the predictions are valid hyponyms. The further analysis in Sec 6.1 is quite ambiguous for this reason, and I suspect that the results are more affected by another factor (e.g., the instances such as 1bD represent rare terms).
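As a rough illustration of the suggested fix, predictions without any noun synset in WordNet could be discounted before scoring (a sketch with NLTK; the variable names are hypothetical):

from nltk.corpus import wordnet as wn

predictions = ["puppy", "labradoodle", "zorgle"]  # hypothetical system output
# Keep only candidates attested as nouns in WordNet; the rest are discounted
# rather than counted as errors.
attested = [t for t in predictions if wn.synsets(t.replace(' ', '_'), pos=wn.NOUN)]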

The prompt on p3 starts with "You are a helpful assistant"; some authors have explicitly recommended against such prompts [1], and more specific persona instructions have been found to be helpful [2].

The authors refer to "English WordNet-3.0"... The (Open) English WordNet project has not made such a release, and the resource the authors are referring to is Princeton WordNet. The authors should comment on why they are using an out-of-date release of English WordNet rather than one from the OEWN project.

p6, l29-30: I don't think you need to explain a `pop` operation.

The authors evaluate on WordNet (via TexEval) in Sec 5.1, but do not describe the steps they have taken to ensure that the test data was not seen during training.

Please take care with double quotes throughout the article. It appears that the article was prepared with LaTeX, in which case your source should not contain the straight double-quote character (") anywhere.

I wouldn't capitalise the tasks defined in Section 5 (taxonomy construction, etc.). They are not proper nouns.

p12. "Lexical Entrailment"

I would recommend making Figures 1a and 1b two separate figures and placing Figure 1b near where it is referenced in the text (Section 6). This would also alleviate ugly references like "1aA".

I am not sure why you did a manual evaluation in S6.1.1 rather than just using a simple proxy such as term frequency in a reference corpus.
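Such a proxy could be as simple as a frequency lookup in a reference corpus, e.g. the Brown corpus via NLTK (a sketch; the choice of corpus and threshold is only an assumption):

from collections import Counter
from nltk.corpus import brown

freq = Counter(w.lower() for w in brown.words())

def is_specialised(term, threshold=5):
    # Treat low-frequency terms as specialised/rare rather than common vocabulary
    return freq[term.lower()] < threshold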

I note that the "Long-term Stable Link to Resources:" is to another submission. As such, it does not support the reproduction of the resources in this paper.

[1] https://medium.com/the-generator/the-perfect-prompt-prompt-engineering-c...
[2] https://www.dre.vanderbilt.edu/~schmidt/PDF/Evaluating_Personified_Exper...

Review #2
Anonymous submitted on 06/Jan/2025
Suggestion:
Reject
Review Comment:

Recent advancements in large language models (LLMs) have improved the automation of taxonomy-related tasks, which are essential for organizing knowledge in natural language processing tasks. In this paper, the authors describe TaxoLLaMA, a model fine-tuned on English WordNet 3.0, designed to handle tasks like Taxonomy Construction, Hypernym Discovery, Taxonomy Enrichment, and Lexical Entailment. TaxoLLaMA achieves state-of-the-art performance in 11 out of 16 tasks and ranks second in 4 others.
The authors acknowledge in Section 1 (p. 3, l. 7) that the manuscript is an extension of references [59] and [60] and state that this submission includes the following novelties:
- They examine the ability of LLMs to resolve graph cycles using learned relations and explore the benefits of this procedure.
- They investigate how LLMs can leverage multiple relations to refine an already constructed graph.
- They explore the advantages of utilizing bidirectional relations to enhance the refinement of constructed taxonomies.
- They extend their Taxonomy Construction results to include the Food subset.
Nevertheless, I failed to find in the paper a clear focus on these novel aspects. As indicated below, the key elements in the dataset, model, experiments and results are already published in refs. [59] and [60], and no relevant experimental results regarding these novel aspects are described in this manuscript.
Furthermore, the following contradictory statements are found in the paper, which need clarification:
- The authors indicate as a novelty that they "extend their Taxonomy Construction results to include the Food subset", but the Food subset is also considered in paper [60] for the Taxonomy Construction task.
- The authors indicate (p. 3, l. 6) that they "also make data, code and models publicly available" in footnote 3 (https://github.com/uhh-lt/lexical_llm), but this link points to paper [59], "Are Large Language Models Good at Lexical Semantics? A Case of Taxonomy Learning", published at LREC-COLING 2024.
Therefore, in general, I cannot find enough novelty in the paper to warrant a new publication.
In terms of originality, the authors should make clear in the paper which novelties it adds over their previous papers [59] and [60], since all the key details about the model are described there. Some (non-exhaustive) examples follow:
- Section 2, "Related Work": all the content, except the paragraph "Taxonomics & LLMs", is described in [60]. The authors should have focused this part on the novelties of their approach.
- Section 3, "Methodology": Section 3.1, "Dataset collection algorithm", is almost identical to the corresponding sections in [59] and [60]. This includes Figure 1a and Figure 1b (identical in [59]), columns 1 and 2 in Table 1, the formal algorithm in Section 3.1.1 (also in [59]), the model fine-tuning Section 3.3, Table 2, …
- Section 4, "Instruction Taxonomy Learning Results", is almost identical to Section 3.4 in [59] (Figure 2 is new)
- Section 5, "Downstream tasks application", relates to the tasks considered (Taxonomy Construction, Hypernym Discovery, Taxonomy Enrichment, Lexical Entailment). Except for the column "TaxoLLaMA_multi" in Table 3, all the other descriptions and results are described identically in:
- Table 3 in [60], for the Taxonomy Construction task
- Table 1 and Figure 3 in [60], for the Hypernym Discovery task
- Table 2 in [60], for the Taxonomy Enrichment task
- Tables 4 and 5, for the Lexical Entailment task
- …
- Section 6, "Ablation Study", includes Tables 7 and 8, which are identical to Tables 3 and 4 in [59]. Here, Section 6.2, dealing with "Self-refinement for constructed graph", seems to be a novel contribution of this paper.
- Section 7, "Error Analysis", includes Figure 5 (identical to Figures 4 and 5 in [59]) and Table 12 (identical to Table 6 in [60])
- Finally, all appendices A-E are identical to Appendices A, B, C, D and F in [60].

Globally, the part corresponding to Graph Cycle Resolution and Refinement seems to be the novel contribution of the paper. The authors should focus on this and provide experimental evidence of how it improves on their previous approaches. All the repeated content should be dropped or very heavily summarised, with references to the original papers.

Finally, regarding the URL provided for resources, the authors should make clear which of these resources relate to this manuscript and which to the previous papers, since the provided link points to paper [59], "Are Large Language Models Good at Lexical Semantics? A Case of Taxonomy Learning", published at LREC-COLING 2024.

Review #3
By Pablo Calleja submitted on 17/Feb/2025
Suggestion:
Minor Revision
Review Comment:

This paper explores the potential of Large Language Models (LLMs) to understand, learn, and work with taxonomies through their concepts and relationships. It proposes a methodology for constructing a dataset from a taxonomy for various tasks. This methodology is applied to WordNet. The resulting dataset is used to train a model called TaxoLLaMA, which is evaluated on several tasks from the state of the art, achieving improved results in most cases. The paper also includes an ablation study and an error analysis. Additionally, the authors provide the source code and appendices detailing some of the processes involved in the work. This study is a continuation of two previous works.

The paper is innovative, and the methodology for creating a dataset based on a taxonomy is highly relevant to this domain. Moreover, it can be reused and adapted for other taxonomies and languages. The fine-tuning process used to train the TaxoLLaMA model is also significant, as it enables the integration of two fields, LLMs and graphs, to facilitate their joint use. This is a particularly relevant research area in the current state of the art. Furthermore, the results are well supported and validated through extensive evaluations on taxonomy/graph-related tasks from the state of the art.

The paper is well-structured and clearly presents its ideas. At the end, I provide minor comments on some issues found. Certain aspects need improvement for acceptance.

The GitHub repository is well-organized and contains the necessary information to review the work. However, some improvements are suggested, such as specifying the Python version used in the study and hosting the dataset on Zenodo instead of Google Drive. The repository should also clearly distinguish itself from previous versions or explicitly highlight the new contributions of this paper. Notably, the repository retains the header from the previous paper, and its provided link does not appear to be functional.

Below are the minor comments:

Abstract: error in "datasetwe"
Related work:
"Notable examples include CTP". What is CTP?
Methodology:
Line 30, "encompassing various scenarios": which scenarios are those?
"The algorithm can be found in Subsection 3.1": error, this is the same section.
Page 5, Table 1:
In general, tables must have more descriptive captions, especially those that present information from various subsections. The caption of Table 1 should describe the information presented. Also, a cross-reference in the text should be given where the information of the table is being described (e.g., page 6, line 35).
Page 6 line 28. What is q?

Page 7, line 10: MAG PSY and MAG CS have not been presented yet, which is confusing at this point.
Page 7 line 28: an LLama-2

Instruction taxonomy learning results:
Page 8, Table 2: the same problem as above; there is too much information in the table but not in the caption. Moreover, Section 4 is not well explained and related to this table. Paragraph 4 on this page is difficult to understand: more detailed information is needed to explain the results in relation to the table, and likewise in paragraph 6.
Page 8, line 20, "They might impose overly stringent criteria": why?
In this section it is difficult to follow why the chosen model is LLaMA and not Mistral, and why the Mistral experiments have only been done with definitions.

Downstream tasks
Page 9: a reference to Figure 2 could also be provided at the beginning.
Section 5.2: bad indentation; usually, results go in a separate subsubsection.
Table 3 needs more information in the caption.

Ablation Study
Table 8 should be presented before Table 7, following the order in which they appear in the text. The captions are correct in this case, as they are concise and clear, as are their explanations.
There is a mismatch between the headers of Table 8 (2A, 2B, etc.) and the explanation in paragraph 2 of page 14.
Page 17, line 29: DFS has not been explained at all, nor is any reference given for it.

The header on page 21 should be "Bibliography", not "Conclusions".
Some references are not cited in the text, e.g., [16].

Overall, this is an excellent work of significant relevance to the domain of the Semantic Web and Knowledge Graphs. It deserves publication in a journal such as the Semantic Web Journal.

Review #4
Anonymous submitted on 24/Feb/2025
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include
(1) originality
The task of using LLMs for taxonomy-related tasks is not novel, but the work fills a gap by testing the efficacy of newer LLMs (such as LLaMA) for this task. Moreover, the authors introduce a method for creating a training dataset (an instruction dataset) which encompasses multiple tasks, they train a few models with these instructions, and they test the models on downstream tasks. I would say the breadth of the work (with its ablation studies as well) is itself its contribution. The ablation study gives useful insights into using LLMs for the task. The work is based on two previously published papers in peer-reviewed conferences in computational linguistics. The work is extended by: (i) a new examination of LLMs to solve graph cycles, (ii) an investigation of how LLMs can leverage multiple relations, (iii) an exploration of the advantages of bidirectional relations to refine taxonomies, and (iv) inclusion of another WordNet subset (Food).

(2) significance of the results
The TaxoLLaMA model seems to do very well on the taxonomy construction task for different domains and languages, significantly surpassing SOTA results, as shown in Table 4. The ablation study provides new insights for training LLMs on taxonomy-related tasks. In general, the experiments seem sound, but (as mentioned below) improving the general writing of the document would make it easier to understand the article's contributions. Moreover, in my opinion, the technical details and the choices made in the experiments are at times described unclearly.

(3) quality of writing.
The text contains many spelling errors and ungrammatical sentences. Sentences, and even sections and their structure, are often very difficult to follow, with pieces of information frequently not given in a logical order. This makes the text somewhat chaotic and also makes it more difficult to figure out which interesting insights have been derived. This applies mostly to Sections 3 and 5-7. I will give some examples below, but these are not exhaustive. Some definitions might seem obvious to linguists or the natural language processing community, but the article should be self-contained.

Section 3:
I had to reread Section 3 several times to understand what was done; improved writing and organisation would greatly help this section.
* Where do the numbers in Table 1 come from? (It is mentioned that the goal is 1000 samples in the test set, but I only see ~400 and ~800.)
* What are hyperhypernyms?
* What is the difference between TaxoLLaMA_multi, TaxoLLaMA and TaxoLLaMA_bench? (I now see that this is mentioned at the end of the section; please include these explanations earlier, when the table is mentioned.)
* What does it mean to find optimal values for the Bernoulli probabilities? How was this done and optimal in terms of what?
* There are many uses of A-C, making the data collection & formal algorithm difficult to interpret.
* That p and q refer to the chance of (not) being included in the test set should be explicitly stated, since p also refers to a hypernym, which could be confusing.

As mentioned, I think this section should be rewritten, making sure that all symbols and names are explained directly when they are used. Lastly, the section names could be improved and made more coherent (e.g., "formal algorithm" → "dataset creation algorithm" or similar; "Downstream efficient dataset" → I am not sure what this refers to. Does it mean the process of making the sampling most efficient in terms of downstream performance?).

Section 5.1:
“and leave only those below the optimal threshold” → what is the optimal threshold and how was it established?
“We have not used definitions, as they are not given” → what is meant with this?
Perhaps briefly introduce what ‘perplexity calculations’ specifically refer to, as well as ‘hypernym perplexity’.
“we also apply some additional evaluation with LLM for each task that is described in corresponding sections” → unclear sentence

Section 6:
* Table 7 is not very intuitive to read, as at first I did not interpret "easy-hard" as a formula but as a hyphenated term. Perhaps the formula could be moved to the definition of the colours, or each cell could give the easy score, the hard score, and the difference between them in brackets. Also, Table 7 indicates that the scores were higher for ‘easy’, but better for ‘hard’.
* It is great that linguistic experts helped to annotate the data, indicating which terms are specialised and which are not.
* Why is the ablation study only performed on the Environment and Science datasets and not on the Food dataset?
* Figure 4: F-scores, should these be F1 scores?
* Thresholds: what do these scores refer to?
* Table 10 should indicate “an overall improvement …” in terms of cycle resolution; I am missing the values with which I should compare the perplexity (cycles) values.

Section 7:
* Examples from the Appendix (e.g., Examples 11 and 12) are referenced in the main text without indicating that they are in the Appendix, which is confusing.
* It would be great if the visualisations could still be added to the main text, or at least if the figure from the Appendix were referenced.

Small things:
* Tables are incorrectly formatted (Table 3, Table 5, etc.)
* Strange sentences occur, such as “The result for other tasks are lower nearly twice or more”

Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess
(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
The dataset is hosted on Google Drive (why not on GitHub itself?), referenced via a GitHub repository, and the data structure is described in the repository's README.md.
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
The provided resources (dataset and fine-tuning strategies) are available on GitHub. However, the repository could also include the experiments as well as examples of how to run the code, to make the experiments more reproducible.
(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and
As mentioned, the dataset is on Google Drive. I would strongly recommend moving the data to Zenodo or to the respective GitHub repository.
(D) whether the provided data artifacts are complete.
The repository includes the code and data for the previously published paper(s), but not for the experiments or additions done for this journal paper.