Review Comment:
This manuscript was submitted as a 'full paper' and should be reviewed along the usual dimensions for research contributions, which include
(1) originality
Using LLMs for taxonomy-related tasks is not novel, but this work fills a gap by testing the efficacy of newer LLMs (such as LLaMA) for these tasks. Moreover, the authors introduce a method for creating a training dataset (an instruction dataset) that encompasses multiple tasks, train several models on these instructions, and evaluate the models on downstream tasks. I would say that the breadth of the work (including its ablation studies) is its main contribution; the ablation study gives useful insights into using LLMs for the task. The work builds on two papers previously published in peer-reviewed computational linguistics conferences and extends them with: (i) a new examination of LLMs' ability to resolve graph cycles, (ii) an investigation of how LLMs can leverage multiple relations, (iii) an exploration of the advantages of bidirectional relations for refining taxonomies, and (iv) the inclusion of another WordNet subset (food).
(2) significance of the results
The TaxoLLaMA model seems to perform very well on the taxonomy construction task across different domains and languages, significantly surpassing SOTA results, as shown in Table 4. The ablation study provides new insights for training LLMs on taxonomy-related tasks. In general, the experiments seem sound, but (as mentioned below) improving the general writing of the document would make it easier to understand the article's contributions. Moreover, IMO the technical details and the choices made in the experiments are at times unclearly described.
(3) quality of writing.
The text contains many spelling errors and ungrammatical sentences. Sentences, and even entire sections and their structure, are often very difficult to follow, with pieces of information frequently not given in a logical order. This makes the text somewhat chaotic and makes it harder to figure out which interesting insights have been derived. This applies mostly to Sections 3 and 5-7. I give some examples below, but these are not exhaustive. Some definitions might seem obvious to linguists or to readers from the natural language processing domain, but the article should be self-contained.
Section 3:
I had to reread Section 3 several times to understand what was done; better writing and organisation would greatly improve this section.
* Where do the numbers in Table 1 come from? It is mentioned that the goal is 1000 samples in the test set, but I only see ~400 and ~800.
* What are hyperhypernyms?
* What is the difference between TaxoLLaMA_multi, TaxoLLaMA and TaxoLLaMA_bench? (I now see that this is explained at the end of the section; please include these explanations earlier, where the table is first mentioned.)
* What does it mean to find optimal values for the Bernoulli probabilities? How was this done, and optimal in terms of what?
* There are many uses of the symbols A-C, making the data collection and the formal algorithm difficult to interpret.
* It should be explicitly stated that p and q refer to the probability of (not) being included in the test set, since p is also used to denote a hypernym, which could be confusing.
As mentioned, I think this section should be rewritten, making sure that all symbols and names are explained directly where they are first used. Lastly, the section titles could be improved and made more coherent (e.g., 'Formal algorithm' → 'Dataset creation algorithm' or similar; 'Downstream efficient dataset' → I am not sure what this refers to. Does it mean the process of making the sampling most efficient in terms of downstream performance?).
Section 5.1:
* "and leave only those below the optimal threshold" → What is the optimal threshold, and how was it established?
* "We have not used definitions, as they are not given" → What is meant by this?
* Perhaps briefly introduce what 'perplexity calculations' specifically refer to, as well as 'hypernym perplexity'.
* "we also apply some additional evaluation with LLM for each task that is described in corresponding sections" → This sentence is unclear.
Section 6:
* Table 7 is not very intuitive to read, as I at first interpreted 'easy-hard' not as a formula but as a hyphenated term. Perhaps the formula could be moved to the definition of the colours, or each cell could give the easy score, the hard score, and the difference between them in brackets. Also, Table 7 indicates that scores were higher for 'easy', but that scores were better for 'hard'.
* It is great that linguistic experts helped to annotate the data, indicating which terms are specialised and which are not.
* Why is the ablation study only performed on the Environment and Science datasets and not on the Food dataset?
* Figure 4 reports F-scores; should these be F1 scores?
* Thresholds → What do these scores refer to?
* Table 10 should show the claimed "overall improvement …" in terms of cycle resolution; I am missing the values with which I should compare the perplexity (cycles) values.
Section 7:
* Examples from the Appendix (e.g., Examples 11 and 12) are referenced in the main text without any indication that they are in the Appendix, which is confusing.
* It would be great if the visualisations could be added to the main text, or at least if the figure from the Appendix were referenced.
Small things:
* Tables are incorrectly formatted (Table 3, Table 5, etc.).
* Strange sentences occur, such as "The result for other tasks are lower nearly twice or more".
Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess
(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
The dataset is hosted on Google Drive (why not on GitHub itself?) and referenced from a GitHub repository; the data structure is described in the repository's README.md.
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
The provided resources (dataset and fine-tuning strategies) are available on GitHub. However, the repository could also include the experiments as well as examples of how to run the code, to make the experiments more reproducible.
(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and
As mentioned, the dataset is hosted on Google Drive. I would strongly recommend moving the data to Zenodo or to the respective GitHub repository.
(D) whether the provided data artifacts are complete.
The repository includes the code and data for the previously published paper(s), but not for the experiments or additions carried out for the journal paper.