Testing Prompt Engineering Methods for Knowledge Extraction from Text

Tracking #: 3606-4820

Authors: 
Fina Polat
Ilaria Tiddi
Paul Groth

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper
Abstract: 
The parameterized knowledge within Large Language Models (LLMs) like GPT-4 presents a significant opportunity for knowledge extraction from text. However, LLMs’ context-sensitivity can hinder obtaining precise and task-aligned outcomes, thereby requiring prompt engineering. This study explores the efficacy of different prompt engineering methods for knowledge extraction, utilizing a relation extraction dataset in conjunction with an LLM (i.e., the RED-FM dataset and GPT-4). To address the challenge of evaluation, a novel evaluation framework grounded in the Wikidata ontology is proposed. The findings demonstrate that LLMs are capable of extracting a diverse array of facts from text. The research shows that incorporating a single example into the prompt can significantly improve performance, by a factor of two to three. The study further indicates that including relevant examples within the prompts is more beneficial than adding either random or canonical examples. Beyond the initial example, adding more examples exhibits diminishing returns. Moreover, the performance of reasoning-oriented prompting methods does not surpass that of the other tested methods. Retrieval-augmented prompts facilitate effective knowledge extraction from text with LLMs. Empirical evidence suggests that conceptualizing the extraction process as a reasoning exercise may not align with the task’s intrinsic nature or LLMs’ inner workings.

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 18/Jan/2024
Suggestion:
Reject
Review Comment:

---------
Overview
---------

The paper, overall, is well written. In terms of content there is hardly any redundancy or unnecessary repetition in the ideas presented. Indeed, the way the prompting strategies are discussed in Section 3 with corresponding examples sheds very clear light on the methodology; this section almost reads like a manual for other researchers interested in trying out various prompting strategies. Having said this, the paper is full of short paragraphs, which makes it choppy to read; it would benefit from another round of edits to consolidate them into longer paragraphs. Furthermore, as a research paper for the SWJ, I think that in its current form it is very lean on empirical insights, since it reads more like an application of already existing methods to one dataset, perhaps leaning more in the direction of a solid workshop paper on the theme of LLMs.

---------
Strengths
---------

Various flavors of prompt engineering strategies are introduced, very well described, appropriately applied, and soundly tested.

----------
Weaknesses
----------

> I think the paper is relatively lean in terms of contextualizing the work w.r.t. existing work in the field. I remain unsure what the end objective or purpose of this work is. Is it just to test various prompt engineering strategies? And, if so, for which application exactly?

> I think that, as far as knowledge extraction goes, open-domain Wikipedia is certainly one of the more tried-and-tested, successful settings out there, but there are others too, such as the biomedical domain. A notable mention is the BioCreative shared task series (https://biocreative.bioinformatics.udel.edu/tasks/), which each year releases invaluable databases of biomedical relational knowledge, or even the UMLS (https://www.nlm.nih.gov/research/umls/index.html). As an SWJ paper, I would have liked a paper along the lines of this submission to set a broader vision in terms of application and insights regarding the prompt-engineering-for-knowledge-extraction theme. The present version of the work seems too narrowly focused and a bit lean on the empirical generalizability or applicability of the methodologies.

> The empirical evaluations are also lean. Only the GPT-4 model is tested. The downside of using proprietary, closed-source models is that no insights can be obtained in terms of open-source LLM development for future work. More comprehensive tests with at least two or three other, preferably non-proprietary, models, e.g. Mistral 7B [1] or Llama 2 [2], are warranted. The paper sets the tone of an empirical evaluation work; in this regard, while various flavors of prompting strategies are tested, a well-rounded empirical evaluation would include at least two or three more LLMs to offer the reader well-rounded insights.
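To illustrate the kind of comparison I have in mind, here is a minimal sketch (assuming the Hugging Face transformers text-generation pipeline; the model identifiers and the extraction prompt are illustrative and not taken from the paper) of running one and the same prompt through open-source checkpoints:

```python
# Sketch: running the same extraction prompt through open-source LLMs for
# comparison with GPT-4. Model IDs and the prompt below are illustrative.
from transformers import pipeline

OPEN_MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.2",  # Mistral 7B [1]
    "meta-llama/Llama-2-7b-chat-hf",       # Llama 2 [2]
]

PROMPT = (
    "Extract all (subject, relation, object) triples from the text below.\n"
    "Text: Audi AG is a German automobile manufacturer headquartered in Ingolstadt.\n"
    "Triples:"
)

for model_id in OPEN_MODELS:
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    result = generator(PROMPT, max_new_tokens=256, do_sample=False)
    print(model_id, "->", result[0]["generated_text"])
```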

> I like that the paper references the prompt engineering manual (https://www.promptingguide.ai/techniques), but in terms of insights, tying in to the previous point, as a journal paper for the SWJ I would have liked to see more insights and recommendations derived from more LLMs. Is a prompting method consistently strong across the various tested LLMs? Which prompting method is recommended for the task given the respective LLMs?
In general, even if multiple LLMs are not tested, another direction for contribution could be to introduce a new prompt engineering method, which would also show how this work goes beyond existing prompt engineering work as a research paper. Otherwise, tests with other relation extraction datasets would also offer strong empirical insights. For instance, as referenced before, the biomedical domain (e.g., BioCreative, BioNLP) has various strong datasets to support RE.

***********
References
***********
1. Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand et al. "Mistral 7B." arXiv preprint arXiv:2310.06825 (2023).
2. Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).

--------------------------------------
Comments and questions for the authors
--------------------------------------

> Page 4, Line 44: Aren’t “task description” and “direct instruction” the same thing? You seem to say the prompt in lines 47 to 51 follows a zero-shot setting and thereby (see line 44) “there is no task description … incorporated into the prompt.” I think there is a naming problem here. The x-shot setting (or in-context learning) only concerns itself with task examples; I do not think it has anything to do with the task description. In my view, “task description” and “task instruction” seem to be the same thing, and without a task description, how is the LLM to know what is expected of it? Maybe consider using some alternative naming to describe what you mean precisely.
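To make the terminological point concrete, here is a minimal sketch (the wording is mine, not the paper’s): both variants contain a task description/instruction, and only the one-shot variant adds an in-context example, which is what the x-shot terminology counts.

```python
# Illustrative prompts (wording is mine): both contain a task description /
# instruction; only the one-shot variant adds an in-context example.
TASK_DESCRIPTION = "Extract (subject, relation, object) triples from the given text."

ZERO_SHOT_PROMPT = f"""{TASK_DESCRIPTION}

Text: {{input_text}}
Triples:"""

ONE_SHOT_PROMPT = f"""{TASK_DESCRIPTION}

Text: Audi AG is headquartered in Ingolstadt, Germany.
Triples: [("Audi AG", "headquarters location", "Ingolstadt")]

Text: {{input_text}}
Triples:"""
```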

> Page 5: I just had trouble interpreting the phrase “a one-shot RAG prompt” as it was intended by the authors. On my first read, it seemed to refer to the prompt used by the retriever performing the RAG operation rather than the prompt for the LLM. Maybe the authors could consider rewording.

> This is relatively minor and offered as a point of reflection to the authors. Does the content in subsection 3.2 contribute to the theme stated by the overall section title “Prompt Engineering Methods”? To me, it does not quite merit a subsection per se. Instead, the text of subsection 3.2 could be incorporated into subsection 3.3 as motivation.

> Page 9, section 3.6, lines 23 to 33: for the in-context example, i.e. the Audi-related text, was the generated knowledge instance, i.e. “Knowledge: [“Automobile is a/an concept.”, …]”, also automatically generated in a separate step, or is it human-written? What does the whole workflow look like?

> Page 9, section 3.6, lines 23 to 33: I actually do not understand how the generated knowledge is helpful to the desired end goal, i.e. the generation of triples. The aspect mentioned as knowledge, i.e. “Knowledge: [“Automobile is a/an concept.”, …]”, is not quite part of the final set of triples elicited, is it? Maybe an explanation of what the authors deem to be knowledge, and how they have encoded it, would add clarity.
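For reference, my understanding of a typical generated-knowledge workflow is the two-step pipeline sketched below (this is an assumption about the setup, not a description of the authors’ implementation; `llm` stands for any callable wrapping the model):

```python
# Sketch of a typical two-step generated-knowledge workflow (an assumption
# about the setup, not a description of the authors' exact implementation).
def generated_knowledge_extraction(llm, text):
    # Step 1: the LLM produces background statements about the input text,
    # e.g. ["Automobile is a/an concept.", ...].
    knowledge = llm(f"List background facts relevant to this text:\n{text}")
    # Step 2: those statements are prepended to the extraction prompt; they
    # steer the model but are not themselves part of the final triple set.
    return llm(
        f"Knowledge: {knowledge}\n"
        f"Text: {text}\n"
        "Extract (subject, relation, object) triples:"
    )
```

Whether the paper follows exactly this pipeline, and whether the knowledge in step 1 was generated or hand-written, is what needs clarifying.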

> The “Self-Consistency Prompts” method and the “Reason and Act Prompts” method seem quite similar. I think more explanation, justification, or clarification is needed as to when one would use one over the other.
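To spell out my understanding of the difference (a schematic sketch, not the authors’ implementation; `llm`, `parse_triples` and `parse_action` are hypothetical placeholders): self-consistency samples several independent completions and aggregates them, e.g. by majority vote, whereas ReAct interleaves reasoning steps with external actions such as retrieval.

```python
from collections import Counter

# llm, parse_triples and parse_action are hypothetical placeholders for the
# model call and output parsers; they are not functions from the paper.

def self_consistency(llm, prompt, n=5):
    # Sample n independent completions at non-zero temperature and keep
    # only the triples that a majority of the samples agree on.
    samples = [llm(prompt, temperature=0.7) for _ in range(n)]
    counts = Counter(t for s in samples for t in parse_triples(s))
    return [triple for triple, c in counts.items() if c > n / 2]

def react(llm, prompt, tools, max_steps=5):
    # Interleave reasoning ("Thought") with tool calls ("Action") whose
    # results ("Observation") are fed back before the next reasoning step.
    transcript = prompt
    for _ in range(max_steps):
        step = llm(transcript + "\nThought:")
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        action, argument = parse_action(step)   # e.g. ("search", "Audi AG")
        observation = tools[action](argument)
        transcript += f"\n{step}\nObservation: {observation}"
    return transcript
```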

> Page 13, line 15: it says “seventeen” distinct prompting strategies, but Section 3 introduced six (considering 3.2 as part of 3.3). I see from Table 1 what is meant by seventeen. Maybe then consider rephrasing it as “six different prompting strategies under three different prompting settings…”.

> Page 14, lines 37-38: again minor and for readability. When explaining the results, I would suggest including text that directly points the reader to the cell being read, for instance something along the lines of: “compare the numbers in the last columns for the row ‘zero-shot’ versus ‘one-shot’, …”.

Review #2
By Dimitris Kontokostas submitted on 17/Feb/2024
Suggestion:
Minor Revision
Review Comment:

The authors of this manuscript evaluate different prompt engineering methods for knowledge extraction (KE). In particular, they examine the performance of zero-shot, one-shot and few-shot prompting for KE with different large language models and attempt to provide a generic evaluation framework based on the Wikidata ontology.
The manuscript is well-written and easy to follow. The fact that the authors provide the code and all the SPARQL queries in the appendix further strengthens their work.
The background section is short but comprehensive and covers all three aspects the authors worked on: knowledge extraction from generative models, prompt engineering and related evaluation methods. In section 2.3, the authors argue about the inapplicability of existing KE evaluation methods for LLM-based KE. Extending their argument with more details and/or a couple of examples would benefit the reader.
In section 3, the authors explain how they set up their prompt templates. Using diverse prompting approaches that trigger different parts of the inferencing capabilities of a model is a good addition to this work.
Section 4 describes the evaluation setup, where the authors define the models they used and how the Wikidata ontology was used as a generic evaluation framework. Using a schema-driven approach for evaluating knowledge extraction can help scale the methodology and reduce manual annotation. However, regarding the use of the Wikidata ontology, a few things should be noted/accommodated in the manuscript.
1) The Wikidata ontology is dynamic, and minor user edits in top-level classes or metaclasses could significantly affect instance data. For this reason, using a specific Wikidata dump with a local SPARQL endpoint, or performing the complete evaluation in a short (and documented) timeframe, would ensure that possible ontology editing during the execution has no or minimal effect on the results.
2) The Wikidata ontology can get quite noisy, e.g. see [1] and [2], which should be noted in the discussion of the results.
3) The entity and property linking performed in this work is "rudimentary", as the authors also acknowledge in the conclusions, but this is not discussed at all in section 6. Mentioning how linking errors (not finding a match when it exists, or linking to a wrong entity/property) can affect the evaluation would help the reader understand the limitations of the current results. Also mentioning possible ways the linking could be improved would show that the authors have already identified the linking issues and (future) approaches for tackling them. (See the sketch after this list, which illustrates points 1 and 3.)
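As a concrete illustration of points 1 and 3, a minimal sketch (the local endpoint URL, the constraint property/item identifiers, and the linking fallback are my assumptions, not the authors' code) of querying a pinned local SPARQL endpoint for a property's expected subject classes and of rudimentary entity linking via the public wbsearchentities API:

```python
import requests
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed local endpoint serving a fixed, documented Wikidata dump, so that
# ongoing ontology edits cannot affect the evaluation (point 1).
LOCAL_ENDPOINT = "http://localhost:9999/bigdata/sparql"

def expected_subject_classes(prop_id):
    # Read the property's type constraint (Q21503250) and its "class"
    # qualifier (P2308); the identifiers reflect my assumption of how a
    # domain check could be phrased, not the authors' queries.
    sparql = SPARQLWrapper(LOCAL_ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        PREFIX wd: <http://www.wikidata.org/entity/>
        PREFIX p:  <http://www.wikidata.org/prop/>
        PREFIX ps: <http://www.wikidata.org/prop/statement/>
        PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
        SELECT ?class WHERE {{
          wd:{prop_id} p:P2302 ?stmt .
          ?stmt ps:P2302 wd:Q21503250 ;
                pq:P2308 ?class .
        }}""")
    bindings = sparql.query().convert()["results"]["bindings"]
    return [b["class"]["value"] for b in bindings]

def link_entity(label):
    # Rudimentary linking: take the top wbsearchentities hit (point 3).
    # A missing or wrong top hit silently distorts precision and recall,
    # which is exactly the effect section 6 should discuss.
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbsearchentities", "search": label,
                "language": "en", "format": "json"},
        timeout=10,
    )
    hits = response.json().get("search", [])
    return hits[0]["id"] if hits else None
```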
Section 5 looks insightful and well-written and appears to cover all aspects of possible evaluations. Some ideas that could be discussed are how rerunning the exact same input text with the same or slightly different prompts could affect the extracted triples, and how this could affect precision or recall when these triples are combined in various ways; similarly for combining the triples extracted from multiple prompt templates.
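To make the suggestion concrete, a small sketch (with hypothetical triples, not data from the paper) of how precision and recall shift when the triples from two runs of the same input are combined by union versus intersection:

```python
# Hypothetical triples from two runs of the same input text (not from the paper).
run_a = {("Audi", "country", "Germany"),
         ("Audi", "headquarters location", "Ingolstadt")}
run_b = {("Audi", "country", "Germany"),
         ("Audi", "instance of", "automobile manufacturer")}
gold  = {("Audi", "country", "Germany"),
         ("Audi", "headquarters location", "Ingolstadt"),
         ("Audi", "industry", "automotive industry")}

def precision_recall(predicted, gold):
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Union favours recall (more candidates, more noise); intersection favours
# precision (only triples stable across runs survive).
print("union       ", precision_recall(run_a | run_b, gold))
print("intersection", precision_recall(run_a & run_b, gold))
```

The same bookkeeping applies when combining triples extracted with different prompt templates.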

[1] https://upload.wikimedia.org/wikipedia/commons/1/1b/Wikidata_ontology_is...
[2] https://www.wikidata.org/wiki/Wikidata_talk:Ontology_issues_prioritization

Review #3
Anonymous submitted on 06/May/2024
Suggestion:
Minor Revision
Review Comment:

General Comments:
- The paper stands as a good contribution at the intersection of Large Language Models (LLMs), such as GPT-4, and knowledge extraction from text. The study investigates and evaluates state-of-the-art prompt engineering methods for extracting knowledge from textual data. The effectiveness of each prompt engineering method is assessed in terms of its ability to extract accurate knowledge, i.e., triples, from text.
- A brief description of different state-of-the-art prompt engineering methods is provided, with illustrations.
- It would be beneficial to ascertain whether the proposed assessment approach can be generalized to assess other knowledge graphs (KGs) and ontologies.
- In summary, the principal contribution of this work is the exploration and evaluation of different prompt engineering methods and the introduction of an ontology-based evaluation protocol, which collectively enhance the effectiveness and reliability of knowledge extraction from text.

Positive:
- GitHub repository provided for reproducibility
- All SPARQL queries used in the evaluation are included in Appendix A. Appendix B shows a full example of the evaluation process's output.
- Clear examples of various prompt engineering methods and plots demonstrating the evaluation are provided

Negative:
- Suggestion for the GitHub repository: the README is not very clear for reproducibility purposes.
- The methods are not evaluated with respect to the time overhead incurred by each prompt engineering method
- On page 14 (line 35), page 15 (line 33), and page 16 (line 13), the spaces between numbers and the % sign should be removed to maintain consistency
- Novelty of extractions: it is unclear whether results demonstrating the validity of the extracted triples that do not match the domain and range in Wikidata are provided. Furthermore, it should be made more evident how much confidence can be placed in the accuracy of these triples.