End-to-End Trainable Soft Retriever for Low-resource Relation Extraction

Tracking #: 3656-4870

Authors: 
Kohei Makino
Makoto Miwa
Yutaka Sasaki

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper
Abstract: 
This study addresses a crucial challenge in instance-based relation extraction using text generation models: end-to-end training on the target relation extraction task is not applicable to retrievers due to the non-differentiable nature of instance selection. We propose a novel End-to-end TRAinable Soft K-nearest neighbor retriever (ETRASK) based on a neural prompting method that uses a soft, differentiable selection of the $k$ nearest instances. This approach enables the end-to-end training of retrievers on target tasks. On the TACRED benchmark dataset in a low-resource setting where the training data was reduced to 10\%, our method achieved a state-of-the-art F1 score of 71.5\%. Moreover, ETRASK consistently improved the baseline model by adding instances in all settings. These results highlight the efficacy of our approach in enhancing relation extraction performance, especially in resource-constrained environments. Our findings offer a promising direction for future research on relation extraction and the broader application of text generation in natural language processing.
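For context, a "soft, differentiable selection of the $k$ nearest instances" can in general be realized by weighting the $k$ nearest instance embeddings with a softmax over their negative distances, so that gradients flow back through the selection. The following is a minimal PyTorch sketch of that generic idea, not the paper's ETRASK formulation; all names are placeholders.

    import torch

    # Generic soft k-NN selection: combine the k nearest instance embeddings
    # with softmax weights over their negative distances. The output is
    # differentiable w.r.t. the query and the instance embeddings.
    # Illustrative sketch only, not the paper's exact method.
    def soft_knn(query: torch.Tensor, instances: torch.Tensor, k: int,
                 temperature: float = 1.0) -> torch.Tensor:
        """query: (d,); instances: (n, d). Returns a (d,) soft combination."""
        dists = torch.cdist(query.unsqueeze(0), instances).squeeze(0)  # (n,)
        top = torch.topk(-dists, k)             # k nearest = largest -dist
        weights = torch.softmax(top.values / temperature, dim=0)       # (k,)
        return weights @ instances[top.indices]                        # (d,)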
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 03/May/2024
Suggestion:
Reject
Review Comment:

This paper addresses the relation extraction problem. It introduces an end-to-end trainable method based on a k-nearest neighbor retriever. The paper addresses a relevant problem, which is within the scope of the journal. The proposed method is original, and encouraging results have been reported.

The main limitation of the paper lies in its presentation. The paper lacks clarity in the description of the target problem, the motivation is described unclearly, and several key concepts are not introduced clearly. The description of the proposed method is unclear and lacks rigor; as described, the method cannot be reproduced. Based on the abstract, the proposed solution is suitable for resource-constrained environments, yet the validation protocol does not include an assessment of the efficiency of the proposed method. The results are not conclusive in demonstrating the superiority of the proposed method over the baselines.

Specific comments:

1. The context and the target problem are defined unclearly in the abstract. The provided overview of the proposal is unclear. It is unclear what the authors mean by “low-resource setting.” Also, it is unclear which “settings” are considered in the experiments.

2. The motivation for instance-based relation extraction using text generation models is unclear.

3. The overview of the literature provided in the introduction is unclear. The use of long sentences compromises the reading flow. On several occasions, it is hard to follow what is being conveyed.

4. The motivation is also presented unclearly. The position of the proposed method with respect to the state of the art is unclear.

5. It is unclear what the authors mean by “simple vote” in the introduction.

6. “This study addresses these challenges:” It is unclear to which challenges the authors refer.

7. The related work section needs to be revised carefully. The writing is confusing, and the take-home message is unclear. The presentation of key concepts (e.g., RAG, soft prompt, neural prompt) is not rigorous. For example, the relevance of the math notation used in Section 2.1 is unclear.

8. The motivation behind the use of LoRA is unclear.

9. The paper lacks a schematic presentation of the proposed method, including all of its components (especially those investigated in the ablation study). The structure of the paper should be aligned with the sections describing the main components of the proposed method.

10. The pipeline provided in Figure 1 is unclear. Its components are not described in the paper (nor in the caption).

11. Algorithm 1 is not described in the text, and several of its elements are not introduced clearly. What is the difference between $W$ and $w_i$? What does $s_l$ stand for? It is unclear what "j" is used for.

12. The paper lacks rigor in the presentation of the math notation adopted. Several variables are not defined.

13. It is unclear what the authors mean by instances that are "closely aligned."

14. The notation adopted in Eq. 7 is unclear. Should it be $Dist(A_l, B_l)$?

15. It is unclear how the neural model is constructed.

16. The notation used in Equation 8 is unclear. Is $P_i$ a vector? $P_i$ is not defined.

17. It is unclear how the warm-up step works.

18. The paper lacks an analysis of the efficiency aspects related to the use of multiple samples.

19. Do SuRE (Flan-T5) and SuRE refer to the same method? Which SuRE is used in the ablation study?

20. No distinctive patterns are observed in the results reported in Fig. 3. Are those results conclusive?

21. The paper lacks an in-depth discussion of the provided results. What are the limitations of the method? What insights could be shared regarding cases of failure? In which scenarios would the use of the proposed method be more promising?

Review #2
Anonymous submitted on 18/Jul/2024
Suggestion:
Major Revision
Review Comment:

The paper introduces ETRASK, a method that combines a generation-based relation extraction module with a preliminary kNN-like retriever for addressing large knowledge bases. The main novelty lies in the fact that this retriever is fully differentiable, which makes end-to-end training of the system possible.

The paper presents the problem and the main contributions quite well, positions the work well within the (well-reported) literature, and marks the limitations of previous work that the authors want to overcome. The scientific arguments behind the construction of the differentiable retriever are quite solid and could find application in other work as well, combining this approach with a number of other systems, not only in relation extraction but also in content-based recommender systems or link prediction.

On the other side, there are some aspects of the paper that need to be addressed before publication.
The most evident one is the writing style, which is quite clear and readable until around page 6, after which it becomes more arduous, with some repetition and some doubtful expressions (which I report later).

This makes parts of the method hard to understand, leaving the reader with some doubts:
1_ In Algorithm 1, what does the letter "l" (as in $s_l$) refer to? Why do we iterate over "j" if "j" is never used in the body of the iteration? I believe there is an error in the writing, which needs to be corrected for the final assessment. As written, for every "i" we compute the same identical $w_i$ k times, incrementing $s_i$ by the same quantity k times (see the first sketch after this list).
2_ The training has been realised with reduced versions of the dataset (100%, 10%, 5%, 1%), randomly sampled (as made explicit on page 7, line 26). It is not specified, however, whether the resulting subsets have the same relation distribution as the full dataset (see the second sketch after this list).
3_ On page 8, several templates are mentioned (input template, relation template, summary template). The input template is in Figure 2, while the summary template is in the text at line 29 (why this difference?). What does the relation template look like? It is not clear what the role of each template is, whether the "${something}" pattern is supposed to be filled in by the LM or in some other way, and how the templates are ingested.
4_ The authors should probably better explain what they mean by "the learning collapsed"; it is also not clear why ETRASK has been applied to SuRE-T5, which is not the best-performing model according to Table 2.
5_ If Table 3 contains the ablation study results on the 10% dataset, the RANDOM scenario (71.0) is potentially higher than the full scenario (71.5±0.5). How is such a high result possible with randomly selected instances? This may potentially invalidate your hypothesis. Moreover, is this RANDOM scenario about randomly selecting entities instead of using the retriever? (If so, it should be clarified better in the text.)
6_ How do you explain the recall and precision going up and down in Figure 3?
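To make the concern in point 1_ concrete, here is a minimal Python sketch of Algorithm 1 as it is literally written (my own reconstruction with placeholder names, not the authors' code; `weight` stands in for the unspecified weight computation):

    from typing import Callable, Sequence

    def algorithm_as_written(
        query: Sequence[float],
        D: Sequence[Sequence[float]],
        k: int,
        weight: Callable[[Sequence[float], Sequence[float]], float],
    ) -> list[float]:
        s = [0.0] * len(D)
        for i, d_i in enumerate(D):
            for j in range(k):               # j never appears in the body,
                w_i = weight(query, d_i)     # so w_i is identical each pass
                s[i] += w_i                  # and s_i ends up as k * w_i
        return s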
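Regarding point 2_, a label-preserving subsample of the kind I am asking about could be drawn with stratified sampling, for example as below (an illustration of my question, not the authors' procedure; `examples` and `relation_labels` are placeholders):

    from sklearn.model_selection import train_test_split

    # Draw a subsample (e.g., fraction=0.1) that preserves the full set's
    # relation label distribution. Illustrative sketch only.
    def stratified_subset(examples, relation_labels, fraction, seed=0):
        subset, _ = train_test_split(
            examples,
            train_size=fraction,
            stratify=relation_labels,  # keep per-relation proportions
            random_state=seed,
        )
        return subset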

Finally, no implementation of the system is available in an online repository. This makes it harder to replicate and reuse this research. I hope the authors will consider openly publishing their code and experiments.

Other issues
1_ While reading, I was quite confused about whether the instances are entities (tokens) or sentences. In the end, I gathered that they are sentences, but this should probably be clarified as early as page 5.
2_ Similarly, ETRASK is ultimately a technique that can be applied on top of existing state-of-the-art methods, but this becomes clear only on page 8. Until that point, the reader thinks that you are also proposing the extractor.
3_ Two paragraphs on page 8 (from line 24 to ~40) need proofreading and harmonisation, because they are quite hard to read (maybe a figure would help?) and include some typos ("^the^ SuRE", "flame work"). In general, the whole Section 4 needs proofreading.
4_ The caption of Table 4 is not clear.
5_ In Sec. 5.2, can you better explain (maybe using examples) what it means that "instances' relation labels match the extraction target's correct relation label"? And that "the entity is included in the instance as a string"? Are we speaking about exact matches, or matches with some tolerance? Are synonyms allowed (or possible)?

Review #3
By Daniel Hernandez submitted on 15/Aug/2024
Suggestion:
Major Revision
Review Comment:

This paper proposes a method to predict the relation between two entities according to a context given in a text. Unlike existing methods, the method proposed in this paper is end-to-end trainable. The authors also show that their method has advantages when training resources are scarce.

The paper addresses a relevant problem for the Semantic Web, which is extracting knowledge from text, in the realistic setting of having few training resources. The authors also show that the proposed method has positive results, and the method introduces some original ideas. However, there are many issues regarding the writing of the paper that make it impossible for me to fully understand the method, and even the problem. So, I recommend the paper for a major revision.

The major issue is that the problem is not clearly explained. Only in the Methodology section did I understand some aspects of the problem. Equation 1 defines an optimization problem to predict the relation r that is the argmax of the probability P(r | x, h, t), where r is a relation, x is the text, and h and t are two entities that could be related. Since h and t are entities, it is implicit that there is a set of entities E and that the entities should be referred to in the text x. It is not clear whether there are highlighted subtexts of x that are linked to the entities. For example, Figure 1 shows some examples of highlighted text but does not specify the entities' existence.
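For concreteness, this is my reading of Equation 1 as a decision rule (the candidate set R and the probability function P are my own placeholders, since the paper does not define them explicitly):

    from typing import Callable, Iterable

    # My reading of Eq. 1: choose the relation r that maximizes P(r | x, h, t).
    def predict_relation(
        x: str, h: str, t: str,
        R: Iterable[str],
        P: Callable[[str, str, str, str], float],
    ) -> str:
        return max(R, key=lambda r: P(r, x, h, t))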

Figure 1 has several issues that make it difficult to understand. A figure like this should distinguish between data and processes. There is an arrow going from the text (which I take to be the input text) to the Relation Instance Database. This could be interpreted as saying that the Relation Instance Database was created from the text by a non-specified process. I guess that arrows represent processes between data, because the retrieval process is denoted with an arrow. However, the relation extractor is denoted not with an arrow but with a box. This lack of uniform notation makes the figure very confusing. Also, the Relation Extractor takes as input some text with some subtext highlighted, and a document that should contain a list of sentences extracted from the Relation Instance Database. I understand that this extraction from the Relation Instance Database is the Retriever (which is called "retrieval process" in Figure 1). So, the document should be z, and the text with two highlighted subtexts should be the tuple (x, h, t). Hence, the Relation Extractor takes the parameters (x, h, t, z), which are the parameters of the function on the right side of Equation 4. It required significant effort to understand this figure and make it compatible with what is stated in the text.

I understand that the authors propose replacing the probability expression P(r | x, h, t) with the model TextGenerationModel(r, x, h, t, Retriever(x, h, t, D)) defined in Equations 3 and 4. Notice that I changed the notation "|" to "," because "|" is only used for probabilities. The output of the Retriever provides an additional parameter for the model, which is called a prompt, but it should be clear that it is a text like x. Thus, the retriever outputs a text z, and not a set of instances as is suggested in lines 32-34, page 5.
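To illustrate this reading of Equations 3 and 4, here is the data flow I reconstructed (all function names are my own placeholders, not the paper's API):

    from typing import Callable, Sequence

    # My reading of Eqs. 3-4: the retriever maps (x, h, t, D) to a text prompt
    # z (a text like x, not a set of instances), which the generation model
    # consumes together with the original input to produce the relation r.
    def extract_relation(
        x: str, h: str, t: str, D: Sequence[str],
        retriever: Callable[[str, str, str, Sequence[str]], str],
        generator: Callable[[str, str, str, str], str],
    ) -> str:
        z = retriever(x, h, t, D)
        return generator(x, h, t, z)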

In this paper, a retriever returns a text z, but in other works a retriever returns a set of instances. Maybe I do not understand what a retriever is.

It is not defined what the database D is. On page 5, lines 25 and 43, it is said that D contains instances, so I may guess that D is a set of instances. However, it is never explained what an instance is. Is an instance a synonym for an entity, or is an instance a text with two highlighted text spans and a relation connecting them, as depicted in the result of the retrieval process in Figure 1? I cannot understand what D is.

On page 5, line 22, the text "verbalized input" appears without an explanation of what it is. Is there a non-verbalized input? If the inputs and outputs of the problem were formally explained, then these adjectives would not be necessary.

On page 6, line 36, you write “single discrete instance.” What does this mean? Is a discrete instance different from an instance? What is an instance?

Equation 6 defines a set K. It also needs to say that dᵢ is in D (this is said on the previous page, but formulas must be self-contained). Similarly, the formula does not explain what sᵢ is. I can infer that it is a distance because it appears in Algorithm 1, but this is not clear enough.
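For reference, here is how I interpret the intended definition of K (my own reconstruction with placeholder names; I take sᵢ to be the distance of instance dᵢ from the query):

    from typing import Sequence

    # My interpretation of Eq. 6: K contains the k instances d_i in D with
    # the smallest distances s_i.
    def nearest_k(D: Sequence, s: Sequence[float], k: int) -> list:
        order = sorted(range(len(D)), key=lambda i: s[i])  # ascending s_i
        return [D[i] for i in order[:k]]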

There are several terms whose meaning was not defined: prompt, soft prompting, virtually selected, instance, and entity.

If I am not wrong, the method samples the instances. How does this sampling affect the accuracy of the relation extraction?