Review Comment:
# Publication Summary
The paper presents ProVe, an automatic approach to verify a given statement from a knowledge graph based on the references that are listed in the statement's provenance. The approach generates search phrases based on the statement and generates overlapping text passages from the reference documents. It is worth pointing out that ProVe is able to process references that contain plain text as well as HTML web pages. From this large set of passages, the Sentence Selection step assigns a relevance score to each passage and selects the top 5 passages for further processing. To this end, a BERT model is used for scoring. For each of these top passages, the Textual Entailment Recognition step assigns a score to each of the three FEVER classes, namely SUPPORTS, REFUTES, and NOT ENOUGH INFO. This step also relies on a BERT model. Finally, the Stance Aggregation step uses the class and relevance scores as well as further features of the passages as input and returns a final classification indicating whether the reference mentioned in the statement's provenance supports the statement or not. Three different aggregation methods are proposed: a weighted sum, a rule-based approach, and a classifier (Random Forests are used in the evaluation).
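To make the final two stages concrete, the following Python sketch shows how I understand the Sentence Selection output feeding into the weighted-sum variant of the Stance Aggregation. All names, scores, and the exact weighting scheme are illustrative assumptions of mine, not the authors' implementation.

```python
# Minimal sketch of the weighted-sum aggregation over the top-k passages;
# names, values, and the weighting are illustrative, not the authors' code.
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    relevance: float   # relevance score from the Sentence Selection BERT model
    class_probs: dict  # TER probabilities for SUPPORTS / REFUTES / NOT ENOUGH INFO

def weighted_sum_verdict(passages: list, top_k: int = 5) -> str:
    """Aggregate per-passage TER probabilities into one verdict (weighted-sum variant)."""
    top = sorted(passages, key=lambda p: p.relevance, reverse=True)[:top_k]
    totals = {"SUPPORTS": 0.0, "REFUTES": 0.0, "NOT ENOUGH INFO": 0.0}
    for p in top:
        for label, prob in p.class_probs.items():
            totals[label] += p.relevance * prob  # relevance acts as the weight
    return max(totals, key=totals.get)

# Example usage with made-up scores:
passages = [
    Passage("Douglas Adams was born in Cambridge.", 0.9,
            {"SUPPORTS": 0.8, "REFUTES": 0.1, "NOT ENOUGH INFO": 0.1}),
    Passage("He also wrote several novels.", 0.4,
            {"SUPPORTS": 0.2, "REFUTES": 0.1, "NOT ENOUGH INFO": 0.7}),
]
print(weighted_sum_verdict(passages))  # -> SUPPORTS
```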
The authors also present the Wikidata Textual References (WTR) dataset. The dataset is based on Wikidata. First, 7M references are selected. The authors sample a subset and process the statements and their references with ProVe's Sentence Selection. Crowdsourcing workers annotate each of the system's selected passages with one of the three FEVER classes with respect to the statement. The same is done for a set of evidence for a given statement. Finally, the authors annotate the references themselves with one of the three classes. WTR contains 416 statement-reference pairs with 76 distinct Wikidata properties.
Section 5 of the paper contains a detailed evaluation. Where possible, the single steps of the approach are evaluated on their own. I won't go into the details of the evaluation. However, the main result of evaluating the complete pipeline is that ProVe performs best if a) the Stance Aggregation uses a Random Forest classifier and b) the classification is carried out as a binary classification (supports yes/no) instead of a multi-class classification over all three FEVER classes.
Section 6 discusses results and limitations. For example, the presented version of ProVe takes neither qualifiers within the provenance nor negations within the extracted passages into account.
# Review Summary
## Originality
The main difference between ProVe and the related work is its use of language models. It could be argued that their application is obvious and that language models have already been used in the related area of fake news detection. However, to the best of my knowledge, there is no work that uses language models for this particular application area. I also think that the problem of training the models is solved in a reasonable way by using the FEVER dataset.
The dataset is a well-described resource that can be useful for the community in the future.
## Significance of the Results
The WTR dataset can have a significant impact in the future. However, it is not possible to assess the significance of the evaluation results for ProVe. The performance of the system is not compared to any other system within Section 5.4. While Sections 5.1–5.3 look at intermediate results and it is understandable that these are not compared to the related work (since other systems may not even produce a comparable intermediate result), it is not clear why the overall system's performance is not compared to the performance of the related work. This is a major issue which is further detailed in the next section.
## Quality of Writing
Overall, the writing is good. However, some parts of the paper are lengthy in comparison to their content and should be further improved. I think this is a minor issue in the current state of the paper and can easily be solved. I have listed several suggestions further below.
## Open Science Data
The repeatability of the experiments already seems good but can be further improved.
- The code of ProVe is linked in the paper. It seems to come with the necessary data (e.g., the BERT models) and intermediate results to repeat the experiments. Unfortunately, I didn't have the time to rerun anything, so something might be missing that I am simply not aware of. However, the paper should contain some information about the parameters. For example, the paper contains the statement "Population Based Training was used to tune learning rate, batch sizes, and warmup ratio" (p11 l46–47). However, the parameter values that produce the final evaluation results are not listed in the paper, and I also couldn't find them in the linked source code.
- The dataset is available on Figshare and linked in the paper. The dataset is documented in Annex B of the paper. However, I found it surprising that the dataset itself does not contain any README file. It might be better to include such a README as part of the dataset, describing the data and linking to other sources, e.g., the source code of the paper.
## Conclusion
The submission is already in good shape. It has some minor issues in the following areas:
- References and Citations
- Inconsistencies
- Writing
The paper has a major issue with respect to its evaluation: the reported numbers rely on a newly created dataset, and ProVe is not compared to any other system (details follow below). This leads to my conclusion that a major revision is required for this submission.
# Major: Evaluation
At first glance, the evaluation is very detailed. The single steps of the pipeline are evaluated one after the other. However, I see two gaps that make the evaluation fail its main purpose.
1. It focuses on a single use case while the introduction names three. As one of the paper's inconsistencies, this is a minor issue and is explained further below in the respective section.
2. The evaluation is mainly based on the WTR dataset created by the authors. The different variants of ProVe are compared in the final experiment. However, the main problem is that ProVe is not compared to any other system. As a result, although we can see accuracy values, it is not possible to judge whether these values are good or not. Maybe the WTR dataset is very simple and the values are low. Maybe the dataset is very hard and the values are good. This major problem of missing interpretability has to be fixed before the paper can be published. I see several ways in which this could be done and I am confident that the authors are able to choose and implement one (or even find a better way).
1. Compare to system(s) of the related work
Several times within the paper, DeFacto and FactCheck are named as systems that are closely related to ProVe. Hence, a comparison with these systems seems reasonable, especially since the third use case for ProVe (p2 l47–48) fits the DeFacto and FactCheck systems very well. The setup of the experiment would of course be up to the authors. However, as a co-author of FactCheck, I would like to point out that a comparison on the WTR dataset might be slightly unfair, since the dataset has a very large number of distinct properties compared to the number of triples (76 properties vs. 416 triple-reference pairs). This ratio would be fatal for DeFacto and FactCheck, which would only be trained on this small number of examples, while parts of ProVe can make use of the much larger FEVER dataset. A comparison on an updated version of the FactBench dataset might be more reasonable.
2. Add a baseline
If ProVe is only made for the particular use case of classifying the stance of triple references and a comparison to systems like DeFacto and FactCheck does not seem to be possible, the introduction of a reasonable baseline could help. A "standard" approach would show the difficulty of the task and the dataset and make the accuracy values interpretable (a sketch of what such a baseline could look like follows below). However, the paper would then have to include a) a reasonable baseline and b) an argument why the baseline can be used for comparison while other systems are not.
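As an illustration only, a minimal lexical-overlap baseline could look like the following Python sketch. The function names, the threshold, and the pre-verbalised claim are placeholder assumptions of mine, not a proposal for the authors' exact setup.

```python
# Purely illustrative lexical-overlap baseline; threshold, function names, and the
# pre-verbalised claim are placeholders, not a tuned or proposed system.
def token_overlap(claim: str, passage: str) -> float:
    """Fraction of the claim's tokens that also occur in the passage."""
    claim_tokens = set(claim.lower().split())
    passage_tokens = set(passage.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & passage_tokens) / len(claim_tokens)

def baseline_verdict(claim: str, passages: list, threshold: float = 0.6) -> str:
    """Call a reference supporting if any passage covers most of the claim's tokens."""
    best = max((token_overlap(claim, p) for p in passages), default=0.0)
    return "SUPPORTS" if best >= threshold else "NOT ENOUGH INFO"

print(baseline_verdict("Douglas Adams born Cambridge",
                       ["Douglas Adams was born in Cambridge in 1952."]))  # -> SUPPORTS
```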
# Minor: References and Citations
- The following references in the bibliography refer to preprints on arxiv.org although the referenced papers have been published at research conferences. I think it would be better to acknowledge the achievement of the papers' authors in having published at major conferences by citing the conference versions accordingly:
[3] COLING 2018 https://aclanthology.org/C18-1283/
[7] EMNLP 2018 https://aclanthology.org/W18-5517/
[10] ACL 2020 https://aclanthology.org/2020.acl-main.549/
[11] ACL 2020 https://aclanthology.org/2020.acl-main.655/
[12] ACL 2019 https://aclanthology.org/P19-1085/
[13] EMNLP 2018 https://aclanthology.org/W18-5501/
[14] NAACL 2018 https://aclanthology.org/N18-1074/
[41] NAACL 2021 https://aclanthology.org/2021.naacl-main.52/
[42] EMNLP 2020 https://aclanthology.org/2020.emnlp-main.621/
[47] NLP4ConvAI 2021 https://aclanthology.org/2021.nlp4convai-1.20/
[55] EACL 2017 https://aclanthology.org/E17-2068/
- The introduction of the paper lacks references to back up claims. For example, the first paragraph of the publication starts with an explanation of KGs. First, it refers to the term "knowledge base", which is not further defined. Second, the whole paragraph has a single citation which seems to be connected neither to the explanation of KGs nor to any of the examples in the paragraph. This is especially important since knowledge graphs have a central role within the paper but are not defined anywhere apart from this paragraph. Similarly, the first two sentences of the second paragraph do not have any reference ([2], mentioned in the third sentence, does not seem to back up the claims in these two sentences).
- Citing the AdamW creators would be fair (see https://openreview.net/forum?id=Bkg6RiCqY7#).
# Minor: Inconsistencies
- The introduction defines three different use cases for ProVe (p2 l45–l48): "Firstly, by assisting the detection of verifiability issues in existing references, bringing them to the attention of humans. Secondly, given a triple and its reference, it can promote re-usability of the reference by verifying it against neighbouring triples. Finally, given a new KG triple entered by editors or suggested by KG completion processes, it can analyse and suggest references." The remainder of the paper focuses solely on the first use case; the second and third are never mentioned again. It would be fair if the authors either clearly stated that they focus on the first use case and that the others are outside the scope of the paper, or moved the second and third use cases into the future work section.
- Definition of "AFC" in the text vs. related work. The term "is commonly defined in the Natural Language Processing (NLP) domain as a broader category of tasks and sub-tasks [3–5] whose goal is to, given a textual claim and searchable document corpora as inputs, verify said claim’s veracity or support by collecting and reasoning over evidence." (p2 l38–40) Within the following paragraphs, the authors try to widen the definition of AFC to cover works that do neither rely on textual claims nor documents but only on triples and reference knowledge graphs, e.g., [35, 36]. However, this does not work since the definition that has been given before clearly excludes them. Table 1 points out this dilemma when it contains the task "triple prediction" with triples as input and paths as evidence. However, this task is not further defined throughout the paper. In general, I appreciate that the authors try to point out that there is related work that does not work with natural language sources and representations. However, I think it would be better if such approaches would be presented as part of a closely related field of research ("fact validation" or "fact checking" or "triple classification"; maybe not triple prediction as it can be misunderstood as link prediction which again leads to a lot of other works). In this context, I would also like to point out that the subtask "SS" in Table 1 for [31, 32, 35, 36] does not make sense to me. To the best of my knowledge, they do not use any sentence selection since they solely work with triples.
- Definition of "AFC on KGs". In Section 1, it is defined as "AFC on KGs takes a single KG triple and its documented provenance in the form of an external reference." Section 2 changes this definition to "Given a KG triple and either its documented external provenance or searchable external document corpora whose items could be used as provenance, AFC on KGs can be defined as the automated verification of said triple’s veracity or support by collecting and reasoning over evidence extracted from such actual or potential provenance." The difference in the two versions is that the first one excludes related work like DeFacto and FactCheck while the second definition is formulated in a way that includes them. However, there should be exactly 1 definition for this term that is consistent across the publication. In addition, I would like to encourage the authors to use the active voice in this situation, since (to the best of my knowledge) their work seems to be the only one so far that defines the term "AFC on KGs". Hence, a formulation like "We define AFC on KGs as..." would clearly communicate this.
- Figure 3: The arrows in the figure show that E is a result of the sentence selection. E is then used as input for the Stance Aggregation. However, in the textual description, E is already input for the TER step. Hence, the two descriptions differ in the sense that either the TER step is executed for all (v,p_i) pairs or only for the top 5 p_i.
- p6 l42: The sum of probabilities is not well defined. I assume that the three probabilities for the i-th evidence piece should together sum to 1.0 (see the sketch below). However, this can easily be misunderstood since the formula does not specify whether the sum runs over all k, over all i, or over both.
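For clarity, this is the constraint I assume is intended; the symbol and index range are my own reading of the text, not the paper's exact notation:

```latex
% Assumed intended constraint: the three TER class probabilities sum to one for
% each selected passage individually (symbol and index range are my reading, not
% the paper's notation).
\[
  \sum_{k \in \{\mathrm{SUP},\, \mathrm{REF},\, \mathrm{NEI}\}} \rho_i^{(k)} = 1
  \qquad \text{for each } i \in \{1, \dots, 5\}
\]
```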
# Minor: Writing
I would like to point out that the general writing is good. The paper only contains a small number of writing errors (compared to its length). The errors I found are listed further below. However, the paper is written in a style that is slightly exhausting for the reader.
## Writing Style
The style in which this paper is written comes with two main drawbacks: long sentences with verbose formulations and repetitions.
- The paper comprises very long, verbose sentences with unnecessarily complicated formulations. While a reader of this review may have already noted that I tend to write very long sentences as well, this is typically discouraged in scientific publications. Sentences should be short and to the point. I do not ask the authors to rewrite the whole paper. However, it would be good if they could at least cut the longest sentences into several shorter sentences with a clearer structure. Some examples of long sentences or verbose formulations (this list is not complete; there are many more of them):
-- p10 l31–33: "Sentence selection consists of, given a claim, rank a set of sentences based on how relevant each is to the claim, where relevance is defined as contextual proximity e.g. similar entities or overlapping information." --> "Our sentence selection ranks the generated passages according to their relevance with respect to the given claim. We define relevance as contextual proximity, e.g., similar entities or overlapping information."
-- p10 l40-43: "Fine-tuning is achieved by feeding the model pairs of inputs, where the first element is a concatenation of a claim and a relevant sentence, while the second element is the same but with an irrelevant sentence instead, and training it to assign higher scores to the first element, such that the difference in scores between the pair is 1 (the margin)." --> "We fine tune the model by feeding pairs of inputs. The first element of a pair is a concatenation of a claim and a relevant sentence. The second element is the same claim but with an irrelevant sentence. We train the model to assign higher scores to the first element, such that the score difference is 1."
-- p18 l24: "ranging from as low as 1 to as high as 804" --> "ranging from 1 to 804"
- While the overall structure of the paper is very good, parts of the content are repeated several times. Especially Section 5 repeats a lot of the content of Sections 3 and 4. Some examples (this list is not complete; there are more of them):
-- Terms are introduced repeatedly, e.g., the three types of stances are introduced twice within Section 3.1.
-- Section 3.1 gives the overview of the algorithm twice: first from p6 l37 to l46 and again from l47 to p7 l6. I understand that the first part explains ProVe as a single algorithm while the second part describes Figure 3. However, I would suggest combining these descriptions (if possible).
-- Section 5.1.1 is a repetition of results presented in [46]. From my point of view, it would be better to state that this part of ProVe has already been evaluated, give a short summary of the results, and refer to [46].
-- Section 5.3 explains three times that the top 5 passages from the evidence set are used. This has a) already been clearly defined in Section 3.4 (e.g., Equation 5) and b) could perhaps be repeated once, but not three times, in a section that focuses on the evaluation.
-- The first paragraph of Section 5.3.2.3. repeats content from Section 4.2.
-- p21 l37: "Crowdsourced annotations are collected multiple times and aggregated through majority voting, with authors serving as tie-breakers." While this has already been described in Section 4 and a reader might already be used to this kind of repetition, this sentence may confuse readers since it might not be clear why crowdsourcing is applied _again_ (which is not the case).
## Typos and smaller Errors
- p1 l44: search engines results --> search engine results
- p2 l17: "well explored" --> "well-explored"
- p2 l27: state-of-the-art --> state of the art
- p2 l40: as well a --> as well as a
- Figure 1: It seems like the figure cuts off a part of line 4.
- p5 l28: but use --> but uses (or the previous "does" has to be "do")
- p6 l45: "as well as to calculate" something seems to be missing there.
- p7 l31: the meaning of v is not clear in that line. Is it the sentence or does it represent "the same information" that is expressed? A reader has to read further to understand that it refers to the sentence. I think that v could be introduced earlier to avoid such a misunderstanding.
- p9 l25: "KG triple" --> "in a KG triple" (or something similar)
- p9 l37: "text text"
- p10 l4: s_{j} has a strange whitespace in front of it.
- p10 l6: "a n-sized" --> "an n-sized"
- p11 l22: "a pre-trained BERT" --> "a pre-trained BERT model"
- p11 l42: "‘NOT ENOUGH INFO’" --> "‘NEI’"
- p12 l10: "weighted sum \sigma" --> naming the weighted sum sigma at this point doesn't seem to be correct to me.
- p12 l13: "probability" --> "probabilities"
- p12 l29: "y is 1 is" --> "y is 1 if"
- p12 l29: "the triple-reference is" --> "the triple-reference pair is"
- p13 l47-48: the double quotes around the URL seem to be two single quotes. That should be fixed.
- p14 l10: "according according" --> "according"
- p14 l13: "is carried" --> "is carried out"
- p15 l28: "were carried through" --> "were carried out through"
- p15 l28: "Its structured" --> "Its structure"
- p16 l48: "properly carried" --> "properly carried out"
- p18 l37: "a sanity check of by" --> "a sanity check by"
- p18 l47: "current state-of-the-art, on the" --> "current state of the art on the"
- p19 l8: "(-1 to 1)" --> "($-$1 to 1)"
- p20 l47: "is carried" --> "is carried out"
- p21 l44: "domain such as" --> "domains such as"
- p21 l49: "Using this argmax approach" --> referring to something with "this" at the beginning of a new subsection is not good since it is unclear what this word refers to. This should be formulated in a different way.
- p21 l51: "once can measure" --> "one can measure"
- p25 l20: "As the same time" --> "At the same time"
- p25 l35: The last sentence of 6.2. is incomplete.
# Further Suggestions and Comments
The following comments are "neutral" in the sense that they do not influence my final rating of the paper. Instead, they should be seen as suggestions to further improve the quality of the paper.
- Why is the maximum size of the sliding window only n=2? Are there any results with respect to larger window sizes? According to the argumentation in 5.3.2.2., it would be possible to reach 0% irrelevant passages if we simply always selected the complete document (assuming that the reference contains something of relevance and that the LM is able to identify it). I assume that there is some limitation, and it would be good if the authors could point out why larger passages would not work (see the sketch after this list).
- It is not necessary to introduce abbreviations several times (e.g., AFC, NLP, KG).
- If an abbreviation is introduced, it should be used consistently instead of the term that it abbreviates (e.g., SUPP, REF, NEI).
- I would suggest avoiding unnecessary "rating" adjectives like "simple" as long as the authors do not have a particular reason to use them (e.g., p12 l10/l43).
- In some parts, the text switches between tenses (e.g. 5.3.2.1.). This should be avoided.
- Vague formulations like "can be defined" should be avoided. Instead, it should always be clearly stated how the authors define the terms. (e.g., p19 l32: "one can define the value zero as the threshold")
- Most of the footnotes seem to be misplaced (e.g., 1–4). To the best of my knowledge, a footnote marker should either directly follow the name of the tool without any whitespace or be placed at the end of the sentence (as has been done for footnote 5).
- Figure 1: I like that there is a clear separation of data and processes and that data is the input to a process which outputs new data. However, between "Document Retrieval" and "Evidence Selection", there is no data object. Maybe "Relevant Documents" could make sense between them.
- Figure 4: some text is very small, especially the text on the arrows.
- p10 l3: {S^{i}}_{j} --> S^{i}_{j} to ensure that i and j are vertically aligned, or (better) S_{i,j} to avoid confusion with exponents; if the authors prefer the first solution, I would like to point out that the subscript is typically the start and the superscript is the end (e.g., in integrals).
- p21 l27–31: It is typically good to choose a single format for numbers and to stick to it. If it is necessary to report 4 digits after the decimal point, it should be done for all numbers (i.e., "0.617"-->"0.6170"; "0.76110"-->"0.7611"). Presenting the results in a table might be better in this particular example.
- Figure 5: a dashed vertical line for the relevance score 0 might be helpful within the diagram.
- Figure 7: if the sentence selection assigned relevance scores in the range [-1,1], why is it possible that the line in the diagram has points with a value >0 outside of this range? I would suggest representing the results in a way that avoids this kind of misunderstanding.
- Figure 8: The names "Supports Model" and "Supports Crowd" should be replaced by only "Supports" since the axis labels clearly define that the rows show the crowd sourcing results while the columns show the TER model's class predictions.
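Regarding the sliding-window question in the first bullet above, here is a small Python sketch of overlapping passage generation that makes the role of the window size concrete. The function and its sentence segmentation are my own illustration, not the paper's code.

```python
# Sketch of overlapping sliding-window passage generation; illustrative only.
def sliding_passages(sentences: list, max_window: int = 2) -> list:
    """Return all contiguous sentence groups of size 1 up to max_window."""
    passages = []
    for window in range(1, max_window + 1):
        for start in range(len(sentences) - window + 1):
            passages.append(" ".join(sentences[start:start + window]))
    return passages

sents = ["A was born in B.", "A wrote C.", "C was published in 1979."]
print(sliding_passages(sents, max_window=2))
# With max_window = len(sents), the complete document would also become one candidate passage.
```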