Review Comment:
This paper has two main contributions:
- a new dataset for AFC on Wikidata, together with a novel annotation guideline, and
- a new AFC pipeline called ProVe, which is shown to outperform the state of the art on the above dataset.
I find the first contribution to be compelling and highly useful for future research in AFC.
However, I have some concerns regarding the proposed ProVe pipeline.
The pipeline lacks originality; more specifically, it comprises components that are not novel, as they are reused from other work without significant modification. This alone is not a problem, as I agree with the authors that the overall pipeline is still useful. Moreover, it is shown to outperform the state of the art.
Nevertheless, I find some claims and decisions made in building the pipeline problematic. I understand that properly justifying every single choice would require an ungodly amount of ablation studies. Still, I will highlight these issues below in detail, especially those that can be addressed by rewriting and/or revisiting the related work.
Next, I find the data and code provided to be incomplete. The text extraction external link brings me here: https://anonymous.4open.science/r/WD_Textual_References_Dataset-510B/ which is empty. The other materials seem to be there (I haven’t personally tested anything), but I have to say that they are messy. For instance, I’m not sure others need to see all these Jupyter Notebook files; a few modular Python scripts that can be called to run the different scenarios would be tidier. Finally, while I see that each component is nicely separated into folders, I don’t see instructions on how to seamlessly run everything end-to-end (or how to integrate it with Wikidata, which is the stated goal of the work).
Other than that, I find the overall new dataset and the result discussion to be clear and well-written, though in some parts the pipeline results are not particularly strong (e.g. ~0.4 F1-score on sentence selection, 0.6-0.75 F1-score on RTE, and class imbalance on RTE with no reported attempt at addressing it). It would be great to have more discussion of these results and how they could be improved in the future.
Detailed comments are below.
---
#1 Highlights related to claims and decisions.
–
#1.1 “HybridFC assumes sentences will be retrieved by either of those systems, ProVe relies on state of the art LMs for data-to-text conversion.” → HybridFC also uses a state-of-the-art LM as part of its data-to-text conversion. Whether HybridFC uses FactCheck or not is orthogonal to its use of a state-of-the-art LM.
–
#1.2 “Lastly, evidence document corpora normally used in general AFC tend to have a standard structure or come from a specific source. Both FEVER [25] and VitaminC [50] take their evidence sets from Wikipedia, with FEVER’s even coming pre-segmented as individual and clean sentences. Vo and Lee [51] use web articles from snopes.com and politifact.com only…. KGs, however, accept provenance from potentially any website domains. As such, unlike general AFC approaches, ProVe employs a text extraction step in order to retrieve and segment text from triples’ references.” → Other techniques such as HybridFC also perform snippet extraction. Although HybridFC is limited to Wikipedia, it does not rely on Wikipedia’s structure, only on its text, so in principle it could be applied to other websites. Therefore, I do not think this warrants such a strong claim on ProVe’s side.
–
#1.3 “Additionally, explainability for task-specific graph architectures, like those of KGAT and DREAM, is harder to tackle than for generalist sequence-to-sequence LM architectures which are shared across the research community [52–54]. Slightly decreasing potential performance in favour of a simpler and more explainable pipeline, ProVe employs LMs for both sentence selection and claim verification.” → I don’t see enough evidence that GNNs are harder to explain than seq2seq NLP models, or that LMs are more explainable than GNNs. I have personally never seen any work that directly compares the two (otherwise, the authors should provide references). If the authors are arguing (by proxy) that there has been more progress, qualitatively or quantitatively, on the explainability of NLP models than of GNN models, then this is a weak argument for it. Even qualitatively, to name just a few, there are GraphXAI and GraphLIME for GNN explainability. The current ProVe also does not address explainability; it is left as future work.
–
#1.4 “As a subtask in AFC on KGs, claim verbalisation is normally done through text patterns [27, 29] and by filling templates [28], both of which can either be distantly learned or manually crafted. ProVe is the first approach to utilise an LM for this subtask. Amaral et al. [55] shows that a T5 model fine-tuned on WebNLG achieves very good results when verbalising triples from Wikidata across many domains. ProVe follows suit by also using a T5.” → subject-verb agreement: Amaral et al. “show”.
What’s the goal, or the purported benefit, of using a fine-tuned LM in verbalization? This is not so clear to me and it’s not well-argued (nor sufficiently supported by evidence) in the paper or in the references. Is it just for the sake of being different?
I’d argue that rule-based (text-pattern) verbalization approaches can still work well as long as they are consistent and general enough, and that, for AFC purposes, a downstream sentence selection method (like the one used in HybridFC or KGAT) can recognize the similarity between the verbalization and the supporting sentence well enough. In the original WDV paper, I also don’t see any discussion of whether the T5 model is clearly a better alternative than a rule-based approach for AFC. Rather, the ~80% adequacy, ~3-4/5 fluency, and weak inter-annotator agreement reported in the WDV paper make me wonder whether it really is a good enough alternative.
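To illustrate the kind of rule-based baseline I have in mind, a per-predicate template verbaliser can be as simple as the sketch below (hypothetical Python with made-up templates; this is my suggestion, not the authors’ method):

```python
# Hypothetical per-predicate templates for verbalising Wikidata triples.
# A sketch of the rule-based baseline suggested above, not code from the paper.
TEMPLATES = {
    "P569": "{s} was born on {o}.",  # date of birth
    "P108": "{s} works for {o}.",    # employer
    "P31":  "{s} is a {o}.",         # instance of
}

def verbalise(subject_label: str, predicate_id: str, predicate_label: str, object_label: str) -> str:
    """Fill the predicate's template with the subject and object surface labels."""
    template = TEMPLATES.get(predicate_id, "The {p} of {s} is {o}.")
    return template.format(s=subject_label, p=predicate_label, o=object_label)

print(verbalise("Douglas Adams", "P569", "date of birth", "11 March 1952"))
# -> "Douglas Adams was born on 11 March 1952."
```

Such verbalisations are consistent by construction, which is exactly the property a downstream sentence selection model would need to exploit.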
As far as I understand, the WDV data is itself manually crafted and distantly supervised, since it relies on the claims’ surface labels from Wikidata (which are crowd-labeled) to build the seq2seq data, so in this regard it is not an obvious improvement over the previous work either.
From a broader point of view, while it might be true that within the AFC context this is the first work to leverage a fine-tuned encoder-decoder model for triple verbalization, it has been done in other contexts many times. For example, in “Semantic Triples Verbalization with Generative Pre-Training Model” by Blinov (fine-tuned GPT-2 model) and “Denoising Pre-Training and Data Augmentation Strategies for Enhanced RDF Verbalization with Transformers” by Montella et al. (fine-tuned BART-like model). In the context of ontology alignment, many approaches have verbalized triples and leveraged language models to check whether a pair of verbalized triples are similar; for instance, one can check papers submitted to the annual OAEI (Ontology Alignment Evaluation Initiative).
All in all, I’d suggest stating the motivation for using an LM here more explicitly, softening the novelty claim, and citing verbalization work from other contexts.
–
#1.5 “Following on KGAT’s [22] and DREAM’s [21] approach to FEVER’s sentence selection subtask, ProVe employs a large pre-trained BERT transformer. ProVe’s sentence selection BERT is fine-tuned on the FEVER dataset by adding to it a dropout and a linear layer, as well as a final hyperbolic tangent activation, making outputted scores range from −1 to 1. The loss is a pairwise margin ranking loss with the margin set to 1.” etc. → I looked at KGAT’s repository and this is exactly how it is done there. Does ProVe just reuse KGAT’s sentence selection approach, or is there any novelty here? From the textual explanation I would think the two (ProVe’s and KGAT’s sentence selection) are the same, yet these paragraphs verbosely re-explain what previous work has done while adding no additional context. If the authors want to be verbose, they could instead explain choices that were left unexplained in KGAT’s paper but are reused in ProVe. For instance, why go with tanh and a margin ranking loss when a standard softmax layer with a cross-entropy loss would also work (see the sketch below)? In Equation 7, the negative tanh output (ρ) is scaled to 0 anyway, right? So why not use a softmax and set the irrelevant pairs to 0 and the relevant ones to 1? Also, why include a dropout layer here but not in the RTE model?
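For concreteness, the two head designs I am contrasting look roughly like this (a PyTorch sketch based on my reading of the description, not the authors’ code; the hidden size and dropout rate are assumed):

```python
import torch
import torch.nn as nn

hidden = 1024  # assumed [CLS] dimension for BERT-large

# (a) Head as described in the paper: dropout + linear + tanh,
#     trained with a pairwise margin ranking loss (margin = 1).
ranking_head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden, 1), nn.Tanh())
margin_loss = nn.MarginRankingLoss(margin=1.0)

pos_cls, neg_cls = torch.randn(8, hidden), torch.randn(8, hidden)  # dummy [CLS] vectors
pos_score = ranking_head(pos_cls).squeeze(-1)
neg_score = ranking_head(neg_cls).squeeze(-1)
loss_rank = margin_loss(pos_score, neg_score, torch.ones(8))  # push pos above neg by the margin

# (b) The simpler alternative I am asking about: a binary softmax classifier
#     with cross-entropy, labelling relevant pairs 1 and irrelevant pairs 0.
clf_head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden, 2))
ce_loss = nn.CrossEntropyLoss()
logits = clf_head(torch.cat([pos_cls, neg_cls]))
labels = torch.cat([torch.ones(8, dtype=torch.long), torch.zeros(8, dtype=torch.long)])
loss_clf = ce_loss(logits, labels)
```

If both heads end up being thresholded the same way downstream, the paper should say why the ranking formulation is preferable.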
It is also implied that ProVe re-fine-tunes the sentence selection model. But if it is exactly the same as KGAT’s, why re-fine-tune? Why not just use the already fine-tuned model? Going further, if we are simply reusing fine-tuned models, why not explore newer/larger models tuned for sentence similarity, which are potentially better than the three-year-old KGAT? For example, the SBERT models or many other models in the Hugging Face repository.
There is also the strange choice of using BERT-large, when KGAT showed that BERT-base is better on the dev set and RoBERTa-large is better on both the dev set and the test set. Why use the option that is neither the most accurate nor the most efficient?
–
#1.6 About the RTE model:
There is a small (but important) contradiction to address: the textual explanation says the concatenation order is the verbalized triple followed by the piece of evidence, but in formula 6 it is the other way around.
This matters because BERT has a maximum input length. If the concatenation is reversed, the verbalized triple might be cut off, which should never happen. However, cutting off a piece of evidence is also undesirable, as the important parts might be lost. How do the authors mitigate this? I don’t think this is explained.
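If the implementation uses the standard Hugging Face tokenizer, the mitigation could be as simple as the following (a sketch under that assumption, with a placeholder checkpoint; the paper should state what is actually done):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

claim = "Douglas Adams was born on 11 March 1952."  # verbalized triple
evidence = "Some long passage extracted from the referenced web page ..."

# Passing the verbalized triple as the first segment and truncating only the
# second segment guarantees the claim is never cut off; only the evidence is.
encoded = tokenizer(
    claim,
    evidence,
    truncation="only_second",
    max_length=512,
    return_tensors="pt",
)
```

Even then, truncating the evidence can drop the decisive span, so the paper should also clarify how long the extracted passages actually are.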
–
#1.7 About the stance aggregation: this section is too verbose. The three strategies are each very straightforward. I think the authors should just stick with the best one they found, explain it as clearly (but concisely) as possible, and leave the rest in an appendix as an ablation study.
Still, I do have some concerns about strategies 2 and 3. In the second strategy, it is assumed that supporting pieces of evidence take precedence over refuting ones. Why? This is neither clear nor trivial. As an extreme example, if there is a single supporting piece of evidence but ten refuting ones, the strategy will still aggregate them as supporting (see the sketch below). How does this make sense?
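As I read the description, the precedence rule amounts to something like the following (my own reconstruction, not the authors’ code), which makes that failure mode explicit:

```python
# My reading of aggregation strategy 2: any supporting evidence wins outright.
def aggregate_strategy_2(stances):
    """stances: per-evidence labels in {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}."""
    if "SUPPORTS" in stances:
        return "SUPPORTS"
    if "REFUTES" in stances:
        return "REFUTES"
    return "NOT ENOUGH INFO"

# The extreme case from my comment: one supporting sentence outweighs ten refuting ones.
print(aggregate_strategy_2(["SUPPORTS"] + ["REFUTES"] * 10))  # -> "SUPPORTS"
```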
In the third strategy, the authors employ a simple classifier that takes the sentence selection and RTE probabilities, plus the number of evidence pieces, as features. Why are these considered sufficiently good features for the intended model? We are dealing with textual data and already have language models at hand, so why not leverage these powerful models for this purpose as well (see the sketch below)? In theory, they should offer better “features” than those used in strategy 3, shouldn’t they? Please motivate this classifier a bit more.
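To be concrete about what strategy 3 reduces to, here is a hypothetical sketch with made-up feature names and values (not the authors’ actual features or data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [max sentence-selection score, mean RTE support prob.,
#            mean RTE refute prob., number of evidence sentences].
# Values are invented purely for illustration.
X = np.array([
    [0.91, 0.80, 0.05, 4],
    [0.40, 0.10, 0.75, 6],
    [0.55, 0.30, 0.20, 1],
    [0.85, 0.60, 0.30, 3],
])
y = np.array([1, 0, 0, 1])  # 1 = triple supported by its reference, 0 = not

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.70, 0.55, 0.25, 2]]))
```

A handful of scalars is cheap and interpretable, but the paper should argue why this beats letting the LM see the evidence directly.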
---
#2 Grammar and clarity
–
Use “state-of-the-art”, not “state of the art”, when used as an adjective (multiple occurrences in the paper).
“HybridFC converts retrieved evidence sentences into numeric vectors through sentence a embedding model.” → word order: “a sentence embedding model”.
“We define ontological predicates as those whose meaning serves to structure or describe the ontology itself, such as subclass of and main category of).” → dangling ‘)’
“These approaches score sentences based on relevance to the claim and use a supervised classifier to classify the entire web page.” → classify into what?
“Alternatively, triples with particular predicates can be easily selected.” → as an alternative to what?
“The existence of a document retrieval step depends on whether provenance exists or needs to be searched from a repository, with the former scenario dismissing the need for the step. This is the case for ProVe, but not for the DeFacto line, which search for web documents.” → which “searches”. Also, this is confusing to me: which “former scenario” removes the need for document retrieval? And how would one know beforehand whether provenance already exists within a repository? That doesn’t sound trivial.