Plausibility Assessment of Triples with Instance-based Learning Distantly Supervised by Background Knowledge

Tracking #: 1546-2758

Authors: 
Soon Hong
Mun Yong Yi

Responsible editor: 
Guest Editors ML4KBG 2016

Submission type: 
Full Paper
Abstract: 
Building knowledge bases from text offers a practical solution to the knowledge acquisition bottleneck. However, this approach has introduced another fundamental issue: rampant triples with erroneous expressions that are rarely found in human expressions. Unlike human expressions, triples are guaranteed neither to be plausible nor to make sense because they are extracted by machines that have no semantic understanding of the given text. Triple validation is a way of pruning erroneous triples. Recent research, however, does not consider this fundamental difference and has some limitations. First, the difference between plausibility and truth has not been well understood. A true/false framework, which is more suitable for validating human expressions, has been used to validate triples. As a result, some researchers perform plausibility assessments but jump to conclusions about truth or correctness. Second, most researchers use contrived negative training data because it is difficult to define what “negative” means for plausible or true triples. Third, the synergy of combining a Web-driven approach and a knowledge base-driven approach has not been explored. This paper reviews the process of triple validation from a different perspective, improving upon the knowledge base building process. It conceptualizes triple validation as a two-step procedure: a domain-independent plausibility assessment followed by domain-dependent truth validation applied only to plausible triples. It also proposes a new plausible/nonsensical framework overlaid with a true/false framework. The paper focuses on plausibility assessment of triples by challenging the limitations of existing approaches. It attempts to build both positive and negative training data consistently, using distant supervision by DBpedia and Wikipedia. It adopts instance-based learning to skip the generation of pre-defined models, which have difficulty dealing with the variable expressions of triples. The experimental results support the proposed approach, which outperformed several baselines. The proposed approach can be used to filter out newly extracted nonsensical triples and existing nonsensical triples in knowledge bases, as well as to learn semantic relationships. It can be used on its own, or it can complement existing truth-validation processes. Extending structured and unstructured background knowledge remains for future investigation.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 24/May/2017
Suggestion:
Reject
Review Comment:

The paper describes an approach for assessing the plausibility of DBpedia triples. The main claim of the paper is that triples can be classified as plausible or nonsensical. The paper claims that this classification can be carried out using a combination of evidence from instance and ontology-based generalisation. The introduction is very well written and argues for the need to distinguish triples of the two classes plausible and nonsensical. The authors then argue for some possible applications of their results, which all make sense. I would have refrained from using the term "nonsensical", as it demands a model of "making sense" which is not discussed by the authors.

Suggestion 1: I would suggest that the authors alter the label to implausible, as the term clearly hints at the more statistical approach followed by the paper.

The related work section covers a good portion of the relevant literature. However, (1) it cites old versions of some papers (e.g., [18], for which a more recent journal version exists) and (2) the authors fail to consider the plethora of fact-checking works in the area of natural language processing.

Suggestion 2: In particular, I would suggest that the authors consider the generalisation framework from Pasternack et al., which refers to quite a few fact-checking algorithms.

Some of the claims made by the authors w.r.t. the performance of other approaches are rather difficult to fathom. For example, the authors state that both positive and negative training data are semantically irrelevant to test data.

Question 1: What do the authors mean by this claim?

Section 3 describes the approach proposed by the authors. The quality of the writing degrades from this point on (see Minor comments).

Suggestion 3: I would suggest that the author of Section 1 have a thorough look through Section 3 onwards.

The authors discuss their approach to generating unlabelled structured data. They argue for using 4 methods for measuring the paradigmatic similarity of individuals.

Question 2: Why were these 4 measures chosen?

The authors then present how node substitutions are carried out and give an equation for the selection of nodes for single-node substitutions (see Eq. (5)). One of the measures is dropped without any further mention in the paper.

Question 3: Why is slink dropped?

Two-node substitutions are then presented by the authors. Their implementation foresees the simultaneous (ergo independent) substitution of subjects and objects. The authors then claim that the chances of obtaining "meaningful" triples are higher using this method. I disagree. For all 1-to-k mappings in a population with n resources that can be connected by a given property, the chance of finding a correct (in the sense of being in the knowledge base, which I guess is the definition of meaningful, given that meaningful is not defined in this context) triple after a random pick of the object node is k/n. This is exactly the same probability as that of picking one of the k*n correct pairs from the n^2 possible pairs.

Question 4: Can the authors please prove their claim?
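To make this counting argument explicit, here is a minimal worked version (in LaTeX), under the 1-to-k assumption stated above, i.e. each of the n candidate subjects has exactly k valid objects for the given property:

\begin{align*}
P(\text{correct triple from a single-node substitution of the object}) &= \frac{k}{n},\\
P(\text{correct pair from an independent two-node substitution}) &= \frac{k \cdot n}{n^{2}} = \frac{k}{n}.
\end{align*}

The two probabilities coincide, so the claimed advantage does not follow from this counting argument alone.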

The authors then argue for how to replace predicates. This section is unclear to me.

Question 5: Do the authors replace predicates with terms and not with URIs? If yes, then they do not generate triples by virtue of the definition of RDF triples. If not, can the authors clarify how they go from dictionary entries back to RDF resources?

Section 3.2 looks into labelling training data. The QUORUM operator is used but not introduced. If the authors assume this operator to be a common-knowledge entity, I'd suggest the following:

Suggestion 4: Add a reference to the QUORUM operator. Basically, the authors use proximity queries based on a search engine, an approach they themselves criticise in the state-of-the-art section on Web search engines. It is, however, unclear how the authors would port their approach to knowledge bases other than DBpedia (where a knowledge base/corpus duality is not to be found) without using a Web search engine.

Question 6: How would the authors port their approach to other knowledge bases?
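For clarity on what I mean by proximity/quorum queries: in Sphinx's extended query syntax, a quorum query of the form "w1 w2 w3"/2 matches documents containing at least two of the listed words. Since the paper does not show how its queries are constructed, the sketch below (with hypothetical helper and term names) is only an illustrative guess, not the paper's actual procedure:

# Illustrative only: one way a quorum query could be built from a triple's
# surface terms. The function and the term lists are hypothetical.
def quorum_query(subject_terms, predicate_terms, object_terms, quorum=2):
    """Return a Sphinx extended-syntax quorum query that matches documents
    containing at least `quorum` of the listed terms."""
    terms = subject_terms + predicate_terms + object_terms
    return '"%s"/%d' % (" ".join(terms), quorum)

print(quorum_query(["Barack", "Obama"], ["birthPlace"], ["Honolulu"], quorum=3))
# -> "Barack Obama birthPlace Honolulu"/3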

The computation of p-values is based on the log-likelihood ratio statistic following a chi-squared distribution. The classification is carried out using k-NN. Note that the authors do not say when a triple is to be classified as unknown. The generalisation-level classification is unclear: a resource can belong to m classes.

Question 7: Do the authors generalise based on the m classes? If yes, the owl:Thing - predicate - owl:Thing should come out of their considerations. That does not seem to be the case so what do the authors do exactly? Most specific classes (some nodes have several)?
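For concreteness on the p-value computation mentioned above, here is a minimal sketch assuming the usual G-test setup, in which the log-likelihood ratio statistic of a 2x2 co-occurrence table is asymptotically chi-squared distributed with one degree of freedom (the paper's exact formulation may differ):

# Sketch of a p-value from a log-likelihood ratio (G) statistic; assumes a
# 2x2 co-occurrence table, which may differ from the paper's exact setup.
import math
from scipy.stats import chi2

def g_test_p_value(table):
    """table: 2x2 observed counts [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = rows[i] * cols[j] / n
            if observed > 0:
                g += 2.0 * observed * math.log(observed / expected)
    # For a 2x2 table, G is asymptotically chi-squared with 1 degree of freedom.
    return chi2.sf(g, df=1)

print(g_test_p_value([[30, 10], [15, 45]]))  # small p-value indicates strong association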

The authors then claim that the test triples are classified based on the generalisation triples. Basically, the generalisation triples return the domain and range of the predicate. This means that, in the evaluation, the authors partly measure how well the ontology fits the data found in Wikipedia. Given that the ontology was derived from Wikipedia, it is far from surprising that the generalisation triples perform well.

The evaluation is extensive. However, a few points of the experimental setup are unclear and make it rather difficult to appreciate the quality of the results.

Question 8: Did the authors select the 4945 triples or were they partly generated? How were the triples selected? Which triples did the authors use to generate nonsensical triples? What are correct concepts for nodes of predicates? When is a triple predicted to be "unknown"? How can this be when k=1 seems to be used (see Table 8)?

The predicates were not to be too specific.

Question 9: How was specificity measured and defined?

I believe the authors use "concepts" in the sense of OWL.

Suggestion 5: Replace concept with class, etc. or make clear that you mean it in the sense of OWL.

Plausible test triples were generated using similar rules.

Question 10: Which rules exactly?

The authors present a basic configuration for their experiments in Table 8.

Question 11: Could you give an intuition for the choice of these values?

Given that the exact experimental setup is unclear, it is rather difficult to really judge the quality of the results. Overall, the authors do study the influence of their parameters on the quality of their results, which seems adequate. The results with the generalisations are far from surprising, as they simply reflect that the ontology matches the data in DBpedia. It is unclear how the annotations from the human annotators were actually used in the experiments. The baselines used are unclear and do not reflect the state of the art in Section 2 of the paper. The authors claim in the conclusion that they take an open-world perspective, but this is not reflected in the paper. On the contrary, the authors' stating that what has not been observed in a corpus such as Wikipedia is nonsensical goes against the open-world assumption.

Overall, the paper is an interesting attempt to go beyond the black-and-white classification of triples into true and false. The authors push towards labelling triples as nonsensical or plausible. While I commend the authors for their idea, the implementation (especially the technical part of the paper and the evaluation) still demands more than a major revision (basically a rewrite). The authors use a plethora of terms without any explanation (e.g., QUORUM, concept, correct, etc.). The evaluation setup being unclear makes an assessment of the results rather difficult. The performance comparison is only done with SDValidate (not a major issue, but some of the other frameworks are open source as well).

Review #2
By Peter Flach submitted on 25/May/2017
Suggestion:
Major Revision
Review Comment:

= Summary =
This paper develops a method for building a triples knowledge base with instance-based learning. The key idea is to separate plausibility assessment at the instance level from truth assessment at a more general level. The work is experimentally evaluated on a dataset collected by the authors.

= Strong points =
1. Sensible approach, seems to work well.
2. Interesting dataset has been collected.
3. Extensive experiments.

= Weak points =
1. The method has many parameters which were tuned using the test labels.
2. Some aspects not clear, e.g. closed-world vs. open-world.
3. Writing is labored in places, repetitive in others.

= Assessment =
I enjoyed reading the paper and I think it is a good piece of work that eventually deserves to be published. However, I picked up some issues that need addressing. The main ones are the following:

A. On p.10 the results of evaluation on a test set are given, with parameter settings as in Table 8. Reading on, it becomes clear that these parameter settings were derived from other experiments *on the same test set*. From a machine learning perspective this is a cardinal sin, as it bears the danger of overfitting: it lowers the chances that the same parameter settings will work well on independent test data. Instead, the parameters should have been set on an independent validation set (see the sketch below). A related issue is that there are many parameters, which generally means that you need a lot of validation data to set them.

B. On p.15 there is a second evaluation against baselines. This part of the paper seems rushed as the four PAUST variants are not explained, and I don't know what it means to adopt the open/closed-world assumption in this particular case. Saying that particular test triples are "treated as [true/false] negatives" is curious: isn't that the job of the classifier? To fully evaluate the results I also need to know what the class distribution was (better give TP/FP/FN/TN numbers).

C. The paper states a couple of times that the difference between truth and plausibility is poorly understood and that the work clarifies this. This is an overstatement in several senses. First, there is a massive literature in logic and philosophy addressing exactly this. Second, the way the paper "clarifies" this is purely empirical: it is shown that making the distinction in a particular way helps solve a particular task better. Third, related work is criticised in this respect in strong terms but without actually making the case ("[14], [16], and [32] seem to perform plausibility assessments but hastily reach to conclusions about truth or correctness. However, being plausible does not imply being true", p.3).

It may well be possible to address these issues satisfactorily, but until that has been done the value of the work cannot be fully assessed. Hence I think it requires a major revision.
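To illustrate the protocol I am asking for in point A, here is a minimal sketch in which a single hyperparameter (k for k-NN, as an example) is chosen on a held-out validation split and the test split is scored only once at the end. The feature representation, split sizes, binary labels, and scikit-learn setup below are illustrative assumptions, not the paper's actual pipeline:

# Minimal sketch: hyperparameters are chosen on a validation split, and the
# test split is used only once, for the final model. All names are illustrative;
# binary labels (e.g. plausible vs. nonsensical) are assumed.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def tune_and_evaluate(X, y, candidate_ks=(1, 3, 5, 7)):
    # Hold out a test set that is never touched during tuning.
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # Split the remaining data into training and validation sets.
    X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)
    best_k, best_score = None, -1.0
    for k in candidate_ks:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        score = f1_score(y_val, clf.predict(X_val))
        if score > best_score:
            best_k, best_score = k, score
    # Refit on all development data with the chosen k; report test performance once.
    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_dev, y_dev)
    return best_k, f1_score(y_test, final.predict(X_test))

When validation data is scarce, nested cross-validation is the more data-efficient alternative.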

= Disclaimer =
My expertise is in machine learning rather than semantic web technology and I assessed the paper from that perspective.

== Further comments, typos etc. ==

Throughout the paper: don't use "performances" in plural unless you're in a concert hall!

p.1 The first half of the abstract is rather dense and contains material better discussed in the introduction. More generally, I found the abstract and the first two sections unnecessarily repetitive: e.g., we are told three times that current approaches have three specific limitations and that the proposed approach intends to overcome those.

p.2 Table 1 could be better formatted to indicate more clearly what the tuples are.

p.4 "Paradigmatic similarities between DBpedia individuals and the nodes of test triples are measured based on the following four properties:" -- this may be common in the semantic web community, but I find it curious that the four URIs following this sentence are not given any narrative description.

p.4 The description of S_category and S_context are so similar that they would better be merged.

p.5 "the first few lines of each Wikipedia article" -- should this be DBpedia?

p.5 What are "ranked DBpedia individuals"? What is the ranking method?

p.5 "The third and fourth rows in Table 2" -- it is unusual to count the heading as a row, so this should be second and third (or bottom two). Similar for other tables. In fact, Tables 2-5 are so small they could easily be combined into one.

p.7 You talk about QUORUM before you introduce the Sphinx search engine on p.8.

Section 3.2.2: You can just say you use the likelihood ratio test and omit the details.

p.8 I suspect d_i in Eq.13 should be d_t. And Eq.14 would be much more readable if you used fhat(x_l) for the inner argmax.

p.8 The first two paragraphs of Section 4 should come much earlier in the paper.

p.10 How do you determine whether a predicate is too specific, or administrative?

p.10 Give the results in Section 5 as confusion matrices rather than reporting F1, precision, recall, etc. And don't give percentage point increases, the reader can see that for herself.

p.11 The ROC and PR curves are quite poor in the first half of the ranking; it would be good to understand better why this is so.

p.12 "It was empirically determined" -- how? See also comments about tuning on the test set.

p.16 "well not understood"

Review #3
By Diego Esteves submitted on 08/Aug/2017
Suggestion:
Major Revision
Review Comment:

In the paper "Plausibility Assessment of Triples with Instance-based Learning Distantly Supervised by Background Knowledge", the authors discuss one of the main issues when creating knowledge bases: data quality. In this context, "triple scoring" methods are designed to detect false or unlikely information that can mislead. Triple-scoring algorithms are defined as a sub-type of fact-checking models. The authors correctly claim that such algorithms and frameworks can improve the data quality in KBs. Furthermore, they discuss the concept of "plausibility", which can be taken into consideration during preprocessing in order to minimize the error propagation of those algorithms. Finally, they try to overcome an important issue existing in most fact-checking pipelines, which is the dependence of such methods on expanding predicates (building all possible predefined relationships is difficult, especially for open domains).

The major contributions the authors claim (research questions) are as follows:

#RQ1) the analysis of plausibility in RDF triples
#RQ2) use of distant supervision to generate negative examples
#RQ3) the synergy of combining Web-driven approaches and knowledge base-driven approaches

The proposed approach fits well within the scope of the SWJ journal, and the impact of studying these RQs on the scientific community is high. I very much liked the paper, which addresses an important research problem. However, I think the paper needs some modifications/clarifications before it can be accepted.

Main comments are as follows:

1-Firstly, "plausibility" in fact checking frameworks is definitely not new. It was introduced 7 years ago by Pasternack. "Jeff Pasternack and Dan Roth. 2010. Knowing what to believe (when you already know something). In Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 877-885."

2-"Saying this, human expressions can fall into two categories: true or false." -> I definitely do not agree with this claim. Human expressions are rather fuzzy by nature and can be very interpretative and subject to other aspects, such as time span. Thus, human expressions may fall into many categories, such as "partially true". For examples, see http://www.politifact.com/truth-o-meter/statements/

3-In related work, the authors cited frameworks for "plausibility assessment" (nonsensical and plausible) and "truth validation". It is not clear to me how the authors have performed the latter. Which features are defined? At what stage of the pipeline does the shift from the first to the second phase occur? The paper is lacking a clear structure, and different sections are mixed and hard to follow.

4-Related to this, there is a good comparison for the "plausibility assessment" step, both for the training data generation process and the classification step. However, why did you not consider comparing the "truth validation" feature of your model to other related works that are designed to perform this task? How much better is this model ("PAUST") compared to other models?

5-It is also not clear where exactly web searches are performed (RQ3). How do you link this data to the data you have?

6-Distant supervision is greatly limited by the quality of training data, due to its natural motivation of reducing the heavy cost of data annotation. A deeper analysis would be valuable. How well does "PAUST" perform on other triple-scoring datasets? I would strongly encourage this benchmark (e.g.: https://github.com/SmartDataAnalytics/FactBench)

7-[RQ3] may lead to a biased scenario. The authors argue that relation extraction methods do not achieve optimal performance, thus "triple validation" approaches become handy in that sense, which makes very good sense. But using KB + Web to validate triples can be biased (as opposed to using data sources other than KBs). KBs are mostly generated by the aforementioned (automated IE) methods, i.e., you might be assessing an input claim (i.e., s-p-o) based on the data it was derived from. How to deal with that? How to be sure you are not just propagating the error?

8-[RQ1] I would not use the word "understood", but "studied" instead. "Plausibility" itself is clearly an issue; the practical impact in this scenario, though, has not been fully comprehended.

9-The title of the paper might be too long. Why not something simpler that conveys the same idea, e.g., "Plausibility Assessment of Triples with Distant Supervision" or even "Plausibility Assessment of Knowledge Base Triples with Distant Supervision"? Also, "distant supervision" implies background knowledge per se, so the title is a bit redundant/repetitive.

10-What is the critical value defined to classify triples?

#Strengths
-The proposed framework ("PAUST") has a good range of applications and candidate use cases, for instance in data quality and question answering contexts.
-Although "plausibility" was introduced many years ago, it has not been fully studied in practical scenarios. Therefore, this paper contributes positively towards filling this gap.
-The authors propose an interesting method to generate negative data.

#Drawbacks
-The project ("PAUST") is not open source and the code is missing (only the dataset is published). If the target is to get the work truly utilized and appreciated, pushing the research boundaries within the community, it makes sense to release both code and data.

#minor

a) "Building knowledge bases from texts requires..." -> this sentence repeats the previous paragraph.
b) "are “innately” plausible..." -> "intrinsically", "naturally", etc. may sound better
c) please consider updating [18] to "Gerber, Daniel, et al. "Defacto—temporal and multilingual deep fact validation." Web Semantics: Science, Services and Agents on the World Wide Web 35 (2015): 85-101."
d) formula 13, what is "i" (k?)
e) formula 14, what is "xt"? (lk?)

