Detecting new and arbitrary relations among Linked Data entities using pattern extraction

Tracking #: 1454-2666

Authors: 
Subhashree Balachandran
P Sreenivasa Kumar

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
Abstract: 
Although several RDF knowledge bases are available through the LOD initiative, many data entities in these knowledge bases often remain isolated, lacking metadata and links to other datasets. There are many research efforts that focus on establishing that a pair of entities from two different datasets are indeed semantically the same. Many research efforts have also proposed solutions for extracting additional instances of an already existing relation. However, the problem of finding new relations (and their instances) between any two given collections of data entities has not been investigated in detail. In this paper, we present DART - an unsupervised solution to enrich the LOD cloud with new relations between two given entity sets. During the first phase, DART discovers prospective relations from the web corpus through pattern extraction. In this process, we make use of paraphrase detection for clustering text patterns and the WordNet ontology for removing irrelevant patterns. In the second phase, DART performs the actual enrichment by extracting instances of the prospective relations. We have empirically evaluated our approach on several pairs of entity sets and found that the system can indeed be used for enriching the existing linked datasets with new relations and their instances. On the datasets used in the experiments, we found that DART is able to generate more specific relations compared to the relations existing in DBpedia.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Michael Färber submitted on 07/Oct/2016
Suggestion:
Reject
Review Comment:

The submitted article was reviewed according to the SWJ review guidelines, which state in this case:
"Full papers – containing original research results. Results previously published at conferences or workshops may be submitted as extended versions. These submissions will be reviewed along the usual dimensions for research contributions which include originality, significance of the results, and quality of writing."

The article describes the DART system for enriching Knowledge Graphs (in the article written as "LOD cloud") with new facts, which consist of new relations between already given Knowledge Graph entities. The DART system consists of two main steps: First, potentially relevant relations are gathered from the Web and are used to build patterns. Those patterns are then used in the second step, where statements (with the potentially relevant relations between existing Knowledge Graph entities) are extracted from unstructured text. Evaluations apparently show that statements can be extracted which are not yet in the Knowledge Graph.

== Good points ==
* The article addresses an interesting and promising research area (in the light that Knowledge Graphs get more and more important and interlinked).
* The article is well-written and no significant grammatical errors were found. Hence, the quality of writing is fine.

== Weak points ==
* The originality of the presented approach is very weak. The presented approach is not very sophisticated and the methods of many components have already been developed by others (see more details below).
* Also the results -- as presented in the evaluation section -- are not convincing (see more information below). The system was compared to other systems only to a very limited extent. The evaluation results leave open whether the approach works in real-world scenarios, e.g., when dealing with large amounts of texts and relations.
* The related work section misses important references. The article outlines neither important works on relation extraction nor works on Knowledge Base Population.

In the following, we present some general aspects which need to be clarified by the authors:
* The article writes about "Linked Data entities" and, hence, gives the impression that entities from different data sources might get interlinked. However, as the evaluation shows, only new links within a data source (i.e., Knowledge Graph) are established. The role of Linked Data in the article is therefore marginal and has no direct effect. I would suggest that the authors do not emphasize Linked Data, but rather speak of single Knowledge Graphs which need to be enriched. The authors might also think about revising the title of the article.
* I'm wondering why the DART system is designed to find statements where the relation part is completely new, i.e., arbitrary and not fixed. I would argue that Knowledge Graphs often already contain a given set of relations (at least on the schema level) which can be used for finding new statements. Extending the schema level (T-Box, i.e., performing ontology population) is different from Knowledge Base population on the instance level.
* The article does not address the challenges which arise when performing relation extraction. For instance, relational phrases, as extracted by DART, might be ambiguous, and multiple relations might occur between an entity pair at the same time.

We now point out single items which were noticeable when reviewing the article and which need to be considered for further submissions:

== 1. Introduction ==
* "...unless it is fully grown and updated:" Many data sources (Knowledge Graphs) will never be complete, as new knowledge is arising all the time. Hence, the sentence should be modified.
* As the focus of the article is actually not Linked Data, but only relations within one data source, the authors might think about reducing the introduction part about Linked Data and instead speaking more about information extraction from text. The text does not state so far that unstructured text documents are needed from which statements are extracted and that there are numerous existing approaches for relation extraction (with and without a grounding in RDF). Furthermore, mentioning "Linked Data entity sets" might lead to the misconception that the entity set for subject and object of a relation might come from different data sources.
* The example :India batsman :Sachin_Tendulkar, as mentioned as output of DART, is inaccurate in the sense that it is not the country that should stand in the subject position, but the Indian cricket team. The authors might either need to change :India to :Indian_Cricket_Team, or write it as :Sachin_Tendulkar :playsFor :India. :Sachin_Tendulkar :worksAs :batsman.

== 2. Related works ==
* In total, this section lacks pointers to important related works regarding
** OpenIE tools (ReVerb, NELL, OLLIE, WOE, ClausIE, etc.)
** methods for relation extraction in the Knowledge Base-context (see, for instance, the works [1-2], which are very similar to the presented approach in the article)
** prominent tasks and tracks such as the TAC Knowledge Base Population (KBP) track and the TREC Knowledge Base Acceleration (KBA) track.
* The section contains a relatively long description of the ReVerb approach. Such a long description is not necessary (Why was ReVerb chosen among the OpenIE tools?), unless the approach has significant similarities with the approach presented later. In that case, the differences need to be described.

== 3. DART - The proposed solution ==
=== 3.1 Preprocessing ===
* What are "collections of data"? Obviously no literal triples.
* A concrete example for all steps of DART (here of the class pairs and of the entity set cross-product) might be very helpful for the readers.
* Regarding the choice of taking exactly 25% of the cross product: What were the constraints regarding the complexity and the quality of the relations? I would assume that this value depends on the actual use case.

=== 3.2 Pattern-Discovery phase ===
==== 3.2.1 Extraction of patterns ====
* Constructing patterns (taking the words between two entity labels and counting their frequency) is nothing new and not very sophisticated. Related work to mention here is, for instance, the BOA framework [3]; a minimal sketch of this generic approach is given after this list.
* The authors need to state whether "total number of patterns" refers to the total number of unique patterns or not. Taking 500 as the threshold seems quite high, as not all relations might occur so often.
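For illustration only, here is a minimal sketch of the generic "infix between two entity mentions" pattern harvest used by BOA-like approaches; the sentence handling and entity matching are deliberately naive, and the paper's exact procedure may differ:

import re
from collections import Counter

def harvest_patterns(sentences, subj_label, obj_label):
    """Count the token sequences occurring between the two entity labels."""
    counts = Counter()
    for sent in sentences:
        s = sent.lower()
        i = s.find(subj_label.lower())
        j = s.find(obj_label.lower())
        if i == -1 or j == -1:
            continue
        # take the text between the two mentions, in either order
        start, end = (i + len(subj_label), j) if i < j else (j + len(obj_label), i)
        infix = re.sub(r"\s+", " ", s[start:end]).strip()
        if 0 < len(infix.split()) <= 10:   # illustrative length limit on the phrase
            counts[infix] += 1
    return counts

sents = ["The Ganga flows through Uttar Pradesh before reaching Bihar.",
         "Uttar Pradesh is irrigated by the Ganga."]
print(harvest_patterns(sents, "Ganga", "Uttar Pradesh"))
# Counter({'flows through': 1, 'is irrigated by the': 1})

Whether such infixes are counted per unique pattern or per occurrence is exactly what needs to be stated in the article.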

==== 3.2.2 Clustering using paraphrase detection ====
* It remains unclear how LESK was adopted for the presented work: Are Part-of-Speech tags still used in the approach?

==== 3.2.3 Representative pattern selection ====
* It is not stated in the article why each cluster needs a representative pattern. Is only the representative pattern of each cluster used later for statement extraction?
* Whether the representatives of the clusters are valid is apparently not evaluated. Hence, it is unclear how the single steps of DART perform.

==== 3.2.4 Keyword extraction and Prospective relation formation ====
* My suggestion would be to rephrase "keyword" to "relational phrase".
* Are sentence borders in the relational phrases considered? This is not stated, but has an effect on the results.
* Also the performance of the Datamuse API is not evaluated. Is it an issue when a class name is combined with a word?
* "identifying a few examples under each category" suggests that this method was performed not very systematically: There might be assessments which were accidentally wrong. Can the authors elaborate on this assessment?
* Table 1: The assignment of the categories which are part of a relation and which are not does not seem intuitive and might depend on the considered use case. For instance, why is noun.person included but not noun.location?
* It remains unclear whether all relational phrases (of the clusters) are used or only the representatives. An example of a keyword might be helpful.

=== 3.3 Triple-Finding phase ===
* When concatenating the entity in D1 with the relational phrase, the phrase found in the text might slightly vary from this concatenated string, e.g., due to additional words. Please explain to what extent this aspect can be ignored.
* "phr contains rel" in Algorithm 2 is underspecified. Please indicate how this is done; two plausible readings are sketched below. Also indicate how the representatives of the clusters are used.

== 4. Experiments and results ==
* The descriptions of the experimental results in Sections 4.1.1 - 4.1.3 are quite similar, so a single section about all experimental results might be sufficient.

=== 4.1.1 Billionaires and Companies ===
* The considered entity sets for this experiment and the following experiments are quite small. Does the approach also scale well if we have thousands or millions of entities per entity set?
* Why don't the authors extract all triples given the entity sets?
* Why are the classes taken from YAGO, but the relations from DBpedia? DBpedia itself might be sufficient to consider.
* "Due to time and resource constraints, the recall value could not be calculated in a similar manner." This is elusive. Recall would be relevant to report in this journal article and is feasible to measure to some extent.
* The gold standard is not very comprehensible. Where does the ground truth come from? Please provide the text documents or at least information about the used texts (such as URLs).
* Regarding Table 2: The results are not very meaningful: Only 22 triples are in the ground truth and only 34 triples were extracted. Does Table 2 contain only cluster representatives?
* The article talks about enriching existing LOD data sets (Knowledge Graphs) with new relations, but does not consider the linking of the relations to existing Knowledge Base relations. Please justify.

=== 4.1.2 Rivers and States ===
* The authors argue that more fine grained relations could be found with the DART approach compared to the existing DBpedia Knowledge Base. However, it should be noted that Knowledge Graphs such as DBpedia are not designed to provide too specific relations, i.e., the existing statements are often reasonable.

=== 4.1.3 Cricketers and countries ===
* Instead of dbr:England, also dbr:United_Kingdom or dbr:Great_Britain might be relevant.

=== 4.1.4 Comparison with ReVerb ===
* Many of the arbitrary relations extracted might already be covered in the Knowledge Graphs on the schema level or be irrelevant for the Knowledge Graphs. How do you tackle that issue?
* The sentence "Table 6 gives ..." is hard to understand, please rephrase.
* There are many other OpenIE systems which can be used for a comparison. Why did the authors choose ReVerb?
* The article does not consider how the relational phrases (patterns) are aligned to RDF relations. Is this planned to be done manually? Is it an issue that similar and related relations were detected (e.g., founded, founder, owned, ...)? A comparison with Dutta et al. [1,2] might be more appropriate as KB Linking is also an essential part in RDF statement extraction.

== 4.2 Application - finding missing links in the LOD ==
=== 4.2.1 Dance-forms and States ===
* Please consider that DancesOfIndia and StatesAndTerritoriesOfIndia are YAGO classes, while you might have considered only relations of the DBpedia ontology (this is not stated in the article). Please indicate which YAGO version you are using.
* The authors come up with the relations "popular in; popular folkdance of; ...", but do not describe whether those relations are the result of some filtering method, whether they are the representatives of the relation clusters, and whether methods were implemented which consider the ambiguity of relations (relations between entities of fixed entity types might still have various meanings).
* As the authors do not provide data about the ground truth creation (abstracts on the website), the creation of the ground truth is hard to track.

=== 4.2.2 Allergenic foods and Diseases ===
* Given the provided, quite limited data on the website, it is unclear to what extent the data (i.e., statements about foods and diseases) is useful for real-world applications. An idea would be to attach more context to those statements, indicating their reliability and accuracy (e.g., provenance information); a minimal sketch of this idea follows below. An evaluation of whether the extracted statements are identical to the statements in the retrieved text might be good.
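As an illustration of that idea, an extracted statement could be reified and annotated with its source and an extraction confidence, e.g. with rdflib; all names in the ex: namespace below are hypothetical, not part of the paper:

from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

stmt = EX.stmt1                       # one extracted statement, reified
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Peanut))
g.add((stmt, RDF.predicate, EX.triggers))
g.add((stmt, RDF.object, EX.Anaphylaxis))
# provenance context: source document and extraction confidence
g.add((stmt, EX.extractedFrom, URIRef("http://example.org/source-article")))
g.add((stmt, EX.confidence, Literal(0.72)))

print(g.serialize(format="turtle"))

Such context would let applications judge the reliability of each extracted food-disease statement.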

== 4.3 Discussion ==
* Various items of the listing in this section are not surprising findings. To name a few:
** The first item apparently indicates that there are many more potential statements "out there" which are missing in the respective Knowledge Graph. This is not surprising, and concrete numbers are missing.
** The fact that the authors only check those objects which are given for the relation and for the subject seems relatively trivial.
** The listing of the prospective relations which were gathered also via noun categories indicates to me that simply removing nouns per se is not advisable; instead, exploiting the associations with the relational phrases seems to be the better option. Is this correct?

== 5. Conclusions ==
* As indicated above, instead of non-grounded statements (retrieved via OpenIE), it might be more reasonable to directly extract RDF triples out of text (see, for instance, existing approaches such as [4-5]).

In summary, the presented article tackles an interesting research problem, but even if the authors respond to all comments, I fear that the research contributions of this work are not enough for a journal publication.

[1] Arnab Dutta, Christian Meilicke, Heiner Stuckenschmidt. Semantifying Triples from Open Information Extraction Systems. Frontiers in Artificial Intelligence and Applications, 2014.
[2] Arnab Dutta, Christian Meilicke, Heiner Stuckenschmidt. Enriching Structured Knowledge with Open Information. WWW 2015.
[3] http://aksw.org/Projects/BOA.html
[4] Isabelle Augenstein, Sebastian Padó, Sebastian Rudolph. LODifier: Generating Linked Data from Unstructured Text. ESWC 2012.
[5] Michael Färber, Achim Rettinger, Andreas Harth: Towards Monitoring of Novel Statements in the News. ESWC 2016.

Review #2
Anonymous submitted on 22/Nov/2016
Suggestion:
Major Revision
Review Comment:

The paper proposes an unsupervised method to identify arbitrary relations between two entity sets of Linked Data by exploiting information extracted from the text.
In particular, the method relies on a search engine in order to find candidate sentences containing both entities and tries to find a linguistic pattern that links the entities. Then, patterns are clustered, filtered and a keyword extraction method is exploited to extract possible relations.

The method is innovative and interesting for two reasons: 1) it is able to find arbitrary relations taking into account two sets of entities from two different linked data sources; 2) it is completely unsupervised.
The paper is generally well written, but I suggest having the English checked carefully by a native speaker.

However, there are some issues:
1) The method is completely unsupervised, but it relies on a large number of heuristics and parameters. Finding the correct values for these parameters could be a problem, and these parameters can change according to the domain;
2) The similarity is computed by Lesk. There are more effective measures. The Lesk measure reports the worst performance in the paper that you cited ([18]). Moreover, knowledge-based measures (like Lesk) could have low coverage due to the underlying knowledge base, which is usually WordNet.
I suggest investigating corpus-based measures or measures based on word embeddings (a minimal sketch of the latter is given after this list).
3) As for the previous point, your method is heavily based on WordNet. WordNet might not contain some relevant terms.
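As a concrete illustration of the embedding-based alternative suggested in point 2, phrase similarity can be computed as the cosine of averaged word vectors; the toy vectors below merely stand in for pretrained embeddings (e.g. word2vec or GloVe):

import numpy as np

toy_vectors = {                 # hypothetical 3-d vectors, for illustration only
    "flows":   np.array([0.9, 0.1, 0.0]),
    "through": np.array([0.2, 0.8, 0.1]),
    "runs":    np.array([0.8, 0.2, 0.1]),
    "across":  np.array([0.3, 0.7, 0.2]),
}

def phrase_vector(phrase):
    # average the vectors of the in-vocabulary words of the phrase
    vecs = [toy_vectors[w] for w in phrase.split() if w in toy_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def phrase_similarity(p1, p2):
    v1, v2 = phrase_vector(p1), phrase_vector(p2)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0

print(phrase_similarity("flows through", "runs across"))   # close to 1.0

With real pretrained vectors, coverage of terms missing from WordNet is typically much better, which would also address point 3.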

Another weak point concerns the evaluation. In particular, there is no solid baseline approach. ReVerb, which is the tool that you adopted as the baseline, was developed for a goal different from the focus of your evaluation.
The evaluation on the "Allergenic foods and diseases" dataset is useless since you cannot provide both Precision and Recall.
Maybe a more accurate error analysis should be reported. For example, on the "Cricketers and countries" dataset you report a low Precision and a good Recall: maybe your method introduces false positive relations. Please provide more details about this.

Minor issues:
- Wordnet -> WordNet
- world wide web -> World Wide Web
- 10000 -> 10,000. Please, use the ',' to separate thousands
- "two phases Pattern-Discovery and Triple-Finding" -> "two phases: Pattern-Discovery and Triple-Finding"
- Please, use the mathematical environment for referring to variables and parameters in the text
- Your keyword extraction method tries to map nouns to categories by using WordNet noun categories. How does your algorithm determine the category when the noun does not occur in WordNet?

Review #3
By Andrea Giovanni Nuzzolese submitted on 31/Jan/2017
Suggestion:
Reject
Review Comment:

The paper presents DART, an unsupervised solution to enrich the Linked Open Data cloud with new relations. The relations are generated starting from two linked datasets by applying a two-step approach consisting of:
* a pattern-discovery phase aimed at detecting prospective relations;
* a triple generation phase, which relies on the previous phase by applying prospective relations in order to generate new triples for a given pair of entities.
Both the pattern-discovery and the triple generation phases are built on top of two natural language processing workflows. In fact, (i) patterns are discovered from paraphrases detected from sample sentences that contain entities of target types and are gathered from a search engine, and (ii) triples are generated by analysing the sentences gathered in a similar fashion from a search engine and matched against detected patterns.

The paper is in general well written and focuses on a challenging topic, which is relevant to the journal.
Nevertheless, the paper shows a number of weaknesses that, in my opinion, prevent its publication in its current form. Namely, those weaknesses involve the lack of clarity in many parts of the paper, the significance of results and the state of the art.

=== Lack of clarity
The level of clarity is not appropriate for a paper submitted as a full research paper.
This strongly affects the quality of the paper and the reproducibility of results.
In fact, the authors omit a lot of details that are worth providing. For example:
* the authors do not provide any evidence that supports the finding “The marginal utility gained by taking a sample size bigger that 25% of the cross product was very less” (i.e. end of Section 3.1);
* it is not clear what is the rationale used for defining the values of N (i.e. the number of web results returned by the search engine) and k (i.e. the number of patterns selected) in Section 3.2.1. N and k are empirically set to 100 and 25%, respectively. However, the authors do not provide any description of how those values were obtained;
* in both the pattern discovery and triple generation phases the algorithms make use of a search engine. What is this search engine?
* the maxSim function should be provided in a more formal way;
* the rationale behind the use of part-of relations for extracting keywords is not provided. Why did the authors focus on part-of relations? Additionally, it is unclear how the classification of part-of relations from the categories returned by WordNet was carried out;
* the value M=60 (cf. Section 3.3) is asserted to be empirically chosen, but no empirical data is provided;
* the authors explain that their approach prevents the generation of triples when multiple entities from D2 (cf. Section 3.3) co-occur in a sentence. However, this design choice seems to introduce a strong limitation to DART, as co-occurrences might happen quite frequently in a target sentence. This part should be better discussed and strengthened further.

=== Significance of results
The evaluation was performed on relatively small datasets consisting of few sample classes.
It is unclear how the comparison between DBpedia properties and the relations produced by DART was computed in order to record precision and recall. For example, Table 2 shows properties with different lexicalisations that require some matching in order to be compared.
Additionally, the authors argue that they identified the value for length limit used for relation phrases (cf. Section 3.3) empirically. This value is set to 10, but no empirical evidence is provided.
The comparison between ReVerb and DART shows that ReVerb generated 96 relations, out of which 11 were correct with respect to the ground truth. However, it would be interesting to analyse the relations returned by DART and ReVerb in terms of correctness with respect to the context of the sentences they were extracted from. For example, how many out of the 96 relations generated by ReVerb could really be extracted from the text?

=== Related work
The related work should be significantly reworked in order to provide a more detailed comparison between solutions from the state of the art and DART.
Additionally, an important missing reference that tackles a similar problem is Legalo [1]. This work should be cited.

[1] V. Presutti, A. G. Nuzzolese, S. Consoli, A. Gangemi, and D. Reforgiato Recupero. From hyperlinks to Semantic Web properties using Open Knowledge Extraction. Semantic Web Journal 7(4): 351-378 (2016). DOI: 10.3233/SW-160221


Comments

The authors have proposed a system called DART to generate arbitrary relationships between instances of two classes taken from two different datasets. The approach presented in the paper is novel and the results are appealing. The system utilizes a number of existing APIs elegantly to improve the accuracy of the results.

The paper is well organized and has few readability issues.

Minor changes:
Works such as ([22], [21]), [29], ([31], [24], [23] and [25]) and [16]..
I would write: Works such as [21,22,29,31,...] (that is, \cite{a,b,c}). You may follow this template throughout the paper.

...a few other works such as [30], [13], and [14] *propose using* various semi-supervised approaches for the same // rewrite: such as [13,14,30] suggested various..

The wording *relation* in Section-2 para-2 is confusing. You could simply use the ontology terminology for a binary relationship: "predicate". // Suggestion

The other *aspect-detecting* arbitrary new relations// not clear

Italicize the key terminologies when first introduced. Say *arbitrary relations*

Section 3.2.1 should be explained using an example. On reading this section, a reader would require clarity on how would a pattern look like.

The top N (N=100, chosen empirically) web results are obtained.// Is there any significance in using a particular search engine? Since you are taking the top 100 results, using a different search engine (which typically uses a specific page-ranking algorithm) may affect your results.

equation (1)// Equation-1

In Eq. 1, Similarity of patterns, the patterns are represented using T1 and T2, and it also says w \in T1 and w \in T2. What is implied by T1 and T2? A set, a sentence, a list or a tuple?

Assuming the pattern is of the form (a,b,c), consider that you are giving the following two pairs as input to sim():
(MJackson,famous for,popmusic)
(MJackson,king of, popmusic)

According to the equation, the average of two sums will be considered for clustering: the sum, over the terms of one pattern, of each term's mean similarity to all the terms of the other pattern (e.g., the mean similarity of MJackson to all the terms in *MJackson, king of, popmusic*, the mean similarity of *famous for* to all those terms, and so on), and the analogous sum in the other direction. It is probable that, as the *MJackson, popmusic* portion is the same for both patterns, a comparison of (the words) MJackson to popmusic and popmusic to MJackson may diminish the similarity score.// There is some confusion here as to whether T1 and T2 include *MJackson, popmusic* or whether T1 and T2 are just *famous for* and *king of*, respectively. Formally define T1, T2, and the so-called *patterns*. You may also include an example for added clarity.
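To make the ambiguity concrete, here is one possible formal reading of Equation 1 as paraphrased above -- not the authors' definition, just the interpretation being questioned; word_sim stands for whatever word-level measure (e.g. Lesk) the paper actually uses:

def pattern_similarity(t1, t2, word_sim):
    def directed(a, b):
        # sum over the words of `a` of their mean similarity to all words of `b`
        return sum(sum(word_sim(w, v) for v in b) / len(b) for w in a)
    return (directed(t1, t2) + directed(t2, t1)) / 2.0

# toy word similarity: 1 for identical words, 0 otherwise (illustration only)
toy_sim = lambda w, v: 1.0 if w == v else 0.0

t1 = ["MJackson", "famous", "for", "popmusic"]
t2 = ["MJackson", "king", "of", "popmusic"]
print(pattern_similarity(t1, t2, toy_sim))
# 0.5 -- the low cross-comparisons (e.g. "MJackson" vs. "popmusic") pull the
# per-word means down, which is exactly the concern raised above.

Formally defining T1 and T2 would resolve whether the entity labels belong to the patterns at all.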

Following our algorithm, by appending the prepositions/class names/qualified nouns to this keyword, the prospective relation formed is: flows through.// More clarity is required. Again, an example will be useful.

Section 4.3
..we check if each LHS entity connects through a prospective relation to any RHS entity and repeat the same from other side, thus making it a linear process.// It is not clear how this becomes a linear process.

It would be interesting if you could briefly mention the time taken for each of the experiments -- I am aware that the performance of DART is not the prime focus.

Considering the overall scope of the work (i.e., enrichment of the LOD), a comparison check should be done while introducing new relationships, to avoid duplication of predicates in the LOD. Also, the context of a newly introduced predicate needs to be verified so that it is in sync with the entire LOD. Some of these issues could be looked at as possible extensions to be considered in the future.