Genetic-Fuzzy Programming Based Linkage Rule Miner (GFPLR-Miner) For Entity Linking In Semantic Web

Tracking #: 1663-2875

Amit Singh
Aditi Sharan

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
Semantic Web Data Sources follow Linked Data principles to facilitate efficient information retrieval and knowledge sharing. These data sources may provide complementary, overlapping or contradicting information. In order to integrate these data sources, we perform Entity Linking. Entity Linking is an important task of identifying and linking entities across data sources that refer to the same real-world entities. In this work, we have proposed a Genetic Fuzzy approach to learn linkage rules for Entity Linking. This method is domain independent, automatic and scalable. Our approach uses fuzzy logic to adapt mutation and crossover rates of Genetic Programming to ensure guided convergence. Our experimental evaluation demonstrates that our approach is competitive and make significant improvements over state of the art methods.
Full PDF Version: 


Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 26/Jul/2017
Review Comment:

The paper concerns a method to learn linkage rules based on a genetic programming approach. The proposed method is endowed with a mechanism to adapt some parameters involved in the genetic procedure by resorting to a fuzzy logic controller. A number of experimental evaluations should prove the suitability of the authors’ proposal. However, my general impression of the paper is not positive, as I am going to argue with reference to the main directions suggested for reviewers.
As concerning originality, the authors’ proposal is grounded in a basic idea which is not new. In fact, some other approaches in literature employ genetic programming for tackling the entity-linking problem. In particular, the authors admittedly adopt a setting presented by some other researchers both to formalize the problem and to build up the basics of the genetic procedure (most notably, individual representation). However, some excerpts of the paper appear to be very similar to some material published in other papers (such as [16], [26]), with the single exception of being less accurate both in style and in presentation. See, for instance, the definitions in Section 3 and Figures 3-4: the latter, in particular, are simply a rough copy of some figures reported in [16] and [26].
The issue related to a dynamic adaptation of crossover and mutation rates is not a novel one in the context of evolutionary computation. Here, the authors introduce an original approach based on the employment of fuzzy logic, yet I will discuss later some issues related to their proposal.
The authors state also that a major contribution of theirs consists in the use of double tournament selection as a rule pruning strategy to avoid the bloating effect while applying their genetic programming technique. I should disagree on this point since such an approach has been already proposed by Luke and Panait in their 2002 paper titled “Fighting Bloat With Nonparametric Parsimony Pressure” (the paper can be found on Volume 2439 of Lecture Notes in Computer Science and it has not been included in the reference list).
Finally, scarce originality can be found also in the organization of the experimental session: the authors renounced both to design a precise experimental set-up and to perform a plain comparison with other approaches proposed in literature: apparently, they limited themselves to simply report some scattered results retrieved in literature in an incoherent fashion (further details on the results discussion will follow later).
As concerning significance of results, several issues may be raised. First of all, I would like to express some perplexity on the basic idea to resort to genetic programming for tackling the entity linking problem. As stated before, I am aware that this kind of approach has been previously proposed in literature. However, the present paper lacks the necessary characterization to express the suitability of genetic programming in this context. See, for instance, the quick description provided in Section 4.1: genetic programming is described in a way which is indistinguishable from classic genetic algorithms. Conversely, genetic programming has a sharp characterization since it must be intended as the specific evolutionary paradigm oriented to explore the space of programs. In other words, population in genetic programming should be composed of executable programs: such a key-point is totally neglected by the authors (as an afterthought, they only refer to the necessity of representing individuals in form of trees). The reader is introduced to the employment of this particular technique without any kind of justification and genetic programming is viewed simply as a mean totally detached from its inherent significance: I am convinced that some explanatory words would have been beneficial in this sense.
A second point concerns preliminaries in Section 3 that are essential to fully grasp the entity linking problem. Unfortunately, they have been overlooked by the authors. The definitions reported at the start of Section 3, for example, are somewhat misleading. The first one (Entity Linking) introduces a relation R which is characterized as relating all entities representing the same real-world object. As a first note, R is scarcely characterized and maybe it could be better expressed on a formal plane as a subset of the Cartesian product S x T, instead of introducing an additional set M. Moreover, some confusion is introduced by the following definition (Linkage Rule): it appears to be related to the previous one, but such a relationship is not clearly expressed. In fact, provided that "Entity linking" has been vaguely expressed as the task to find the subset M, here the authors state that "Entity linking decision depends on the similarity score returned by the Linkage rule L". However, no formal relationship is reported between R and L. Please note also that while going through this couple of definitions, two different expressions of the same set M are offered to the reader. Finally, in the third definition (Reference Links) the concepts of "true matches" and "true non-matches" are introduced without being properly explained: which kind of subsets are referred to by the authors? How do they relate to the set M and the function L introduced previously? How do they "help in linkage rule generation"? Maybe a better formalization of all the three definitions would be beneficial for a better comprehension of such crucial key-concepts.
The same kind of careless description affects also the following sections. In section 4, the reader is told that the search space consists of similarity functions, similarity thresholds, transformation functions and aggregations functions. However, all those concepts are never properly introduced by the authors and a list of all those functions is neither provided. In this way, it is very hard to follow the subsequent description devoted to individual representation (which is quite obscure indeed). Some further difficulty is added also by the formal description of the operators. Property operator is supposed to operate on the entity property, yet it is simply expressed as F_P (how can be established the relationship with the specific property? Maybe a subscript k = 1,…,n is needed). The same goes for the Field Comparison Operator which is simply expressed by F_FC (while it is provided with the subscript in the following description of F_AGG, where F_FC1, F_FC2, …, F_FCk are involved - why the sequence stops at F_FCk instead of F_FCn? Moreover, why the ubiquitous use of “F”? Wouldn’t it be simpler to call the operators “P”, “T”, “FC”, and “AGG” with their related subscript?). Most importantly, how does the Field Comparison Operator provide a value (score) given a couple of similar properties? Such a fundamental question, concerning the rationale of the operator, remains unsolved through the paper. In a similar way, the reader is informed that the Field Transformation Operator is useful to make data consistent for comparison, but this transformation is not explicated, as well as the reason to perform such a transformation process. In this sense, the example reported at the end of Section 4.1.1 is a missed chance for clarifications: it is too poorly described for being illustrative enough (what is “Numeric Comparator”?). At the end, the reader is still wondering how the similarity score is evaluated and how individuals are materially represented.
When we come to the description of genetic operators, the overall comprehension is compromised by the previous issues. Yet a number of problems affect also this specific section which starts with cloning. Such a mechanism is mentioned in this one and only place through the paper, without specifying if it is part of the overall strategy proposed by the authors. The description of some operators (such as “Transformation crossover operator” and “Combine operators’ crossover operators”) are totally unclear to me. Even if their introduction has been highlighted as one of the major contributions of the work, their discussion has been quite disregarded. The term “distance” (never mentioned or defined before) plays a relevant role in this section, but its characterization is obscure (the same happens for some other concepts: for example, what are the “data transformation paths”?). Even if I was not able to understand the description of these operators, I still notice that they have been introduced without any regard to their application in a genetic programming context: this is in line with my previously expressed perplexity on that issue.
In Section 4.2 the double tournament selection is mentioned, once again without descending into details. In general, tournament selection should be characterized at least in terms of: i) tournament size; ii) deterministic/non-deterministic tournament. Of course, the double round of selection would have deserved some more details too.
In Section 5 there are some minor issues which I am going to mention very briefly:
Figure 6 can be hardly intended as a description of the main parts of a fuzzy system (as asserted by the authors)
“Standard deviation of population” is quite nonsensical and I believe it should be “standard deviation of the population fitness”
The fitness function, previously expressed as “F”, is indicated here as “f”
The formalization of standard deviation (top of page 10) is erroneous: in the square brackets you should write “f_k(t) - f_avg(t)”
The normalization of the variations of p_c(t) and p_m(t) (in ranges [-0.1, 0.1] and [-0.01, 0.01], respectively) is mentioned without explanation
The main concern in this section, however, is related to the fuzzy controller and its fuzzy rule base. In practice, Tables 2-3 embed the fuzzy rules which should reflect the heuristic described in the text. Unfortunately, such heuristic is described in a very hasty fashion and a number of issues are unclear. First of all, the indices employed as input values for the fuzzy controller (that are the variation of average fitness and the variation of the standard deviation) are questionable. In some other works, different criteria are adopted. For example, in the paper
M. Srinivas, and L. M. Patnaik, Adaptive Probabilities of Crossover and Mutation in Genetic Algorithms, IEEE Transactions on Systems, Man and Cybernetics, 24(4), 1994
the difference f_max - f_avg (with f_max being the maximum fitness value of the population) is adopted as a parameter for detecting convergence in genetic algorithms. Conversely, the heuristic proposed by the authors evaluates how “fast” the evolution process is in terms of the variation of the average fitness, and how “diverse” the population is in terms of the variation of the standard deviation. That leaves room for the question: how to handle negative variations? In principle, a “fast” evolution process should be correlated to the magnitude of the average fitness variation, regardless of its sign (the same goes for the “diversification” of the population and the magnitude of the standard deviation variation). Conversely, the authors overlooked this problem and large negative values have been apparently confused with reduced amounts of variations.
Additionally, how are the rules crafted? It is not clear the way the different labels “Positive large”, “Zero”, “Negative medium”, etc. have been positioned in the tables. The use of fuzzy logic should be correlated to the compilation of knowledge expressed in a comprehensible form, possibly deriving from the expertise of humans. It would be interesting, in this sense, to learn how to treat some special cases related, for instance, to the “Zero” value or to the diverging sides of the heuristic (i.e., when the crossover/mutation rate should be increased and decreased at the same time, according to the adopted indices). Moreover, the reader would expect Tables 2-3 to be identical, since they are identically described in terms of the discussed heuristic (actually, there is a typo at the end of the description, where “increase crossover rate” should be “increase mutation rate”). Conversely, the tables slightly differentiate just in the last three columns and no justification is provided for such a mismatch.
Finally, another relevant issue has been apparently neglected by the authors. Since crossover/mutation rate has to be increased whenever evolution process is “slow”, population could never converge to a global optimum. This problem is known in literature and it has been variously tackled (as an example, the previously mentioned work by Srinivas and Patnaik can be referred to). All these observations undermine the specific approach for fuzzy adaptation proposed by the authors.
Finally, significance of results can be scarcely appreciated while reading the final section which is quite disappointing. The discussion of the experimental results essentially consists in an iteration of almost identical sentences that have been copied and pasted to comment all the attached tables and figures. The discussion is uninformative and a number of questions remain unresolved. I would like to specify only a few of them:
The authors mention a “cross-validation” performed to test their GFPLR-Miner approach. What kind of cross-validation has been performed? What is its proper setting?
Is the cross-validation approach adopted to test also the alternative approaches proposed in literature? That should be mandatory for a fair comparison.
More generally, what is the origin of the performance results reported for the alternative approaches? It appears that the authors, rather than framing the comparison in a structured experimental setting designed for their work, simply reported some scattered results coming from literature.
Tables 7-14 report a number of rows corresponding to different iteration values. What is the rationale behind this setting? Why is such a range of iterations limited to the GPLR/GFPLR approaches?
What is the meaning of “F1” reported at the top of Tables 7-14?
What can be argued from the plots reporting average fitness/best fitness values related to GPLR and GFPLR? The illustrated trends appear to be very similar (the authors report a quick note about a different curve smoothness that I would not agree upon, at least not for every plot).
Besides the poor discussion, there are a couple of issues which are quite critical. First of all the comparison appears to be extemporized, with different settings chosen almost randomly for different groups of datasets. In this way, the reported comparisons, far from being properly designed, are offered to the reader in a heterogenous way: sometimes results are compared with other genetic programming mechanisms, sometimes results are compared with some other completely different approaches. On the top of that, the authors did not provide any comparison with some other evolutionary methods implementing adaptive attuning of crossover and mutation rates: that would have been interesting for assessing their mechanism of fuzzy adaptation.
Moreover, comparison is performed only in terms of performance results (with the additional problem of a cross-validation which is obscure both in its settings and in its scope of application, as highlighted before). In evolutionary search, instead, computational cost is a crucial issue and no hints are offered on that side.
Finally - maybe, most importantly -, I should disagree with the authors when they claim that GFPLR-Miner “gives an impressive performance” and “outperforms many state of the art systems”. The reported results, in fact, should be read of course in terms of the obtained performance on the validation set. In this sense, if we refer to Tables 7-14 we note that GFPLR-Miner always provides results worse than competitors for datasets D1, D2, D3, D4, D5. In the remaining cases (D6, D7, D8) GFPLR-Miner greatly outperforms its competitors. However in those cases the competitors are heterogenous methods (no evolutionary approaches are involved in such comparisons) and show performance results that are inexplicably low: some more detailed comments would be beneficial to present a plain outline of the overall experimental setting.
All things considered, therefore, the presented results cannot be regarded as fully acceptable or satisfactory.
As concerning quality of writing, the paper presents a number of flaws (I am not considering the problems with the mathematical formalization discussed before). Even if such flaws do not affect dramatically the overall comprehension, they do hamper the reading process. As a first consideration, the authors make erratic use of capital letters, which is quite annoying. Punctuation should be also revised: several commas are misplaced and “:” should be used in place of “-”. Some sentences appear to be repetitions of something already asserted: see, for instance, the text paragraph in the introduction starting with “Entity linking in data sources is a non-trivial problem…”. In several cases the verb “to adapt” has been misspelled as “to adopt” and the acronym “GFPLR” has been written as “FGPLR”. Some sentences should be rephrased for the sake of clarity. See, for instance, the first sentence of Section 3.1; the descriptions involved in Section 4.1.2; the last sentence of Section 5 (before Section 5.1). Mathematical symbols in the text are almost everywhere erroneously reported, with subscripts becoming scripts (e.g., “p1” is written in place of “p_1”). Most importantly, several tables and figures are not mentioned in the text. Among them, the flowchart in Figure 11 is somewhat troublesome: the block “Measure the individual fitness” is isolated, as well as the offspring fitness in the Fuzzy Logic Controller. Also, the stopping criterion is not coherent with the one reported in the text. Figures 3-4 would have deserved better care in their description and a suitable discussion in the text. Finally, several bibliographical references are incomplete or erroneously reported.
For all the above considerations, I regret to consider the paper not suitable for publication and I would suggest its rejection.

Review #2
Anonymous submitted on 21/Aug/2017
Major Revision
Review Comment:

This paper proposes a genetic-fuzzy programming approach (GFPLR) for the entity linking problem (automatically finding equivalence relations between entities from different Linked Data sources). While there are some previous approaches based on genetic programming (GPLR), the use of genetic-fuzzy programming is novel to the best of my knowledge. The approach is evaluated using 8 datasets and compared with (i) a genetic programming approach without the fuzzy component and (ii) 6 existing software tools solving the entity linking problem.

This paper is original to my knowledge and the topic is relevant for the journal audience. However, it has some important problems (related to significance of the results and quality of writing) that make think that is not ready for publication at least at its current stage.

1. While the experiment results show that GFPLR outperforms GPLR in terms of f-measure and convergence speed, there is not evidence that it outperforms some of the existing systems. The authors claim that their approach "gives an impressive performance as compared to many other states of the art systems", gives "comparable performance to other state of the art systems for datasets D3-D4" and "out performs (sic) many state of the art systems" on the other datasets D1-D2 and D5-D8. The problem is that the authors consider the results with the training data instead of the results with the validation data. By taking validation data, the novel approach gives worse results than Raven (D1-D2), GenLink (D1-D2), Carvalho et al (D2), SemiBoost (D2-D4), FEBRL (D5) and MARLIN (D5). Note in particular that the novel approach is not comparable to existing approaches on D3-D4; it is significantly worse. Thus, a much careful analysis is needed.

2. It is not clear if each experiment has been repeated several times to avoid the effects of randomness. According to the text, it does not seem to be the case. The statistical analysis should take into account the best results in each of the repetitions, not the results in each generation of a single experiment because only the best result is relevant.

3. Box plot also need to be explained in more detail, and the comparison between GFPLR and GPLR also needs more care. Consider for example Figure 25. According to the text, "from the box plot, we can clearly see that there exists a significant performance difference between these two approaches". Actually, if one restricts to the three higher quartiles GPLR seems a better choice, so how one does choose between different box plots?

4. The claim that GFPLR "ensure fast convergence of proposed model" is not proved theoretically; this seems to be the case in the 8 datasets but what about the general case? Can one ensure that it will converge always?

5. The approach is not evaluated in terms of running time. It could be the case that small differences in the f-measure require a big cost in terms of time.

6. One of the challenges mentioned in Section 3.1 is obtaining "domain-independent techniques" but it is not clear if adapting this approach to other domains is easy. Fuzzy rules in Tables 2-3 have built by "simple heuristic" and I guess that so have been the membership functions in Figures 7-9, but are these fuzzy membership functions and rules globally valid? They have not been explicitly validated.

7. Other defuzzification techniques or fuzzy operators different than max-min have not been considered. This needs to be discussed at the very least.

8. Although writing is acceptable in general, the text requires a careful revision. Firstly, there are several repetitions, such as the definition of f-measure (pages 8 and 15), the number of positive and negative reference links (Sections 6.2.1 and 6.2.3 and Table 7), or in references [14], [17], [29]. Secondly, several sentences are not in the right place, such as sentences 3-4 in Section 6.1.2 ("parameters") that do not describe parameters, or descriptions of the dataset that should be moved to Section 6.1.3 (sentence 2 in Section 6.2.1, sentences 1-3 in Section 6.2.2...). Thirdly, there is not enough background on fuzzy logic, for example it is never explained what a degree of membership, so the last paragraph of Section 5.1.1 could be hard to understand by the average reader.

9. Citations are also not ordered several times in the paper. Sometimes they authors use present perfect tense when they should use simple present tense or future, such as in "we have proposed" (page 2) or "we have shown" (page 16). Bibliography needs a revision to unify the conference acronyms and, more generally, the citation of conference papers.

10. There are too many figures: consider joining Figures for GPLR and GFPLR (for example, Figures 13 and 14) in a single one.

Minor comments:

* p. 5. "genetic programming has only been applied by [25], [26] and [16]". Citations here should be "[16], [17], [25] and [26]".

* p. 6. "to make data consistent for comparison": please clarify

* p. 9. "section-" -> "section."

* p. 10. "such as" -> "namely"

* p. 10. "mitigate uncertainty" -> approximate reasoning deals with imprecision and vagueness, not really with uncertainty

* p. 12. "iv. Apply membership function" -> "iv. Apply defuzzification"

Review #3
Anonymous submitted on 22/Aug/2017
Review Comment:

The manuscript proposes an approach to entity linking in the Semantic Web based on genetic programming (GP) where two important parameter of the algorithm, namely mutation and crossover rate, are dynamically adapted by means of a fuzzy controller
The authors follow previous approaches to entity linking based on GP, cited in Section 2, in that GP is used to learn from a set of examples a similarity function to be applied to pairs of resources in order to decide whether they represent the same entity or not.

The main original contribution of the authors is to make such GP "adaptive", by using a fuzzy controller to tweak the mutation rate and the crossover rate, which are two important paramenter of GP, so as to accelerate the convergence of the evolutionary process toward the global optimum.
Two other contributions cited by the authors are the use of double tournament selection to avoid bloat (a well-known phenomenon in GP, whereby a tendency is observed toward evolving increasingly large and complicated solutions) and the design of a syntax-aware crossover operator.
Now, none of the three contributions can be said to be "original": they have been known in the field of evolutionary computation for some twenty years, as I will argue below. However, the authors do not give credits to their sources, which is a mortal sin when writing a scholarly publication. What may be considered an original contribution, but then a minor one, is the idea to apply them to this specific problem of entity linking arising in the field of the Semantic Web.

Let us now discuss, one by one, the three contributions.

1. The idea of using a fuzzy controller to govern the evolutionary process, of any evolutionary algorithm, not just GP, is quite old. To the best of my knowledge, the idea was proposed in 1993 by Lee and Takagi [a,b,c], Bergmann, Burgard, and Hemker [d], and further elaborated upon by Arnone, Dell'Orto, and Tettamanzi in 1994 [e]. Surveys and overviews can be found in Herrera and Lozano 2003 [f] and in Chapter 7 of the book by Tettamanzi and Tomassini 2001 [g].

2. The idea of using double tournament as a bloat-control mechanism was discussed, along with many others, by Sean Luke and Liviu Panait in a journal article published in 2006 in the most prominent journal on evolutionary computation [h].

3. What the authors call a "specific" crossover operator is in fact a syntax-aware crossover operator. Proposals of this sort are too numerous in the literature on GP to give a list of citations here. On the other hand, there exists a flavor of GP, called "Grammatical Evolution" [i], which is built around this very idea. The authors do not seem to be aware of it and might want to look into such a method as a possible (and more general) alternative to their ad hoc solution.

The authors do not look familiar with the notion of syntax and grammar. Figure 3 seems to me a rather confusing attempt at describing the syntax (not the structure!) of a linkage rule, resorting to an UML diagram instead of what would be the most natural tool in such case, namely a BNF grammar, which would read something like:

LinkageRule ::= SimilarityOperator
SymilarityOperator ::= AggregationOperator | FieldComparisonOperator
AggregationOperator ::= AggregationFunction "(" ListOfSimilarityOperators ")"
... etc.

In general, the whole article suffers from a lack of scholarship. Many citations are missing in places where a reader would not only expect them, but also need them. For instance, and in additions to the citations giving credits for the sources of inspiration of the three "contributions" discussed above, the authors mention Mamdani's fuzzy inference without giving a citation for it. Another example is the center of sums defuzzification method.

Another important weakness of the article is the lack of technical soundness in the way the experimental protocol is presented (and thus designed?) and results are reported.
The authors do not specify the number of runs of their GP algorithm per experiment. This and other evidence lead me to suspect that the results reported in the manuscript stem from a single GP run, which is too little to say anything about any evolutionary algorithm. That is because GP is a stochastic method which may give different results every time it is executed; therefore, what the reader needs to be able to evaluate its performance is an idea of the probability distribution of the results. A rule of thumb which is generally accepted in the field of evolutionary computation is that at least 30 runs for each experiment (i.e., parameter setting, problem instance, etc.).
As a consequence, it is not clear what is plotted in Figures 15, 18, 21, 24, ... (those with box plots): one would expect each box to represent the distribution of the fitness of the best solution over a number of runs; however, if only one run was performed, then I understand each box would represent the distribution of the best fitness for each generation (= iteration) during that single run. I'm also afraid of how the p-values have been computed to test the significance of the difference of performance: everything leads me to suspect that the authors compared two populations consisting of the best fitness values for each iteration for GPLR Miner vs. GFPLR Miner. If that's the case, the comparison is not meaningful. What is interesting (and should be compared) is just the final result for each run of either method!
In view of these issues, nothing can be said, based on the information given by the manuscript, about the significance of the results.


[a] M. Lee and H. Takagi. "Dynamic Control of Genetic Algorithms using Fuzzy Logic Techniques", in Proc. of the 5th Int'l Conf. on Genetic Algorithms, 1993.
[b] M. Lee and H. Takagi. "Embedding Apriori Knowledge into an Integrated Fuzzy System Design Method Based on Genetic Algorithms", in Proceedings of the 5th IFSA World Congress (IFSA'93), Vol. II, pp. 1293-1296, 1993.
[c] M. Lee and H. Takagi. "Integrating Design Stages of Fuzzy Systems using Genetic Algorithms", Proceedings of the 2nd International Conference on Fuzzy Systems (FUZZ-IEEE'93), Vol. I, pp. 612-617, 1993.
[d] A. Bergmann, W. Burgard, and A. Hemker, "Adjusting Parameters of Genetic Algorithms by Fuzzy Control Rules", in New Computing Techniques in Physics Reserach III, 1993.
[e] S. Arnone, M. Dell'Orto, and A. Tettamanzi. "Toward a Fuzzy Government of Genetic Populations", in Proc. of Int'l Conference on Tools with AI 1994, pp. 585-591.
[f] Francisco Herrera and Manuel Lozano. Fuzzy adaptive genetic algorithms: design, taxonomy, and future directions. Soft Computing 7(8): 545-562 (2003).
[g] Andrea Tettamanzi and Marco Tomassini. Soft computing - integrating evolutionary, neural, and fuzzy systems. Springer 2001, ISBN 978-3-540-42204-4.
[h] Sean Luke and Liviu Panait. A Comparison of Bloat Control Methods for Genetic Programming. Evolutionary Computation 14(3): 309-344 (2006).
[i] Conor Ryan, J. J. Collins, Michael O'Neill. "Grammatical Evolution: Evolving Programs for an Arbitrary Language". EuroGP 1998: pp. 83-96.