An Analysis of the Performance of Representation Learning Methods for Entity Alignment: Benchmark vs. Real-world Data

Tracking #: 3636-4850

Authors: 
Ensiyeh Raoufi
Bill Gates Happi Happi
Pierre Larmande
Francois Scharffe
Konstantin Todorov

Responsible editor: 
Guest Editors OM-ML 2024

Submission type: 
Full Paper

Abstract:
Representation learning for Entity Alignment (EA) aims to map, across two Knowledge Graphs (KGs), distinct entities that correspond to the same real-world object using an embedding space. The similarity of the entities can be measured based on the similarity of the learned embeddings, which serves as a proxy for that of the real-world objects. Although many embedding-based models show very good performance on certain synthetic benchmark datasets, benchmark overfitting limits the applicability of these methods in real-world scenarios where we deal with highly heterogeneous, incomplete, and domain-specific data. While there have been efforts to create benchmark datasets that reflect real-world scenarios as closely as possible, there has been no comprehensive analysis and comparison of the performance of methods on synthetic benchmark and real-world heterogeneous datasets. In addition, most existing models report their performance by excluding from the alignment candidate search space entities that are not part of the validation data. This under-represents the knowledge and the data contained in the KGs, limiting the ability of these models to find new alignments in large-scale KGs. We analyze models with competitive performance on widely used synthetic benchmark datasets, such as the cross-lingual DBP15K. We compare the performance of the selected models on real-world heterogeneous datasets beyond DBP15K and show that most current approaches are not effectively capable of discovering mappings between entities in the real world, due to the above-mentioned drawbacks. We compare the utilized methods from different aspects and measure joint semantic similarity and profiling properties of the KGs to explain the models' performance drop on real-world datasets. Furthermore, we show how tuning the EA models by restricting the search space only to validation data affects the models' performance and causes them to face generalization issues. By addressing practical challenges in applying EA models to heterogeneous datasets and providing valuable insights for future research, we signal the need for more robust solutions in real-world applications.

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Vasilis Efthymiou submitted on 10/Apr/2024
Suggestion:
Major Revision
Review Comment:

This paper focuses on the problem of the disparity between real-world and synthetic benchmark datasets used in the evaluation of embedding-based entity alignment (EA) methods. The study suggests that some of the best-performing EA methods according to benchmark data are, in fact, not doing so well on real-world datasets with high heterogeneity and many missing values. The authors also show that using only the validation data split for tuning causes generalization issues for the models.

(1) Originality: While the points raised by the authors are, indeed, open problems in the literature, this paper is not the first to raise them, and the authors do not explain what makes this paper different from other, similar works that raise similar problems, or even from existing EA benchmarks.

(2) Significance of the results: I completely agree with the authors that the real-world data that need to be aligned are WAY different from existing EA benchmarks. Therefore, I believe that the more papers raise this point and show experimental evidence for it, the better it is for the community.

(3) Quality of writing: The paper is well-written, in general, and easy to follow. However, there are many points, especially in the more formal parts, like terminology and formulas, that need to be improved. Those improvement points are listed in my detailed comments below.

Detailed comments:

D1.
A main weakness is the lack of comparison to other works in the area, which blurs the originality of this paper. For example, the following statement
"There have been multiple surveys and experimental studies conducted on methods designed for entity alignment across knowledge graphs [36, 49, 50]."
is left without any further discussion of what makes this paper different from [36, 49, 50], or which new conclusions are drawn from this work that do not already exist in those papers. As I am one of the co-authors of [36], I see some similar conclusions between the two papers. Take for example the following quote from [36]:
"To compare the effectiveness and efficiency of the methods in realistic settings, we have extended the testbed of datasets with pairs of KGs usually considered in related empirical studies. Specifically, OpenEA, EAE and Leone et al. (2022), Zhang et al. (2022), Chaurasiya et al. (2022) employ only datasets with a low number of entities featuring descriptions and literal values. In our testbed, we have included five additional datasets whose unique characteristics allow us to draw new insights regarding the evaluated EA methods, which were not previously reported in Sun et al. (2020), Zhang et al. (2020), Zhao et al. (2022), Leone et al. (2022), Zhang et al. (2022), Chaurasiya et al. (2022). More precisely, supervised methods like RDGCN exploiting KG relations, are outperformed by unsupervised (AttrE) and semi-supervised (KDCoE) methods that exploit the similarity of literals in datasets of decreasing density, but with rich factual information (i.e., attributes)."
Isn't that close to what the authors try to convey as a main message of this work? I am not suggesting in any way that those papers are the same. However, I believe that such commonalities, as well as differences with works like [36, 49, 50], are very important and should be highlighted.

The following research question in [36]:
"Q4. Characteristics of datasets. To which characteristics of the datasets (e.g., sparsity, number of entity pairs in seed alignment, heterogeneity in terms of literals, predicate names and entity names) are supervised, semi-supervised and unsupervised methods sensitive?"
is also very close to the main research question of this paper.

The following statement
"The study also explores entity alignment on large-scale knowledge graphs like Wikidata [51] and Freebase [52], providing insights and addressing limitations in existing benchmark datasets",
when comparing this work to Zhang et al. [35], also does not mention what those limitations addressed in [35] are, and whether they are also addressed in this paper. For example, in Zhang et al. [35], the authors address the bijection (aka one-to-one) assumption made by most EA methods. Isn't that similar to the notion of heterogeneity addressed in this paper?

Another example of unclear distinction between this and other works is the statement just before the list of contributions: "In a different way to the cited studies, in this paper, we focus on analyzing the performance of embedding-based EA methods having different representation learning principles on real-world datasets together with presenting an in-depth analysis of the dataset features."
What does "In a different way" mean? What makes this paper different, explicitly? Please elaborate.

Yet another example, which is not only unclear, but I would even dare to say, not true: "the singularity of our work with respect to related studies is that it goes beyond benchmark evaluations to address the practical challenges of applying EA models to heterogeneous datasets."

Finally, the notion of semantic similarity in Section 4.2.5 is also something that has been already suggested in the literature. E.g., see "Descr_Sim" and other related features in Table 4 of [36].

In addition to those points, there also exist some very relevant works that are not cited in this paper, but they should have been. To name a few:
- Fanourakis et al., "Structural Bias in Knowledge Graphs for the Entity Alignment Task", ESWC 2023: makes the following statement "As highlighted by previous empirical studies [14,55], and also confirmed experimentally in the current study, real-life KGs are characterized by power-law distributions with respect to the number of connected components and node degrees. Our KG sampling aims to assess to what extent EA methods leave the long tail of entities of KGs under-represented in the correct matches (true positives)." and also introduces a sampling mechanism for controlling the level of (structural) heterogeneity in the produced KGs.

- Leone et al., "A critical re-evaluation of neural methods for entity alignment", PVLDB 15(8), 2022:
criticizes the use of Hits@k in EA, instead of precision and recall (see my detailed point D2).

- Mao et al., "Relational reflection entity alignment", CIKM 2020 (see my detailed point D3).

D2. For a paper that adopts a critical look (and is rightfully doing so) at the existing evaluation mechanism of EA works, I was rather disappointed to see that the authors did not make a statement about the evaluation measures used in a typical EA survey. I.e., the authors adopt Hits@k without mentioning the weaknesses and the assumptions of this measure. They even use Hits@k in contrast to Recall in one experiment (Table 3), which is something that is sometimes done in the literature, but it deserves a more detailed description. Those measures are not the same, although they can sometimes be used to compare methods in the absence of other, better ways to compare them, e.g., when the evaluated systems are not open source, or the available resources are not enough to run new experiments.
This criticism against Hits@k is also provided in Leone et al., PVLDB 2022 (see D1).
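
To make the distinction concrete, the following minimal sketch computes the two measures on made-up ranks and alignment pairs (this is purely illustrative and not the authors' evaluation code):

```python
import numpy as np

def hits_at_k(ranks, k):
    """Fraction of test pairs whose true counterpart appears within the top-k ranked candidates."""
    ranks = np.asarray(ranks)
    return float((ranks <= k).mean())

def recall(predicted_pairs, gold_pairs):
    """Fraction of gold alignments recovered by a set of predicted matches."""
    return len(set(predicted_pairs) & set(gold_pairs)) / len(gold_pairs)

# Made-up ranks of the true match for five test entities, and made-up predicted/gold pairs.
ranks = [1, 3, 12, 2, 50]
print(hits_at_k(ranks, 1))    # 0.2
print(hits_at_k(ranks, 10))   # 0.6
gold = {("e1", "f1"), ("e2", "f2"), ("e3", "f3")}
pred = {("e1", "f1"), ("e3", "f9")}
print(recall(pred, gold))     # 0.33...
```

Hits@k presupposes a ranked candidate list per test entity (and typically a one-to-one gold mapping), whereas recall is defined over an explicit set of predicted matches; this is exactly why the two are not interchangeable.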

D3. The authors have picked very good representative methods for EA, some of which are even published in 2023 (e.g., i-Align). However, some of the best-performing EA models are not evaluated, without proper justification. I am referring to RREA (Mao et al., "Relational reflection entity alignment", CIKM 2020) and COTSAE (Yang et al., "COTSAE: CO-Training of Structure and Attribute Embeddings for Entity Alignment", AAAI 2020), cited as [77] in the references of this paper but never mentioned in the text, both of which belong to the "co-training" group, as defined by the authors, if I am not mistaken. If the authors want to use one representative method from each group, then BERT-INT is, indeed, the method that I would also choose for the "co-training" group, as it is more widely used.

D4. In the selected datasets, it is good that the authors use two real-world datasets (DOREMUS and AGROLD), but it's not clear why they do not include the mono-lingual datasets from OpenEA, which are more heterogeneous than the multi-lingual ones.

D5. Looking at the distributions in Figure 1, I made two observations:
a) a very high percentage of nodes (>60%) in SPIMBENCH and AgroLD have degree 1.
b) In all datasets, the distribution of KG1 is very similar to the distribution of KG2.
Those observations, especially the latter, somehow contradict the purpose of this paper, which was to study realistic data, especially in terms of heterogeneity. I am not questioning that those datasets are real; I am questioning whether those datasets are representative of common distributions in real datasets. Please elaborate on those observations. It may even be interesting to report a measure of the connectivity of those datasets. See also D1.
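
To illustrate the kind of connectivity statistics I have in mind, here is a minimal sketch (the tab-separated triple file and its name are only assumptions about how the KGs are stored):

```python
import networkx as nx
import numpy as np

def connectivity_stats(triples_path):
    """Report median degree, share of degree-1 nodes, and number of connected components
    for a KG stored as one 'head<TAB>relation<TAB>tail' line per relation triple."""
    g = nx.Graph()
    with open(triples_path) as f:
        for line in f:
            head, _, tail = line.rstrip("\n").split("\t")
            g.add_edge(head, tail)
    degrees = np.array([d for _, d in g.degree()])
    return {
        "median_degree": float(np.median(degrees)),
        "share_degree_1": float((degrees == 1).mean()),
        "connected_components": nx.number_connected_components(g),
    }

print(connectivity_stats("kg1_rel_triples.tsv"))  # hypothetical file name
```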

D6. As a follow-up to D1: It's not clear what the key take-away messages of this paper are that have not already been provided by other works. This should also be reflected in the Conclusions section.

Minor points (roughly in order of appearance):
- The first paragraph in the intro is too long. Perhaps split it into two paragraphs (e.g., the second one starting with "As data originates [...]").

- "EA is a process within the realm of entity alignment": This makes no sense; especially if you consider that EA stands for entity alignment.

- "Entity alignment involves establishing correspondences or alignments between entities in different knowledge graphs that share similar or identical meanings.": This is redundant and can be omitted.

- "LogMAP [53], receiving a Ten-Year Award for its significant impact from the International Semantic Web Conference a decade ago": What you mean by "a decade ago" can easily be misunderstood. Please rephrase.

- "DLinker [...] is developed in our team and thus facilitates further experiments": This is an honest statement and understandable. It would be worth mentioning, though, that LogMap is open-source and quite easy to adapt, too, in order to facilitate further experiments.

- "(supervised, semi-supervised, or unsupervised)[34, 36]": add blank space before "[34, 36]".

- "a relation predicate is embedded as a translation vector from a head entity to a tail.": Head and tail (entities) have not been introduced

- "Finally, there are methods that some papers mention as others": I suggest to put "others" in quotes.

- "KG’s co-training models instead": "KG's" is not used correctly here.

- "These models do not need to embed entire KGs which this feature makes": Delete "this feature"

- "MultiKE augments what the number of": Delete "what"

- Description of BERT-INT: I think that you only describe the name and the neighbor views. I missed the description of the attribute view.

- "We first show the degree distribution": The in-degree, the out-degree, both?

- "visualize it for degrees of up to 10": Why 10? I think the median values should also be reported. See also my detailed point D5.

- Table 2 "(all numbers indicate percentages)": Not true (e.g., KG Sizes)

- What does "Normalized" mean exactly in Equation 2? I suggest to replace/explain it with a formula, and also describe "We then normalize the amounts of relative semantic distances and take an average over them" more explicitly, accordingly.

- In Figure 2, there are some red dots in the bottom-right corner, without any links, as well as some red point(s) in the middle-left, surrounded by light-blue points. Could you explain what those points are?

- "Therefore, using these models could not be efficient in producing the alignments in real-world heterogeneous datasets.": I find this to be a strong statement that should probably be toned-down a little bit.

- References: Please prefer the peer-reviewed versions of papers, over their arXiv versions (e.g., for [36], but this also applies to many other papers in the references).

Assessment of the data file provided by the authors under “Long-term stable URL for resources”.
(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data:
The GitHub repo does contain a general README file, as well as internal README files within each model's directory. This is helpful, but those README files are rather poor in content. That, in addition to the lack of comments in the code, decreases the readability and usability of this repo.

(B) whether the provided resources appear to be complete for replication of experiments, and if not, why:
Yes

(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability:
Yes (GitHub)

(D) whether the provided data artifacts are complete:
Yes

Review #2
Anonymous submitted on 12/Apr/2024
Suggestion:
Minor Revision
Review Comment:

The paper investigates the performance of various embedding-based Entity Alignment (EA) models, focusing on their effectiveness in handling real-world heterogeneous datasets versus synthetic benchmark datasets. The authors explore how these models, which perform well on synthetic benchmarks, often fail to generalize to real-world datasets that feature greater heterogeneity in terms of structure, terminology, and data quality.

However, there are a few weaknesses and questions.

(1) The authors mention that the CEWS-WIKI and ICEWS-YAGO datasets are more similar to real-world KGs, but they did not run experiments on these. This weakens the empirical results, so could the authors run extensive experiments on other real-world datasets, or on methods that work on these two datasets? Additional datasets from varied domains could strengthen the generalizability of the findings.

(2) In Section 4.2.1, the authors mention "It gives us a number in the range of [0,1]"; which number is referred to here? The JS divergence numbers in Table 2 are not in this range; can the authors elaborate on how the reported numbers in Table 2 are calculated? (See the sketch after these points.)

(3) While the analysis of dataset features is thorough, it might overshadow the intrinsic capabilities of the EA models themselves. A more balanced analysis could also consider the algorithmic enhancements necessary to tackle dataset heterogeneity.

(4) Can the authors elaborate on specific algorithmic changes they suggest for EA models to better handle real-world data heterogeneity? It would be better to propose some methods to support the findings.
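
Regarding point (2): the Jensen–Shannon divergence lies in [0,1] only when computed with base-2 logarithms (with the natural logarithm it is bounded by ln 2 ≈ 0.693), and some libraries return the JS distance, i.e., the square root of the divergence. A minimal sketch with hypothetical distributions:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical degree (or predicate) distributions of two KGs, as probability vectors.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
m = 0.5 * (p + q)

def kl(a, b, base):
    """Kullback-Leibler divergence in the given logarithm base."""
    return float(np.sum(a * (np.log(a / b) / np.log(base))))

jsd_base2 = 0.5 * kl(p, m, 2) + 0.5 * kl(q, m, 2)      # bounded by 1
jsd_nat = 0.5 * kl(p, m, np.e) + 0.5 * kl(q, m, np.e)  # bounded by ln(2)
print(jsd_base2, jsd_nat)

# SciPy returns the JS distance, i.e., the square root of the divergence.
print(jensenshannon(p, q, base=2) ** 2)  # equals jsd_base2
```

Clarifying which of these quantities (and which logarithm base) is reported in Table 2 would resolve the apparent inconsistency.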

Review #3
Anonymous submitted on 07/May/2024
Suggestion:
Major Revision
Review Comment:

#General

In this work, the authors present an evaluation of different embedding-based Entity Alignment methods on both real and synthetic data.
Although the work is promising, I need to recommend a major revision.
Some parts of the text should be appropriately revised (spell-checked).
There are also issues regarding the evaluation, such as the choice of the datasets and misleading assertions.
However, one of the strengths of the current work is that it provides a link to the benchmark, which facilitates reproducibility.

#Abstract

"benchmark overfitting limits the applicability of these methods in real-world scenarios where we deal with highly heterogeneous, incomplete, and domain-specific data"
> This sentence is misleading. There are no signs of data overfitting in the evaluated models.

#Secontion 2 & 3

* I recommend merging sections 2 and 3.

* "In the last column of Table 2, we report the result of calculating the semantic similarity of the datasets based on the initial embedding of attribute values of the aligned entities using a pre-trained multilingual BERT model."
> Could you reference which model you are using here? (A minimal sketch of the kind of similarity computation I have in mind appears right after these comments.)

* “Note that, despite the high heterogeneity aspects of CEWS-WIKI and ICEWS-YAGO, which make them more similar to the real-world KGs, we do not run experiments on these datasets. The reason is that these two datasets lack the attribute triples and only contain the relation triples. Hence, the only model that we can employ for them is RDGCN. Because of the lack of possibility to obtain results comparable across all models for these two datasets, we leave them aside.”
> Does it mean that you only check “attribute triples”? Why so? You can generate embeddings from both attribute and relation triples, so there does not seem to be a reason not to use them.
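
To make the question about the multilingual model concrete, something like the following minimal sketch is the kind of computation I have in mind; the encoder name below is only a placeholder, since the paper does not specify its model, and the attribute strings are made up:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder encoder; the paper only says "a pre-trained multilingual BERT model".
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_similarity(aligned_pairs):
    """Average cosine similarity of concatenated attribute values over aligned entity pairs."""
    left = model.encode([a for a, _ in aligned_pairs], normalize_embeddings=True)
    right = model.encode([b for _, b in aligned_pairs], normalize_embeddings=True)
    return float(np.mean(np.sum(left * right, axis=1)))

# Made-up aligned pair with concatenated attribute values from the two KGs.
pairs = [("Symphony No. 5 in C minor, 1808", "Symphonie Nr. 5 c-Moll (1808)")]
print(semantic_similarity(pairs))
```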

#5.3. Evaluation Tasks

* “Besides the results of Table 2, looking at Figure 1, we can see the long-tailed issue of AGROLD’s KGs.” SPIMBENCH also has a long-tail problem, so why does it perform well?
> I do not see this as a reason to explain the performance differences. I found, however, the following explanation more enlightening:
“Because the names (which are the last part of the entity URIs by the model’s default) of the musical works and the proteins/genes in DOREMUS and AGROLD have been defined by some IDs in their ontologies, entity initial embeddings would not also be able to guide the embedding module to a better result. All the aforementioned observations can explain the low performance of RDGCN on DOREMUS (1.2% of Hit@1) and its very weak performance (less than 1% of Hit@1) on AGROLD”
> That means that, for a fair comparison, it would be better to add the missing features to the embeddings and not only the entity names. That also made me wonder what the performance would be using, e.g., OpenAI GPT embeddings, which should already include all those pre-trained features.

* In Table 3, DLinker reports recall; why not Hits@1 and Hits@10, as for the other evaluated methods?

* “These results show that embedding-based EA models still need improvements to be able to find the aligned entities in the realistic search space of the knowledge graphs.”
> Actually, what you show is that embeddings are quite capable of aligning datasets using entity names; what you need to show is how they perform when the whole set of features, and not just entity names, is considered in the embeddings. Further, I am really curious how LLM embeddings would perform here; it would be nice to have an evaluation using LLMs instead of only BERT.
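
To be concrete about what I mean by LLM embeddings, here is a minimal sketch using an embedding endpoint in place of BERT initial embeddings; the model name and entity texts are only placeholders:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts, model="text-embedding-3-small"):
    """Return L2-normalized embedding vectors for a list of entity texts."""
    resp = client.embeddings.create(model=model, input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Placeholder textual views (names, attributes, descriptions) of two entities across KGs.
kg1 = embed(["Ludwig van Beethoven, Symphony No. 5, C minor, 1808"])
kg2 = embed(["Symphonie Nr. 5 c-Moll, komponiert von Beethoven"])
print(float(kg1 @ kg2.T))  # cosine similarity of the two entity representations
```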

#5.3.2. Analyzing Effectiveness of Inference by the Models

* Commonly, EA methods (embedding-based or not) reduce the search space for scalability purposes.
“These results show that embedding-based EA models still need improvements to be able to find the aligned entities in the realistic search space of the knowledge graphs.”
> I would say a drop of 0.1 or 0.02 in Hits@1/@10 is acceptable, considering alignment over the whole dataset. Also, it would be prudent to check whether all entities in the target KG are present in the origin KG, which cannot be assessed from the currently provided information, and how the non-embedding-based method (DLinker) performs on this task.

#Conclusion

* “As future work, regarding the observed issues, in each model using the textual values of entities as input features, we will consider not only relying on lexico-semantic embeddings [27], but also on using rule-based KG embedding [102, 103]. One can consider injecting background knowledge [104] in models that mostly rely on the graph structure to enhance their performance on datasets that structurally are highly heterogeneous.”
> Consider doing it now.

#Minors

* Please revise your document. There are many sentences with incorrect verb tenses and sentences that are difficult to understand.

* “However, as a representative method of the non-embedding-based group, we chose DLinker [24] because of its excellent results and the fact that it is developed in our team and thus facilitates further experiments.”
> You chose it for what?

* “but not limited to the semantic web” -> “Semantic Web”

* “we select to analyze their performance and we prepare a comparison Table on the methods” -> “we selectED to analyze their performance and we preparED a comparison Table on the methods”