Review Comment:
This paper focuses on the disparity between real-world and synthetic benchmark datasets used in the evaluation of embedding-based entity alignment (EA) methods. The study suggests that some of the best-performing EA methods according to benchmark data do not, in fact, perform as well on real-world datasets with high heterogeneity and many missing values. The authors also show that using only the validation data split for tuning causes generalization issues for the models.
(1) Originality: While the points raised by the authors are, indeed, open problems in the literature, this paper is not the first to raise them, and the authors do not explain what makes this paper different from other works that raise similar problems, or even from existing EA benchmarks.
(2) Significance of the results: I completely agree with the authors that real-world data that need to be aligned are WAY different from existing EA benchmarks. Therefore, I believe that the more papers raise this point and provide experimental evidence for it, the better for the community.
(3) Quality of writing: The paper is, in general, well-written and easy to follow. However, there are many points, especially in the more formal parts such as terminology and formulas, that need to be improved. These points are listed in my detailed comments below.
Detailed comments:
D1.
A main weakness is the lack of comparison to other works in the area, which blurs the originality of this paper. For example, the following statement
"There have been multiple surveys and experimental studies conducted on methods designed for entity alignment across knowledge graphs [36, 49, 50]."
is left without any further discussion of what makes this paper different from [36, 49, 50], or which new conclusions are drawn from this work that do not already appear in those papers. As I am one of the co-authors of [36], I see some similar conclusions between the two papers. Take, for example, the following quote from [36]:
"To compare the effectiveness and efficiency of the methods in realistic settings, we have extended the testbed of datasets with pairs of KGs usually considered in related empirical studies. Specifically, OpenEA, EAE and Leone et al. (2022), Zhang et al. (2022), Chaurasiya et al. (2022) employ only datasets with a low number of entities featuring descriptions and literal values. In our testbed, we have included five additional datasets whose unique characteristics allow us to draw new insights regarding the evaluated EA methods, which were not previously reported in Sun et al. (2020), Zhang et al. (2020), Zhao et al. (2022), Leone et al. (2022), Zhang et al. (2022), Chaurasiya et al. (2022). More precisely, supervised methods like RDGCN exploiting KG relations, are outperformed by unsupervised (AttrE) and semi-supervised (KDCoE) methods that exploit the similarity of literals in datasets of decreasing density, but with rich factual information (i.e., attributes)."
Isn't that close to what the authors try to convey as a main message of this work? I am not suggesting in any way that those papers are the same. However, I believe that such commonalities, as well as differences with works like [36, 49, 50], are very important and should be highlighted.
The following research question in [36]:
"Q4. Characteristics of datasets. To which characteristics of the datasets (e.g., sparsity, number of entity pairs in seed alignment, heterogeneity in terms of literals, predicate names and entity names) are supervised, semi-supervised and unsupervised methods sensitive?"
is also very close to the main research question of this paper.
The following statement
"The study also explores entity alignment on large-scale knowledge graphs like Wikidata [51] and Freebase [52], providing insights and addressing limitations in existing benchmark datasets",
when comparing this work to Zhang et al. [35], does not mention what those limitations addressed in [35] are, or whether they are also addressed in this paper. For example, in Zhang et al. [35], the authors address the bijection (aka one-to-one) assumption made by most EA methods. Isn't that similar to the notion of heterogeneity addressed in this paper?
Another example of an unclear distinction between this and other works is the statement just before the list of contributions: "In a different way to the cited studies, in this paper, we focus on analyzing the performance of embedding-based EA methods having different representation learning principles on real-world datasets together with presenting an in-depth analysis of the dataset features."
What does "In a different way" mean? What makes this paper different, explicitly? Please elaborate.
Yet another example, which is not only unclear but, I would even dare to say, not true: "the singularity of our work with respect to related studies is that it goes beyond benchmark evaluations to address the practical challenges of applying EA models to heterogeneous datasets."
Finally, the notion of semantic similarity in Section 4.2.5 is also something that has already been suggested in the literature; see, e.g., "Descr_Sim" and other related features in Table 4 of [36].
In addition to these points, there are also some very relevant works that are not cited in this paper but should have been. To name a few:
- Fanourakis et al., "Structural Bias in Knowledge Graphs for the Entity Alignment Task", ESWC 2023: makes the following statement "As highlighted by previous empirical studies [14,55], and also confirmed experimentally in the current study, real-life KGs are characterized by power-law distributions with respect to the number of connected components and node degrees. Our KG sampling aims to assess to what extent EA methods leave the long tail of entities of KGs under-represented in the correct matches (true positives)." and also introduces a sampling mechanism for controlling the level of (structural) heterogeneity in the produced KGs.
- Leone et al., "A critical re-evaluation of neural methods for entity alignment", PVLDB 15(8), 2022:
criticizes the use of Hits@k in EA, instead of precision and recall (see my detailed point D2).
- Mao et al., "Relational reflection entity alignment", CIKM 2020 (see my detailed point D3).
D2. For a paper that adopts a critical look (and rightfully so) at the existing evaluation mechanisms of EA works, I was rather disappointed to see that the authors do not discuss the evaluation measures used in a typical EA survey. That is, the authors adopt Hits@k without mentioning the weaknesses and assumptions of this measure. They even use Hits@k in contrast to Recall in one experiment (Table 3), which is something that is sometimes encountered in the literature, but it deserves a more detailed description. These measures are not the same, although they can sometimes be used to compare methods in the absence of better alternatives, e.g., when the evaluated systems are not open source, or when the available resources do not suffice to run new experiments.
This criticism of Hits@k is also made in Leone et al., PVLDB 2022 (see D1).
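To make this distinction concrete, below is a minimal sketch in Python (on toy data that I invented purely for illustration; none of it comes from the paper): Hits@k is computed over a ranked candidate list for every source entity and implicitly assumes that each source entity has exactly one gold counterpart, whereas precision/recall are computed over a set of predicted matches on which a system may also abstain.

# Toy illustration (invented data): Hits@k vs. precision/recall for EA.

def hits_at_k(ranked_candidates, gold, k):
    """Fraction of source entities whose gold counterpart appears among the
    top-k ranked candidates; implicitly assumes every source entity has
    exactly one true counterpart (the usual one-to-one assumption)."""
    hits = sum(1 for src, tgt in gold.items()
               if tgt in ranked_candidates[src][:k])
    return hits / len(gold)

def precision_recall(predicted_pairs, gold):
    """Precision/recall over a set of predicted (src, tgt) pairs; the
    system is allowed to abstain, so the two measures can diverge."""
    gold_pairs = set(gold.items())
    tp = len(predicted_pairs & gold_pairs)
    precision = tp / len(predicted_pairs) if predicted_pairs else 0.0
    recall = tp / len(gold_pairs)
    return precision, recall

# Gold alignment: each source entity mapped to its single true counterpart.
gold = {"s1": "t1", "s2": "t2", "s3": "t3"}
# Ranked candidate lists of a hypothetical embedding-based matcher.
ranked = {"s1": ["t1", "t9"], "s2": ["t7", "t2"], "s3": ["t8", "t9"]}
# Set-based output of a hypothetical system that abstains on s3.
predicted = {("s1", "t1"), ("s2", "t2")}

print(hits_at_k(ranked, gold, k=1))       # 1 out of 3 hits
print(hits_at_k(ranked, gold, k=2))       # 2 out of 3 hits
print(precision_recall(predicted, gold))  # precision = 1.0, recall = 2/3

The numbers happen to be comparable here only because every source entity is assumed to have a counterpart; once that assumption breaks, as it does in heterogeneous real-world KGs, the two measures are no longer interchangeable, and this deserves an explicit discussion in the paper.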
D3. The authors have picked very good representative methods for EA, some of which were even published in 2023 (e.g., i-Align). However, some of the best-performing EA models are not evaluated, without proper justification. I am referring to RREA (Mao et al., "Relational reflection entity alignment", CIKM 2020) and COTSAE (Yang et al., "COTSAE: CO-Training of Structure and Attribute Embeddings for Entity Alignment", AAAI 2020), cited as [77] in the references of this paper but never mentioned in the text, both of which belong to the "co-training" group as defined by the authors, if I am not mistaken. If the authors want to use one representative method from each group, then BERT-INT is, indeed, the method that I would also choose for the "co-training" group, as it is more widely used.
D4. Regarding the selected datasets, it is good that the authors use two real-world datasets (DOREMUS and AGROLD), but it is not clear why they do not include the mono-lingual datasets from OpenEA, which are more heterogeneous than the multi-lingual ones.
D5. Looking at the distributions in Figure 1, I made two observations:
a) A very high percentage of nodes (>60%) in SPIMBENCH and AgroLD have degree 1.
b) In all datasets, the distribution of KG1 is very similar to the distribution of KG2.
These observations, especially the latter, somewhat contradict the purpose of this paper, which is to study realistic data, especially in terms of heterogeneity. I am not questioning that these datasets are real; I am questioning whether they are representative of common distributions in real datasets. Please elaborate on these observations. It may even be interesting to report a connectivity measure for these datasets. See also D1.
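For concreteness, here is a minimal sketch (in Python, using networkx) of the kind of statistics I would expect to see reported per KG; it assumes the triples of each KG are available as (head, relation, tail) tuples, and the toy triples at the bottom are invented for illustration only.

# Hypothetical sketch: degree and connectivity statistics for one KG.
import statistics
import networkx as nx

def kg_statistics(triples):
    g = nx.Graph()  # undirected view, sufficient for connectivity
    g.add_edges_from((h, t) for h, _, t in triples)
    degrees = [d for _, d in g.degree()]
    return {
        "entities": g.number_of_nodes(),
        "median_degree": statistics.median(degrees),
        "share_degree_1": sum(d == 1 for d in degrees) / len(degrees),
        "connected_components": nx.number_connected_components(g),
    }

triples = [("a", "r1", "b"), ("b", "r2", "c"), ("d", "r1", "e")]
print(kg_statistics(triples))
# {'entities': 5, 'median_degree': 1, 'share_degree_1': 0.8, 'connected_components': 2}

Reporting such numbers (and, on the directed version of the graphs, the in- and out-degrees separately) would make it much easier to judge how heterogeneous the datasets really are.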
D6. As a follow-up to D1: it is not clear which key take-away messages of this paper have not already been provided by other works. This should also be reflected in the Conclusions section.
Minor points (roughly in order of appearance):
- The first paragraph in the intro is too long. Perhaps split it into two paragraphs (e.g., with the second one starting with "As data originates [...]").
- "EA is a process within the realm of entity alignment": This makes no sense; especially if you consider that EA stands for entity alignment.
- "Entity alignment involves establishing correspondences or alignments between entities in different knowledge graphs that share similar or identical meanings.": This is redundant and can be omitted.
- "LogMAP [53], receiving a Ten-Year Award for its significant impact from the International Semantic Web Conference a decade ago": What you mean by "a decade ago" can easily be misunderstood. Please rephrase.
- "DLinker [...] is developed in our team and thus facilitates further experiments": This is an honest statement and understandable. It would be worth mentioning, though, that LogMap is open-source and quite easy to adapt, too, in order to facilitate further experiments.
- "(supervised, semi-supervised, or unsupervised)[34, 36]": add blank space before "[34, 36]".
- "a relation predicate is embedded as a translation vector from a head entity to a tail.": Head and tail (entities) have not been introduced
- "Finally, there are methods that some papers mention as others": I suggest to put "others" in quotes.
- "KG’s co-training models instead": "KG's" is not used correctly here.
- "These models do not need to embed entire KGs which this feature makes": Delete "this feature"
- "MultiKE augments what the number of": Delete "what"
- Description of BERT-INT: I think that you only describe the name and the neighbor views. I missed the description of the attribute view.
- "We first show the degree distribution": The in-degree, the out-degree, both?
- "visualize it for degrees of up to 10": Why 10? I think the median values should also be reported. See also my detailed point D5.
- Table 2 "(all numbers indicate percentages)": Not true (e.g., KG Sizes)
- What does "Normalized" mean exactly in Equation 2? I suggest replacing or explaining it with an explicit formula (see my illustrative example after this list of minor points), and also describing "We then normalize the amounts of relative semantic distances and take an average over them" more explicitly, accordingly.
- In Figure 2, there are some red dots in the bottom-right corner, without any links, as well as some red point(s) in the middle-left, surrounded by light-blue points. Could you explain what those points are?
- "Therefore, using these models could not be efficient in producing the alignments in real-world heterogeneous datasets.": I find this to be a strong statement that should probably be toned-down a little bit.
- References: Please prefer the peer-reviewed versions of papers, over their arXiv versions (e.g., for [36], but this also applies to many other papers in the references).
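Regarding my comment on Equation 2 above: purely as an illustration of the level of explicitness I have in mind (I do not know which normalization the authors actually apply), a min-max normalization of the relative semantic distances d_1, ..., d_n followed by an average could be written as

\tilde{d}_i = \frac{d_i - \min_j d_j}{\max_j d_j - \min_j d_j}, \qquad \text{Normalized average} = \frac{1}{n} \sum_{i=1}^{n} \tilde{d}_i .

Whatever the actual definition is, spelling it out at this level of detail would remove the ambiguity.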
Assessment of the data file provided by the authors under “Long-term stable URL for resources”.
(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data:
The GitHub repo does contain a general README file, as well as internal README files within each model. This is helpful, but those README files are rather poor in content. That, in addition to the lack of comments in the code, decreases the readability and usability of this repo.
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why:
Yes
(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability:
Yes (GitHub)
(D) whether the provided data artifacts are complete:
Yes