Ontology Alignment Revisited: A Bibliometric Narrative

Tracking #: 2394-3608

Authors: 
Majid Mohammadi
Amir Ebrahimi Fard

Responsible editor: 
Jérôme Euzenat

Submission type: 
Survey Article
Abstract: 
Ontology alignment is an important problem in the Semantic Web with diverse applications in various disciplines. This paper delineates this vital field of study by analyzing a core set of research outputs from the domain. To this end, the related publication records for the period 2001 to 2018 are extracted using an appropriate query on the well-known Scopus database. The article details the evolution and progress of ontology alignment since its genesis by conducting two classes of analyses, namely semantic and structural, on the retrieved publication records. The semantic analysis entails the overall discovery of concepts, notions, and research lines flowing beneath ontology alignment, while the structural analysis provides a meta-level overview of the field by probing the collaboration network and citation patterns at the author and country levels. In addition to these analyses, the paper discusses the limitations of the field and puts forward directions for its further progress.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 01/Feb/2020
Suggestion:
Reject
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

I would not call this paper a survey paper, as it is a traditional bibliometric analysis of a small domain, ontology alignment. It does not provide details on the methods, applications, and development of the domain, offering instead only some facts about topics, authors, countries, and journals. It is hard to see the novelty of this paper. From the perspective of bibliometric analysis, it is a traditional study on a rather small dataset; the analytical angles are not new, just conventional bibliometric outputs, nothing novel or exciting. From the perspective of a survey of the ontology alignment field, in which I think both authors may be domain experts, there are no in-depth insights or detailed analysis of the methods and applications of the field: what are the pros and cons, and what is the future research direction for ontology alignment? The only thing that might be interesting is applying conventional bibliometrics to the publications on ontology alignment, and I am not sure whether that can be counted as novelty.

Some concrete comments:
- Please explain which four keywords you used to retrieve ontology alignment papers from Scopus and why you selected these four terms. Was any taxonomy or co-word analysis conducted to support this decision? If it was based on expert opinion, who were these experts and how were their opinions taken into account?
- Author name disambiguation: it is not clear whether you performed author name disambiguation, which is critical for the validity of the outputs.
- Since your dataset is quite small (only a few thousand papers), could you carry out some in-depth content analysis, e.g., extracting knowledge entities using BERT and finding relationships or concept evolution? Beyond showing who has published more papers or who the top-cited authors are, you could tell a better and more interesting story by showing the evolution of the field through in-depth concept analysis (such as entitymetrics); see the sketch after this list.
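 
(As a rough illustration of the BERT-based entity extraction suggested in the last point, here is a minimal sketch using the Hugging Face transformers NER pipeline. The sample abstracts are placeholders, and the default generic NER model is assumed purely for illustration; extracting scientific concepts would require a domain-tuned model.)

    # A minimal sketch of BERT-based entity extraction from abstracts.
    # Assumes the Hugging Face "transformers" package; the sample
    # abstracts are placeholders, and the default generic NER model is
    # used here only for illustration (a domain-tuned model would be
    # needed for scientific concepts).
    from collections import Counter
    from transformers import pipeline

    ner = pipeline("ner", aggregation_strategy="simple")

    abstracts = [
        "Ontology alignment resolves heterogeneity between knowledge bases.",
        "We evaluate our matcher on the OAEI anatomy track.",
    ]

    counts = Counter()
    for text in abstracts:
        for entity in ner(text):
            counts[entity["word"].lower()] += 1

    # Counting such entities per publication year would expose concept
    # evolution ("entitymetrics") rather than mere publication counts.
    print(counts.most_common(10))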

Review #2
Anonymous submitted on 03/Mar/2020
Suggestion:
Major Revision
Review Comment:

This paper presents a bibliometric study of the ontology matching field. Two kinds of analysis were carried out: a "semantic" analysis based on topic modeling, and a "structural" analysis concerning the network of research collaborations between teams and countries. The "semantic" analysis applies LDA topic modeling to the title, abstract, and keywords of research items (journal, conference, and workshop papers); this analysis is based on words and their frequency. Complementarily, the "structural" analysis is carried out on the top-cited articles in top-ranked journals: it first analyses the collaborations between different authors and countries, and then the disciplines associated with the topics of the analyzed data. The data were collected from Scopus and concern publications between 2001 and 2018.
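 
(For readers unfamiliar with the technique, a minimal sketch of this kind of LDA topic modeling over titles/abstracts/keywords, using scikit-learn; the toy corpus and the number of topics are placeholders, not the settings actually used in the paper.)

    # A minimal sketch of LDA topic modeling over titles/abstracts/
    # keywords, as summarized above. Uses scikit-learn; the toy corpus
    # and the number of topics are illustrative placeholders.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    documents = [
        "ontology alignment heterogeneous biomedical ontologies",
        "instance matching linked data interoperability",
        "ontology matching evaluation OAEI benchmark anatomy",
    ]

    # LDA operates on raw word frequencies, hence a count vectorizer.
    vectorizer = CountVectorizer(stop_words="english")
    term_matrix = vectorizer.fit_transform(documents)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(term_matrix)

    # Show the top words of each discovered topic.
    terms = vectorizer.get_feature_names_out()
    for i, weights in enumerate(lda.components_):
        top = [terms[j] for j in weights.argsort()[::-1][:5]]
        print(f"Topic {i}: {', '.join(top)}")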

To the best of my knowledge, this is the first bibliometric study of the ontology matching field. Although the paper describes an interesting piece of work, it could be improved in several directions, as detailed below.

** Scope of the study **

* the study addresses the task of ontology matching. Ontology matching and instance matching are distinct but closely related tasks aiming at facilitating interoperability between different knowledge bases at their terminological and assertional levels, respectively. The choice to focus on the former task should be clarified in the introduction, as this choice strongly impacts the study.

** Aim of the study **

* the authors state that "Although these materials [ontology matching surveys] are essential and help researchers get familiar with the notions of ontology alignment, they do not provide an overview of the field." I do not agree with that statement. The surveys in the field do provide an overview of the background, approaches, and strategies in ontology matching. The study here is complementary to what has so far been presented in other survey papers, analyzing the field from another perspective. The authors should be careful with such statements; the whole paragraph should be rephrased in that sense.

* another point is that, contrary to what is stated in the introduction ("the evolution and progress of ontology alignment since its genesis"), the evolution of the field is mostly described in quantitative terms (quantitative analysis of publications). We can see the "topics" that have so far been addressed in the papers, but the temporal dimension could be exploited. It would be interesting to have a timeline of the "topics" and how they appear over time (e.g., adding the temporal dimension to Fig. 2).

* still with respect to the "topics", it is interesting to see that the term "Process Model Matching" appears. In terms of OAEI participation, the track addressing this task received a (very) limited number of participants and has unfortunately been discontinued. It could be interesting to discuss the relation between the topics and the evaluation tasks in OAEI. A different point is that the topics "machine learning" and "biomedical ontologies" are quite different (though of course related) and could be placed in two different clusters.

** Kinds of analysis **

* with respect to the "semantic analysis", it could rather be called "topic modeling", since it is "reduced" to topic "discovery". In that sense, the authors should rephrase "Discovery of concepts, notions and research lines". Are the "notions" actually addressed here? It should also be made clear from the beginning of the paper that the "thematic" analysis is based only on journal papers from the most recent six years.

** Data sources and methodology **

* the authors use the terms "ontology matching", "ontology alignment" and "ontology mapping" as interchangeable. In the context of this study (using them as keywords for retrieving research items) this is reasonable. However, as in the Shvaiko and Euzenat books, these terms have different definitions. This should be clarified in the paper.

* specific parts of the methodology should be described in more detail: of the 3,289 retrieved documents, 2,975 were labeled as relevant. Was this a manual annotation? How many annotators? What were the annotation criteria? More importantly, the steps listed in Table 1 are not clear: is there no intersection between the sets returned at each step (examining the title, examining the abstract, etc.)? Only 61 papers seem to result from the step "inspecting the whole paper"? Additional (and important) information is also missing: the venue of publication of the papers per type.

* the choice of Scopus (instead of Web of Science) is only briefly motivated. Moreover, other sources of data, such as Google Scholar, could be used. Furthermore, Scopus should be described in terms of volume of data, etc. What are the statistics on the number of WoS documents not indexed by Scopus?

** General Data Protection Regulation **

* one very important aspect in this paper is about the publication of statistics on researchers and their production. It provides an analysis of the "impact of authors" and "their influence on the ontology alignment". Publishing the statistics on that (public) data involves following the General Data Protection Regulation rules.

** Reproducibility **

* it would be interesting to have some lines about the reproducibility of the study.

** Conclusions **

* the "discussions" and "conclusions" parts are a little repetitive, and some passages could be shortened in order to leave room for more "specific" conclusions. Furthermore, some passages of the conclusions should be revised and rephrased. Splitting the ontology matching community into "OAEI organisers" and "Chinese researchers" should be avoided. "OAEI organizers and participants have higher average citations than other researchers": this "phenomenon" could be explained by the fact that almost all papers on ontology matching perform an evaluation, which in many cases is based on OAEI datasets (hence citations to OAEI papers). Any other insights on that? The authors also state that "there are several topics that are totally neglected by researchers and the OAEI organizers in particular... modeling knowledge graphs". The authors neglected the existence of many OAEI tracks, including "knowledge graphs".

* while the conclusions provide quite general statements, there has been progress in the field since 2018, in particular with the new OAEI tracks covering complex alignments and domains other than biomedical ontologies. This should be mentioned instead.

Minor comments, to cite a few:

- "This paper delineates this vital field" => "This paper delineates this field"
- "It soon found"
- "many research studies have been dedicated to resolving the heterogeneity among information systems"
- "improve the field" ?
- "there is a book" => there are two editions of a book
- "some useful reviews and surveys" => there are a number of reviews and surveys
- "to benefit the tools"
- "articles [59]d."
- Figure 1 mixes the pipeline and paper organization (Section 2, ...)
- "This strategy is called ontology alignment (also called ontology mapping and ontology matching)" => use the "task" instead of "strategy".
- 4.2 Outputs in Top Percentiles WordWide => more intuitive title ?
- Fig. 9 does not bring much information
- "their influence on the ontology alignment" => "their influence on the ontology matching field"
- Jiminez => Ernesto Jimenez-Ruiz
- [92?]
- Legend of Figure 19 is incorrect
- Conclusions and Discussion => Discussion and Conclusions

Review #3
By Jérôme Euzenat submitted on 11/Mar/2020
Suggestion:
Reject
Review Comment:

The paper presents a bibliometric analysis of the field of Ontology matching. It applies a 'semantic analysis', trying to extract topics from papers, and a 'structural' analysis studying only the bibliographic characteristics of the literature (authorship, citation, etc.).

Since the Semantic web journal is not a journal about bibliometrics, this paper is rather particular for the journal. To be clear, it only uses classical techniques and does not apply semantic web technologies to bibliometrics. However, a part of the semantic web, Ontology matching, is the object of this study. That could be of interest to the journal readership, especially if remarkable features of the field were unveiled.

The work seems to have been done seriously, as far as I can judge, and most of what is expressed is clear.

Unfortunately, after reading it, I do not think it is worth publishing.

The main problem is the lack of objective: what is this work trying to assess? For most of the presented study, there is no hypothesis tested and no interesting finding reported (except at one point, but without seriously seeking to explain it; see below). It is as if the reported figures were entirely indifferent and the paper would have been the same with different figures.

Another issue is the lack of a baseline for the presented data. Indeed, it is impossible to know whether the observed properties are specific to the Ontology matching field, or whether they apply equally to other fields. At least, it would have been good to have a comparison with the broader context, i.e. with Semantic web and Computer science. It is possible to observe features of the Ontology matching field, but there is no way for the reader to tell whether these are remarkable or not.

Finally, in these times of Open science, it is regrettable that no mention is made of the availability of the data.

These points are the major issues. I discuss below various problems, some of them related to the issues above, some of them discussing particular points. They may help the authors to improve their paper.

* Organisation:

- The introduction does not state any goal for the paper, nor claim any findings. It rather describes the applied treatments.

- Section 2 provides a methodology. However, in the absence of a statement about the goal of the analysis, it is not possible to judge the relevance of the methodology.

- Section 2.1 details at length the preprocessing of WoS, before turning more succinctly to Scopus, which was actually used. This seems a strange way to present things.

* Data interpretation:

- The topic analysis is not particularly insightful. In particular, it does not provide much information on ontology matching itself, but rather on its uses. It seems to gather the terms in an unprincipled way ("heterogeneous" and "automatic" appear in most topics, "biomedical" is in the "learn" cloud while "anatomy" is in the "query" one, etc.). It would have been interesting to see all the generated 'topics' that the authors did not retain. In the end, it is unclear which conclusions may be drawn from the topic analysis.

- p10: 'Ontology alignment outputs form 6.1% of the top 10% most cited articles worldwide in year 2013' (I simplified). How can this be? From Fig. 3, there seem to be no more than 300 papers in Scopus on 'ontology alignment' for 2013. If all of them are in the 10% most cited papers and constitute 6.1% of that set, then the top 10% contains 300/0.061 ≈ 4,900 papers, which would mean that Scopus indexed only about 49,000 papers in total for 2013 (even fewer if not all 300 are among the most cited). This does not seem right: I counted 2.8M documents in Scopus for 2013.
It seems to me that what was meant is that 6.1% of the 'ontology alignment' papers are in the 10% most cited papers. Again, with no comparison to the same figure for Semantic web or Computer science, it is difficult to tell whether this figure is specific to Ontology alignment (there are fields with more citations and fields with fewer, e.g. Mathematics, and putting them all together means that some are above average and others below).

- Section 4.3 is about the disciplines relevant to Ontology matching. Given the broad categories used here (the level 2 categories of Fig. 7), it is unclear whether this characterisation is useful for anything.

- Sections 5.1-5.2 about collaboration are those that could be thought of as providing some findings. Figure 8 is striking in showing two clearly identified clusters. The authors do not provide much explanation for this phenomenon; they suggest that maybe the researchers from one cluster are not curious about the other. However, these graphs being computed on collaboration, a symmetric measure, this explanation should, at the very least, be applied symmetrically. It is difficult from this data alone to provide an explanation, but many could be put forth. In particular, the fact that one of these clusters is mononational and the other international suggests that the explanation comes from some national elements (but see the discussion below). These may be linguistic factors, the collaboration approach, the work approach (many coauthors, many authors of only one paper, e.g. undergraduate students: this can be studied bibliometrically), or publication policies (a strong incentive to publish many papers in Scopus-indexed journals, hence fewer in the Ontology matching workshop). It is possible that several of these factors play a role together (see the sketch after this list for how such collaboration clusters are typically extracted). Finally, again in the absence of comparison with other fields, it is difficult to assess whether this is specific to the Ontology matching field.

- This judgement made on collaborations is also made on citations (though to a far lesser extent). That could have helped shed light on this matter, because citation is not symmetric. Unfortunately, in 6.2, citations are only reported as numbers assigned to papers and countries, so they are not helpful. This is too bad because, if a community has fewer citations per paper than another, it is difficult to explain this by discrimination if both communities have the same citation pattern (they both cite the same community less). At least, it would have been worth ruling out this possibility.

- As I understand from the text, six communities were extracted and only two are shown in Figure 9. If the number 6 was not given to the algorithm and is significant, then all six should be shown.

- 5.1 Author collaboration: the conclusions drawn on page 14 are very general and not specific to Ontology matching.

- p16 "the research outputs with at least one Chinese author have not gained enough attention": it is unclear on what grounds this statement is based. The same goes for "they do not get enough attention, possibly the attention they deserve".

- Again, 5.3-5.4 would deserve to be compared with the broader Semantic web/Computer science fields.

- The authors "encourage the organisers of OAEI" to have benchmarks on the identified topics. Unfortunately, these topics are not application domains, like biomedicine, but application techniques, like "Semantic Web Services, agent-based modelling, knowledge-graphs, and business processes" (cited directly from the paper). This means that there are not many ontologies to match there... and some of them have already been considered, e.g. Process matching.
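 
(Regarding the two-cluster discussion in the 5.1-5.2 comment above, a minimal sketch of how such collaboration communities are typically extracted, assuming networkx and greedy modularity maximization; the co-authorship edge list is an illustrative placeholder, not the paper's data. The undirected edges make the symmetry point concrete.)

    # A minimal sketch of community detection on a co-authorship graph,
    # relating to the 5.1-5.2 comment above. Assumes networkx; the edge
    # list is an illustrative placeholder, not the paper's data.
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Co-authorship edges are undirected: collaboration is a symmetric
    # relation, so any explanation of a split between clusters should
    # apply to both sides symmetrically.
    edges = [
        ("A", "B"), ("B", "C"), ("A", "C"),  # a densely connected group
        ("X", "Y"), ("Y", "Z"), ("X", "Z"),  # another group
        ("C", "X"),                          # a single weak bridge
    ]
    graph = nx.Graph(edges)

    # The number of communities is determined by modularity maximization,
    # not fixed in advance (cf. the remark on the six communities above).
    for i, members in enumerate(greedy_modularity_communities(graph)):
        print(f"Community {i}: {sorted(members)}")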

* Data presentation:

- Figure 2 displays data as tag clouds. The precise interpretation of tag clouds is quite unclear to me, so if there is one, it should be provided. In general, tag clouds do not seem to be a proper scientific visualisation instrument (no unit, no scale, aesthetic arrangement).

- Figure 8 is interesting, but it would also be interesting to understand the space, i.e. the principles governing entity placement. The same applies to Figure 17.

- The assignment of authors to countries is not specified. One of the most collaborative "Chinese scientists" is "S. Wang". I assume that this is Shenghui Wang. Shenghui published her work while at VU Amsterdam. It is unclear that she should count as Chinese (in which case, Pavel Shvaiko is from Belarus, Ernesto Jimenez-Ruiz from Spain, Cassia Trojahn from Brazil, etc.). In this sense, she is atypical (less and less so as time passes), and this seems to indicate that the two clusters are based on involvement (or not) in an international collaboration network, rather than on nationality. This, in turn, may have other causes (see above).

- Figure 9 is unreadable in black and white.

* Form:

- The title of the paper is quite strange: ontology alignment is not really revisited, and there is no real "narrative" provided here. Moreover, it is not really the purpose of scientific journals to publish "narratives", but findings.

- The introduction uses flowery language that is also a bit remote from the facts. For instance, "the heterogenity problem was quite epidemic" is not particularly clear.

* Details:

- p4: '"ontology alignment", which is interchangeably referred to as "ontology matching" or "ontology mapping"': it is not clear by whom.

- It may have been interesting to look for outliers in this data set. In particular, books and review papers traditionally attract many citations: do these figures look the same if such items are removed from the corpus? I do not know whether this is accepted practice in bibliometrics, and it is less important than comparing with external fields.

- p22: there seems to be a missing reference.

- In some instances, such as reference 2 or Table 2, there are problems with characters (likely encoding issues).