Review Comment:
The authors propose an enhancement of the Microsoft Academic Knowledge Graph (MAKG) by tackling three tasks: author name disambiguation, field of study classification and tagging, and the creation of embeddings. The MAKG is a valuable and publicly available resource, and any extensions or improvements to it are welcome. All three tasks are reasonable, and a detailed overview of the related literature is provided. I see the main contribution of the paper in the application of methods for well-known problems to a large-scale knowledge graph. However, the tasks are mainly addressed by reusing existing approaches, so the contributions are rather technical and do not provide many novel insights into the tackled research questions. I therefore see several issues with the originality and novelty of the work and the applicability of its results, as well as further issues discussed below.
- Evaluation: For the author name disambiguation and the tagging, no comparisons to existing approaches are provided. This is particularly evident for the author name disambiguation, since that section starts with a comprehensive overview of approaches, of which only one is then considered further. An important dimension is the scalability of the approaches, but such analyses are only provided in Section 3.5.3 (rather vaguely) and Table 21 (only briefly).
- Approach: The approaches to all three tasks are rather simple (which is motivated by efficiency reasons) and mainly use existing tools and models, so there is not much novelty (page 22: "custom implementations can also be developed, though such tasks are not suitable to our paper").
- Author name disambiguation: The author name disambiguation follows a very simple approach that relies on many hyperparameters set in an ad-hoc manner. The authors justify this with the efficiency of the approach and the lack of training data; indeed, their potential training data consists of only 49 positive pairs. This leads to the following questions: Without blocking, are there still only 49 positive pairs? Given this low number of positive pairs, are the (important) analyses in Table 7 actually representative of the feature importance? Obviously, some features do not contribute at all (e.g., score_coauthors) or even lead to wrong results (e.g., score_titles). I would have expected a larger set of positive examples (e.g., by manually extending the positive pairs) and a proper learning of the hyperparameters (i.e., the feature weights); a sketch of such weight learning is given after this list of comments. I have some more questions/comments on the proposed approach: (i) The role of the postprocessing in Figure 6 is not entirely clear - is it to merge blocks when they exceed the size of 500? How is that done? (ii) Does the final clustering algorithm (not the one used for evaluation) consider the ORCID labels? (iii) Is the feature based on co-authors updated iteratively once disambiguation decisions have been made? (iv) I agree with the discussion in Section 3.6 that more sophisticated blocking techniques would be interesting and would also increase the originality of the approach.
- Tagging: At the end of Section 4.4.5, it says "statistics about the keywords are given in Sec. 6", although this is not the case. Also, there is no evaluation of the tagging, not even an anecdotal example of a single keyword.
- Embeddings: Usage scenarios for the generated embeddings are not provided (and I do not think that "our focus [is] on the MAKG and not on machine learning" (page 22) is a proper excuse when the goal is to make the MAKG accessible for machine learning). One example would be to use the embeddings as a feature for the author name disambiguation (obviously, computed on the original version of the MAKG); a sketch is given after this list of comments. Also, the evaluation setting is unclear, e.g., in Table 22: what does the average MR for "Author" refer to - the link prediction performance over all triples whose head is an author?
- Statistics: Section 6 provides several statistics, not all of which are directly relevant to the three tasks discussed in the paper. Also, some parts of the statistics are rather lengthy (e.g., the discussion of Figure 9), while other parts are missing (tags/keywords).
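To make the remark about learning rather than hand-setting the feature weights more concrete, here is a minimal sketch (not the authors' pipeline): given labeled candidate pairs (e.g., ORCID-confirmed matches as positives), a simple logistic regression yields both the weights and a cross-validated estimate of their reliability. The names score_coauthors and score_titles come from Table 7; the remaining feature names and all data are hypothetical placeholders.

```python
# Minimal sketch (not the authors' method): learning feature weights for
# pairwise author matching from labeled pairs instead of setting them ad hoc.
# "score_coauthors" and "score_titles" are feature names from Table 7 of the
# paper; the other names and all data below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

feature_names = ["score_coauthors", "score_titles", "score_affiliations", "score_years"]

rng = np.random.default_rng(0)
X = rng.random((200, len(feature_names)))  # one row per candidate pair within a block
y = rng.integers(0, 2, size=200)           # 1 = same author (e.g., ORCID-confirmed), 0 = different

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
print("F1 (5-fold CV):", cross_val_score(clf, X, y, cv=5, scoring="f1").mean())

clf.fit(X, y)
print("learned feature weights:", dict(zip(feature_names, clf.coef_[0])))
```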
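Similarly, the usage scenario suggested for the embeddings could look roughly as follows: the cosine similarity of two author embeddings serves as one more pairwise feature for the disambiguation. The author URIs and vectors below are hypothetical placeholders, not the MAKG's actual distribution format.

```python
# Hypothetical illustration of the suggested usage scenario: cosine similarity
# of two author embeddings as an additional pairwise feature for author name
# disambiguation. The author URIs and the random vectors are placeholders.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; higher means more similar embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
embeddings = {  # in practice: loaded from the published MAKG embedding files
    "https://makg.org/entity/1234567890": rng.standard_normal(100),
    "https://makg.org/entity/2345678901": rng.standard_normal(100),
}

score_embedding = cosine_similarity(
    embeddings["https://makg.org/entity/1234567890"],
    embeddings["https://makg.org/entity/2345678901"],
)
print("embedding-based similarity feature:", score_embedding)
```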
Dataset and website:
- The website's homepage is outdated: the statistics are from 2018, and only embeddings of papers are mentioned.
- Paper abstracts seem to be missing in the newest Zenodo version.
- Similar to the 2018 version, it would be great to also have sample files for the new version.
- As known ("We try to fix that in the near future"), the "knowledge graph exploration" has problems.
- The prefixes provided by the SPARQL endpoint are not synchronised with Figure 7 (/the schema on the website).
- I have shamelessly used the SPARQL endpoint to query for my own papers and found a) "duplicate" papers (arxiv and conference publication -> is there also a need for paper disambiguation?) and several paper titles such as "TPDL", ESWC" and "ESWC (Satellite Events)" (which seem to be added as a second title to papers).
- In the resource view ("https://makg.org/entity/..."), prefixes are simply "ns1", "ns2", ..., which is technically fine, but not ideal.
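For reference, the kind of duplicate check mentioned above can be reproduced with a query along the following lines. This is only a sketch: the endpoint URL and the use of dcterms:title for paper titles are my assumptions based on the public MAKG documentation, and the title string is a placeholder.

```python
# Sketch of the duplicate check described above: list all papers sharing a
# given (placeholder) title. The endpoint URL and the dcterms:title predicate
# are assumptions and may need to be adjusted to the actual MAKG schema.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://makg.org/sparql")  # assumed endpoint URL
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?paper ?title WHERE {
  ?paper dcterms:title ?title .
  FILTER (LCASE(STR(?title)) = "a placeholder paper title")
}
""")

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["paper"]["value"], "-", binding["title"]["value"])
```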
Code (https://github.com/lin-ao/enhancing_the_makg):
- The readme contains many lines such as "file 06 is used to generate data for table 27", which is not very helpful and seems to refer to tables in a master's thesis ("Code for the Master Thesis") rather than the reviewed paper. On the positive side, there is a proper execution script for the entity resolution (such a script is, however, missing for the classification).
Other and minor:
- Page 3, Line 3 left (P3L3L): "239 [...] publications"
- List. 1: Does this mean that an affiliation that has published 99% of its papers in biology, has many citations, and has a single machine learning paper is returned here? The citations of the respective machine learning papers would be more relevant.
- Figures 3, 4, 5 plus their "observations" (P4L45R): The observations can hardly (observation 1) or not at all (observation 3) be observed from the figures.
- Figure 5: Which plot belongs to which y-axis? Do I understand correctly that there were 11 million Jena queries in total, but that there were days on which no Jena queries were issued ("avg. # days with min. 1 request" < 1)?
- P5L30L: "For instance" gives the impression there are more examples. It would be interesting to see them.
- P6L5L: "e.g., architecture" is unclear at this point (it is explained later)
- P8L46L: Isn't it also necessary to have a mapping from each author in A to an author in A~?
- P9L7R: "d" is not defined (only "sim")
- Section 3.6: This should be better connected to the passage before.
- Table 11: What do the numbers mean?
- P17L20L: "we prepare [our] training set"
- P18L18R: That part is repeated.
- Table 19: Which numbers are bold?
- P2111R: "together;[ ]we"
- P2237R: "select[ed] number"
- Table 24: I have not done the maths in detail, but does it make sense to have 3 authors per paper and 3 papers per author, yet 11 co-authors per author? A back-of-the-envelope estimate (3 papers per author x 2 co-authors per paper) suggests only about 6 distinct co-authors on average, unless papers with very many authors skew the figure.
- P2543L: "likely misleading" -> this could be checked manually for this specific example.
- P2547R: How do outliers affect the median?
- Figure 11: There can't be a negative number of authors; the y-axis needs to be cut.
- Dataset names: There are different graphs involved (MAG, MAKG 2018, MAKG 2020 before enhancement and after; evaluation datasets, e.g., in Section 4.4.1) - clear names/abbreviations could help the reader.