Review Comment:
This article proposes a multimodal message passing network that learns embeddings from knowledge graphs by considering multimodal node features instead of relying only on relational information. The five modalities considered are numbers, text, dates, images, and geometries, which are represented in a single embedding space together with the relational information. By including all these types of information in embedding training, the approach seeks to increase the amount of information available for individual entities. It is evaluated on a synthetically generated dataset as well as five publicly available multimodal datasets, on node classification and link prediction.
Overall evaluation:
This paper presents an interesting study on multimodal knowledge graph embeddings, considering several datasets and two tasks. Both the originality of the idea and the quality of the writing are high, even though the structure could be improved, particularly the presentation of tables and results. While it is highly appreciated that the significance of results is supported by proper statistical significance tests, the results show such radical differences regarding the impact of modalities on the analyzed tasks (node classification and link prediction) that a detailed analysis of all contributing factors would be required. As it stands, little discussion or detailed analysis of this point is offered; instead, general assumptions with little or no evidence are given. These differences could be due to a large number of factors, including the chosen architecture, the encoding strategy, the optimization procedure, a potentially biased synthetic dataset, etc. Additional experiments, or at least a detailed manual analysis of the results, would be needed to allow any conclusions about the different modalities and to ensure that the differences are not caused by other factors. Comparisons to baseline models would generally also help with this.
Baseline:
In order to truly demonstrate the merit of this idea, it would be necessary to show that multimodal information indeed provides an improvement in performance over considering purely structural information. Nevertheless, the authors argue against such a baseline but still claim a "better performance", which cannot actually be supported by the results. In other words, I am not convinced by the argumentation that the model should explicitly not be compared against SOTA or other models. In my view, better performance can only be established against some reasonable baseline, which should ideally correspond to state-of-the-art approaches. Such a baseline also has the benefit that differences across datasets, if observed across models, can be clearly attributed to the datasets themselves.
Architectural choices:
- The statement that more than two R-GCN layers do not improve performance could benefit from references where this has been shown.
- I am not entirely sure how to read the embedding matrix in Figure 2. The text states that the rows of the embedding matrix represent the multimodal embeddings; however, this does not seem to be the case in the figure. Should it be read as transposed?
- Do I understand correctly that numerical information by definition always results in an embedding of dimensionality 1, since the normalized values are treated as embeddings? How does this compare to other approaches, e.g. [11], where a vector of the same dimensionality as the entity embedding is fused with the latter? (A small sketch contrasting the two options follows after this list.)
- In fact, I am wondering how this difference in dimensionalities might impact the performance of individual modalities. Have any experiments in this direction been performed?
- If I understood this approach correctly, the model is trained on an input matrix in which the input vectors of all modalities are concatenated for each entity. Why was this decision taken over training the embeddings individually first and then fusing them? Also, how does this approach allow for a fair comparison, given that pretrained embeddings were used for visual information but not for any other modality?
- With respect to textual information, I am wondering why the standard BPE encoding method was not considered. It has been successfully applied in many multilingual large language models; instead, a CNN with character-level encoding is used.
- Why does it make sense for this approach to train from scratch rather than reuse the amply available large pretrained networks for textual information? For instance, XLM-R has shown remarkable performance across languages, domains, and tasks. This route seems to have been taken for visual information, but not for text. The use of language models is briefly addressed in the conclusion, but merely as future work. (A sketch of how a pretrained encoder could be plugged in follows after this list.)
- Is the connection between a literal and an entity in the graph treated as a relation with its own relation embedding, or does relation embedding training only consider links between entities?
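
To make the dimensionality concern above concrete, here is a minimal, purely illustrative sketch (not the authors' code) contrasting the scalar encoding of numeric literals, as I understand it from the paper, with the projection-and-fusion strategy of approaches such as [11]; the variable names, the projection layer, and the fusion by summation are my own assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: two ways to feed a numeric literal into an entity
# representation of dimensionality d.
d = 128
x_num = torch.tensor([[0.73]])   # normalized numeric literal, shape (1, 1)

# (a) As I read the paper: the normalized scalar itself is the "embedding",
#     so the numeric modality contributes only 1 of the concatenated dims.
feat_concat_style = x_num        # dimensionality 1

# (b) As in e.g. [11]: project the literal to the entity dimensionality and
#     fuse it with the structural entity embedding.
proj = nn.Linear(1, d)           # learned projection for the numeric modality
e_struct = torch.randn(1, d)     # structural embedding of the entity
e_fused = e_struct + torch.tanh(proj(x_num))   # one possible fusion (sum)

print(feat_concat_style.shape, e_fused.shape)  # (1, 1) vs. (1, 128)
```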
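Relatedly, for the question about pretrained text encoders: a minimal sketch of how textual literals could be embedded with a frozen XLM-R model via the Hugging Face transformers API, assuming the literals are available as plain strings. This only illustrates the suggested alternative and is not a claim about the authors' pipeline; the mean pooling and the choice of xlm-roberta-base are my own assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative only: embed text literals with a frozen pretrained XLM-R encoder
# instead of a character-level CNN trained from scratch.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
encoder.eval()

literals = ["Amsterdam", "A short textual abstract describing the entity."]
batch = tokenizer(literals, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state          # (batch, seq_len, 768)

# Mean-pool over non-padding tokens to get one fixed-size vector per literal.
mask = batch["attention_mask"].unsqueeze(-1).float()
text_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, 768)
```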
Experiments:
- How was the number of training epochs determined? Was it chosen arbitrarily, or is there a specific rationale behind choosing 400 and 1000 epochs, respectively? How does this setting ensure well-optimized training? Since this is even brought up in the discussion as a factor that might influence the results, it would be beneficial to include a rationale for this decision or to determine the number of epochs with standard methods, e.g. early stopping on a validation set (see the sketch after this list).
- It would also be interesting to report the validation performance, as this is becoming increasingly common and generally offers insights into the training procedure and model performance, as well as potentially further information on the differences between datasets.
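
As an illustration of the early-stopping suggestion above, here is a minimal sketch in which the number of epochs is determined by validation performance rather than fixed in advance. `train_epoch` and `evaluate` are placeholders for the authors' existing training and evaluation code; the patience value and the metric are arbitrary assumptions.

```python
def train_with_early_stopping(model, train_epoch, evaluate, max_epochs=1000, patience=20):
    """Run up to max_epochs, stopping once validation performance stops improving.

    train_epoch(model) and evaluate(model) are placeholders for existing
    training and validation code; this is only an illustrative sketch.
    """
    best_score, best_epoch, bad_epochs = float("-inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(model)
        score = evaluate(model)          # e.g. accuracy or MRR on the validation split
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # stop once validation stops improving
                break
    return best_epoch, best_score
```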
Results:
- The way the results are currently presented requires considerable effort in terms of scrolling back and forth to find the right table. This is partly because the tables are not positioned where they are referenced in the text and are mostly spread across whole pages, with the orientation also changing in the Appendix from one page to the next (and the appendix is merged with the bibliography). I propose merging the split/merged results into a single table to reduce the number of tables, and including all tables in the text, similar to Tables 10 and 11. It would also be beneficial to highlight the best results in each column or for each modality.
- The results suggest that the impact of specific modalities varies greatly across datasets and tasks. It almost seems as if the choice of (not) considering them should be part of the training and optimization procedure, leaving this choice to the network (a sketch of one such mechanism follows after this list). However, given the way the information is currently encoded, with concatenated embeddings of strongly varying dimensionality (some pretrained, some not), it is doubtful that the optimization procedure is actually given this option. In the discussion, the difference between synthetic and real-world datasets in terms of the negative impact of individual modalities is clearly addressed and attributed to the type of network chosen. However, no evidence for this assumption is presented, and it is questionable whether this difference might not instead stem from a substantial bias in the synthetic dataset or from the encoding strategy. Further investigation of this point would be extremely interesting and could have been supported by including SOTA baselines.
- The differences between modalities can also only be reliably observed once the model has been tuned in terms of hyperparameter settings for a specific task and dataset. As it stands, the paper provides more information on the differences between datasets and other factors than between modalities, since the hyperparameters and training parameters (especially the number of epochs) are relatively unclear and do not warrant the conclusions presented.
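
To illustrate what "leaving the choice to the network" could look like, here is a purely hypothetical sketch with one learned scalar gate per modality applied before concatenation, so the optimizer can down-weight unhelpful modalities. This is not the authors' architecture; the module name, the gating mechanism, and the dimensionalities are my own assumptions.

```python
import torch
import torch.nn as nn

class GatedModalityConcat(nn.Module):
    """Illustrative sketch: learn a scalar gate per modality so the optimizer
    can down-weight unhelpful modalities instead of the choice being fixed by
    which features are concatenated."""

    def __init__(self, num_modalities):
        super().__init__()
        self.gates = nn.Parameter(torch.zeros(num_modalities))  # one logit per modality

    def forward(self, feats):
        # feats: list of per-modality feature tensors, each (num_entities, d_m)
        weights = torch.sigmoid(self.gates)                      # gates in (0, 1)
        return torch.cat([w * f for w, f in zip(weights, feats)], dim=-1)

# Example: numeric (dim 1), text (dim 128), image (dim 1000) features for 5 entities
feats = [torch.randn(5, 1), torch.randn(5, 128), torch.randn(5, 1000)]
fused = GatedModalityConcat(num_modalities=3)(feats)             # shape (5, 1129)
```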
Minor comments in order of appearance:
real-worlds datasets => real-world datasets
real-worlds knowledge graphs => real-world knowledge graphs
but rather than have completely => having
example of knowledge graph => a knowledge graph
a non-linearity activation function => a non-linear activation function
dataset—AIFB+—contained => dataset AIFB+ contained
Formatting:
- Tables and information in the appendix and/or that did not fit on the page should not be placed in the middle of the bibliography.
- Figure 2 is not anchored in the text, i.e., never referenced.
- Table 3 is also not mentioned in the text.