Review Comment:
This manuscript focuses on multimodal image and sentence embeddings. It is an extension of previous work by Vilalta et al. (2017), in which the authors replaced a typical image representation (the last layer of a CNN) with the full-network embeddings (FNE) suggested by Garcia-Gasulla et al. (2017), a representation which offers a richer visual embedding space by deriving features from multiple layers while also applying discretization. These embeddings are evaluated on the parallel tasks of image annotation and retrieval.
The contribution of the current work is in (1) integrating the FNE into the original pipeline of Kiros et al. (2014), as well as into two versions of order embeddings, showing consistently improved performance over a standard image representation across the three datasets on which the methods are evaluated; (2) performing extensive experiments to study the causes of the performance gains when the methods are trained with the best hyper-parameters; and (3) using curriculum learning to increase the stability and performance of these methods.
The evaluation is extensive and fair: the model is compared with the state-of-the-art methods for each task, as well as with the original model by Vendrov et al. (2015) without the FNE component. The proposed extension on its own performs worse than the state-of-the-art, but it surpasses the performance of the same method with a typical image representation. Moreover, the authors study the sources of the performance gains in the different methods and control for various factors such as hyper-parameter values, dataset splits, and training time.
Detailed Comments:
* Connection to semantic web: Thanks for addressing my concern about the connection to semantic web. The introduction still reads as a bit vague, and I think it would benefit from a motivating example - e.g., a concept which is vague (unclear how to represent in the “semantic web”), how word embeddings make it representable, and how image embeddings improve its representation further.
* Performance: There is absolutely no problem with publishing a paper about a method that achieves less than state-of-the-art performance. I believe the comment (from all of the reviewers) was because the previous version didn’t present the goal of the paper clearly: "testing of the proposed methods, showing clearly the real impact of its main contributions" - the current introduction is clearer about this goal. [no changes required]
* Introduction: the first 3 contributions could be combined into one (“Integrating the FNE into x, y, and z”) - the current phrasing is unnecessarily repetitive.
* Table 1: which values were tested for each hyper-parameter? You need not mention all the results with all the different hyper-parameters, but if one of the main goals of the paper is to study the sources of empirical gains, it is imperative to list the values tested (can be placed in an appendix).
* When referring to experimental results, I’m always careful not to use “significant” when I didn’t run significance tests; I choose “substantial” instead. Although they are synonyms, the word “significant” often implies achieving a certain p-value in a significance test, so it is a bit misleading.
* References: Regarding Jamie Ryan Kiros, this name is written in her PhD thesis: https://tspace.library.utoronto.ca/handle/1807/89798, and on her Google Scholar profile: https://scholar.google.ca/citations?hl=en&user=b_MXwoAAAAAJ. I believe she changed her name from Ryan to Jamie but keeps both names in citations of papers published by the name Ryan. It’s good that you emailed her and if she answers, do as she tells you, of course.