Review Comment:
This manuscript focuses on the dual tasks of image annotation and image retrieval, which are typically addressed using multimodal embeddings. The manuscript is an extension of previous work published at the SemDeep2 workshop at IWCS 2017 by Vilalta et al. (2017). In that work, the authors replaced a typical image representation (the last layer of a CNN) with the full-network embedding (FNE) proposed by Garcia-Gasulla et al. (2017), a representation that offers a richer visual embedding space by deriving features from multiple layers while also applying discretization. Whereas the previous work builds on the approach of Kiros et al. (2014), this work uses the improved version of Vendrov et al. (2015) and shows consistently improved performance across the three datasets on which the methods are evaluated. In addition, more exhaustive experiments are conducted.
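For readers less familiar with the FNE representation, the following minimal sketch conveys the idea (this is my illustration, not the authors' code: the per-vector standardization and the threshold values are simplifying assumptions, as the published method derives its statistics and thresholds from a dataset):

```python
import numpy as np

def full_network_embedding(layer_activations, t_low=-0.25, t_high=0.15):
    """FNE-style representation: pool features from every layer,
    standardize, and discretize each value to {-1, 0, 1}.

    `layer_activations` holds one image's per-layer outputs:
    conv layers as (H, W, C) arrays, fully-connected layers as (C,) arrays.
    """
    features = []
    for act in layer_activations:
        if act.ndim == 3:                # conv layer: spatial average pooling
            act = act.mean(axis=(0, 1))
        features.append(act)
    emb = np.concatenate(features)       # one long vector spanning all layers

    # Standardize each feature; in the published method the mean/std are
    # computed per feature over a dataset, not over a single vector.
    emb = (emb - emb.mean()) / (emb.std() + 1e-8)

    # Discretize: strong activations become +1/-1, mid-range values 0.
    disc = np.zeros_like(emb)
    disc[emb > t_high] = 1.0
    disc[emb < t_low] = -1.0
    return disc
```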
Strengths:
* The evaluation is extensive and fair: the model is compared both with the state-of-the-art methods for each task and with the original model by Vendrov et al. (2015) without the FNE component. The authors have controlled various factors such as hyper-parameter values, dataset splits, and training time. The models are evaluated on three datasets.
* The implementation details are thorough and informative.
Major Concerns:
* This work is only marginally related to the topic of the special issue: while it is related to deep learning, it is not related to the Semantic Web. The introduction suggests that the image annotation and retrieval tasks may help semantic image indexing, but this is essentially the only connection to the Semantic Web.
* The contribution of this work is rather small: as in the work of Vilalta et al. (2017), the FNE-based approach is superior to the original CNN-based approach, but the proposed methods are consistently outperformed by the state-of-the-art methods on both tasks and on all datasets. The conclusion section suggests that incorporating the FNE representation into these SOTA models may improve their performance as well, but this is left for future work. The additions of this manuscript over the work of Vilalta et al. (2017) are sufficient, but not much more.
* The manuscript needs proof-reading. Specifically:
- The related work section reads like a long list of approach names that are never elaborated or explained. It is quite confusing; it would be better to elaborate on the important approaches and to omit the others or describe them more generally.
- Section 3 needs better structure; e.g., start the section by outlining what each subsection will discuss.
- The FNE approach should be described in the related work section, as it is previous work rather than a contribution of the current manuscript. In addition, it is referenced several times before it is introduced in Section 3.1.
- Many grammatical errors and typos (see below).
Minor comments / needs clarification:
* The VGG and UVS models are mentioned several times but never described.
* It seems that the evaluation metrics only capture recall (see the R@K sketch at the end of this list). Isn't there a standard evaluation metric for these tasks that also captures precision?
* Section 3.1: what is a pre-trained CNN? What are the training objective and the data used?
* Section 3.5: define curriculum learning.
* Section 4.2: refer to the equations in the model descriptions.
* Section 4.3: which DL framework did you use? Will the code be made available?
* Table 1: which values were tested for each hyper-parameter?
* Section 5: "results are now very close to the ones obtained by other methods" - this is simply not true.
* When you say something is "significant", did you perform significance tests? If so, please elaborate; otherwise, either perform them or change the wording to "substantial" or another modifier.
* Section 5: the observation that dataset size correlates with performance makes sense, but the comparison is not entirely valid because these are different datasets. A better experiment would vary the training set size while keeping the same test set.
* References:
- Change references from Ryan Kiros to Jamie Ryan Kiros.
- Update the arXiv references for papers that have since been published in conferences or journals; e.g., [9] was published at ICLR 2016.
- Reference [39] seems unrelated and is never cited in the text.
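To make the metrics comment above concrete: R@K, as commonly reported for these tasks, is the fraction of queries whose ground-truth match appears among the top K retrieved items, i.e., a pure recall measure. A minimal sketch (my illustration, assuming a square similarity matrix with the single ground-truth match on the diagonal):

```python
import numpy as np

def recall_at_k(sim, k):
    """R@K for cross-modal retrieval: sim[i, j] is the similarity of
    query i to candidate j; the ground truth for query i is item i.
    """
    ranks = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                       # candidates by descending similarity
        ranks.append(int(np.where(order == i)[0][0]))  # rank of the ground truth
    return float(np.mean(np.array(ranks) < k))
```

A precision-oriented complement (e.g., mean average precision when each image has several relevant captions) would make the evaluation more complete.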
Typos and grammar errors:
* Section 2:
- ...of the GRUs and the last fully-connected *layer* of the CNN
- examples focus only *on* the hardest of them
- A different group of methods is based *on* the Canonical...
- a neural architecture that project*s* image*s* and sentence*s*
- best results on *the* Flickr30K dataset
- DANs exploit two... (remove s)
* Section 3.1:
- "value of the each feature" - remove "the"
* Section 3.2:
- all the words in *the* train
- with the GRUs and the word embedding*s*
- the pipeline training procedure consist*s of* the optimization
- Equation 1: the i and c in the sum subscript should be lowercase
- dot product of the vectors as *a* similarity *measure*
* Section 4.1:
- is an extension of Flickr8K *which* includes it
* Section 4.2:
- We will experiment => We investigate
* Section 4.3.2:
- that higher *dimensionality helps obtain*
* Section 5:
- The second part summarize*s*
- Each of these blocks contain*s*
- Tables 2-3: Flickr8*K* and Flickr30*K*
- "The modifications are We can see" - ungrammatical
- as it introduces => as it incorporates
- this is *e*specially significant
- experimenting on *the* MSCOCO dataset
- SOTA contributions => SOTA methods
- explicit => specify
- It*'*s important to keep in mind
- Weighting the results => Comparing the results
- we see that *their* relative performance
- In *the* experiments on MSCOCO
- add*s* very little improvement
* Section 6:
- exactly same model => exact same model
- these experiments show that the instability *of the* training
* Section 7:
- exactly same model => exact same model
- alleviate*s* this *problem*