Studying the Impact of the Full-Network Embedding on Multimodal Pipelines

Tracking #: 1962-3175

This paper is currently under review
Armand Vilalta
Dario Garcia-Gasulla
Ferran Parés
Eduard Ayguade
Jesus Labarta
E Ulises Moya-Sánchez
Ulises Cortés

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper

Abstract:
The current state of the art in image annotation and image retrieval is obtained with deep neural network multimodal pipelines, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding (FNE) in this setting, replacing the original image representation in four competitive multimodal embedding generation schemes. Unlike the one-layer image embeddings used by most approaches, the Full-Network embedding provides a multi-scale discrete representation of images, which results in richer characterisations. Extensive testing is performed on three different datasets, comparing the performance of the studied variants and the impact of the FNE on a level playing field, i.e., with the same data, source CNN models and hyper-parameter tuning. The results indicate that the Full-Network embedding is consistently superior to the one-layer embedding. Furthermore, its impact on performance exceeds the improvement stemming from the other variants studied. These results motivate the integration of the Full-Network embedding into any multimodal embedding generation scheme.
Full PDF Version: 
Under Review