Studying the Impact of the Full-Network Embedding on Multimodal Pipelines

Tracking #: 1859-3072

This paper is currently under review
Armand Vilalta
Dario Garcia-Gasulla
Ferran Parés
Eduard Ayguade
Jesus Labarta
Ulises Cortés

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper

Abstract: 
The current state of the art for image annotation and image retrieval tasks is obtained through deep neural networks, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding in this setting, replacing the original image representation in four competitive multimodal embedding generation schemes. Unlike the one-layer image embeddings typically used by most approaches, the Full-Network embedding provides a multi-scale representation of images, which results in richer characterizations. To measure the influence of the Full-Network embedding, we evaluate its performance on three different datasets, and compare the results with both the original embedding schemes and the rest of the state of the art. Results for image annotation and image retrieval tasks indicate that the Full-Network embedding is consistently superior to the one-layer embedding. These results motivate the integration of the Full-Network embedding into any multimodal embedding generation scheme, which is feasible thanks to the flexibility of the approach.
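The multi-scale representation mentioned in the abstract pools activations from every layer of the network rather than a single one. A minimal NumPy sketch of that general idea follows; the pooling scheme, the feature-wise standardization, and the discretization thresholds here are illustrative assumptions for exposition, not the exact procedure or values used in the paper.

```python
import numpy as np

def full_network_embedding(layer_activations, pos_thr=0.15, neg_thr=-0.25):
    """Build a multi-layer image embedding in the spirit of the
    Full-Network embedding: spatially average-pool each layer's
    activations, concatenate them, standardize each feature across
    the image set, and discretize to {-1, 0, 1}.
    Thresholds are illustrative assumptions."""
    pooled = []
    for act in layer_activations:
        # conv layers: (n_images, h, w, channels); fc layers: (n_images, features)
        if act.ndim == 4:
            act = act.mean(axis=(1, 2))  # spatial average pooling
        pooled.append(act)
    emb = np.concatenate(pooled, axis=1)  # (n_images, total_features)
    # feature-wise standardization across the image set
    emb = (emb - emb.mean(axis=0)) / (emb.std(axis=0) + 1e-8)
    # discretize: -1 below neg_thr, +1 above pos_thr, 0 otherwise
    out = np.zeros_like(emb)
    out[emb > pos_thr] = 1.0
    out[emb < neg_thr] = -1.0
    return out

# toy activations from two hypothetical layers for 8 images
rng = np.random.default_rng(0)
acts = [rng.normal(size=(8, 7, 7, 16)), rng.normal(size=(8, 32))]
fne = full_network_embedding(acts)
print(fne.shape)  # (8, 48)
```

The resulting discrete, concatenated vector can then stand in for the usual one-layer image representation when building a multimodal embedding.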