TY - GEN
T1 - Multi-label triplet embeddings for image annotation from user-generated tags
AU - Seymour, Zachary
AU - Zhang, Zhongfei Mark
N1 - Publisher Copyright: © 2018 ACM.
PY - 2018/6/5
Y1 - 2018/6/5
AB - This work studies the representational embedding of images and their corresponding annotations - in the form of tag metadata - such that, given a piece of the raw data in one modality, the corresponding semantic description can be retrieved in terms of the raw data in another. While convolutional neural networks (CNNs) have been widely and successfully applied in this domain with regard to detecting semantically simple scenes or categories (even though many such objects may be simultaneously present in an image), this work approaches the task of dealing with image annotations in the context of noisy, user-generated, and semantically complex multi-labels, widely available from social media sites. In this case, the labels for an image are diverse, noisy, and often not specifically related to an object, but rather descriptive or user-specific. Furthermore, the existing deep image annotation literature using this type of data typically utilizes the so-called CNN-RNN framework, combining convolutional and recurrent neural networks. We offer a discussion of why RNNs may not be the best choice in this case, even though they have been shown to perform well on similar captioning tasks. Our model exploits the latent image-text space through the use of a triplet loss framework to learn a joint embedding space for the images and their tags, in the presence of multiple, potentially positive exemplar classes. We present state-of-the-art results demonstrating the representational properties of these embeddings on several image annotation datasets to show the promise of this approach.
KW - Convolutional neural networks
KW - Image annotation
KW - Triplet embeddings
UR - https://www.scopus.com/pages/publications/85053910751
U2 - 10.1145/3206025.3206061
DO - 10.1145/3206025.3206061
M3 - Conference contribution
SN - 9781450350464
T3 - ICMR 2018 - Proceedings of the 2018 ACM International Conference on Multimedia Retrieval
SP - 249
EP - 256
BT - ICMR 2018 - Proceedings of the 2018 ACM International Conference on Multimedia Retrieval
PB - Association for Computing Machinery, Inc
T2 - 8th ACM International Conference on Multimedia Retrieval, ICMR 2018
Y2 - 11 June 2018 through 14 June 2018
ER -