Learning Social Image Embedding with Deep Multimodal Attention Networks
Feiran Huang, Xiaoming Zhang, Zhoujun Li, Tao Mei, Yueying He, Zhonghua Zhao

TL;DR
This paper introduces DMAN, a deep multimodal attention network that jointly embeds social images by capturing both multimodal content relations and social network links, improving classification and search tasks.
Contribution
The paper proposes a novel deep model combining multimodal attention and Siamese-Triplet networks for social image embedding, integrating content and link information.
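To make the triplet idea concrete, here is a minimal sketch of a generic margin-based triplet loss, the kind of objective usually associated with Siamese-Triplet networks. The function names, the squared Euclidean distance, and the margin value are illustrative assumptions and are not taken from the paper; DMAN's actual loss may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge-style triplet loss (assumed form, not the paper's).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`; otherwise it grows linearly.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In a social image setting, the anchor and positive would plausibly be embeddings of two linked images and the negative an unlinked one, so that minimizing the loss pulls linked images together in the embedding space.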
Findings
DMAN outperforms state-of-the-art embeddings in classification.
DMAN significantly improves cross-modal search results.
The approach effectively captures fine-grained content relations.
Abstract
Learning embeddings of social media data with deep models has attracted extensive research interest and enabled many applications, such as link prediction, classification, and cross-modal search. However, for social images, which contain both link information and multimodal contents (e.g., text descriptions and visual content), simply employing an embedding learnt from the network structure or the data content alone results in a sub-optimal social image representation. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the fine-grained relations between image regions and textual words. To leverage the network…
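The region-word attention mentioned in the abstract can be illustrated with a small sketch. This is a generic dot-product attention in which one word vector attends over a set of image region vectors; the function names, the dot-product scoring, and the softmax weighting are assumptions chosen for illustration, not the formulation used in DMAN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(word, regions):
    """Attend from one word vector over image region vectors.

    Returns (weights, context): one attention weight per region, and
    the weighted sum of the region vectors. Dot-product scoring is an
    assumption made for this sketch.
    """
    weights = softmax([dot(word, region) for region in regions])
    dim = len(regions[0])
    context = [
        sum(w * region[i] for w, region in zip(weights, regions))
        for i in range(dim)
    ]
    return weights, context
```

Calling `attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])` gives the first region the larger weight, since it is the one aligned with the word vector.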
