Multimodal Named Entity Recognition for Short Social Media Posts
Seungwhan Moon, Leonardo Neves, Vitor Carvalho

TL;DR
This paper introduces multimodal named entity recognition (MNER), a new task for social media posts that pair short text with images. It contributes a dataset, SnapCaptions, and a model that uses visual context to improve entity recognition in short, noisy posts.
Contribution
The paper presents SnapCaptions, the first dataset for multimodal NER on social media, and a model that extends Bi-LSTM word/character NER with a deep image network and a modality-attention module to integrate visual context.
Findings
The multimodal model significantly outperforms text-only NER models.
Visual context improves entity recognition accuracy in noisy social media posts.
The modality-attention mechanism learns to attenuate irrelevant modalities and amplify the most informative ones.
Abstract
We introduce a new task called Multimodal Named Entity Recognition (MNER) for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often come in inconsistent or incomplete syntax and lexical notations with very limited surrounding textual contexts, bringing significant challenges for NER. To this end, we create a new dataset for MNER called SnapCaptions (Snapchat image-caption pairs submitted to public and crowd-sourced stories with fully annotated named entities). We then build upon the state-of-the-art Bi-LSTM word/character based NER models with 1) a deep image network which incorporates relevant visual context to augment textual information, and 2) a generic modality-attention module which learns to attenuate irrelevant modalities while amplifying the most informative ones to extract contexts…
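The modality-attention idea in the abstract can be illustrated with a minimal sketch. It assumes each token has one fixed-size vector per modality (word, character, image), scores each modality with a dot product, and fuses the vectors using softmax weights. The function names, the scoring rule, and the toy values are hypothetical simplifications for illustration, not the authors' exact formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention(embeddings, score_weights, bias=0.0):
    """Fuse per-modality vectors into one vector (illustrative only).

    embeddings:    dict of modality name -> vector, all the same length
    score_weights: dict of modality name -> scoring vector (assumed to be
                   learned parameters; supplied by hand here)
    Returns (attention weight per modality, fused vector).
    """
    names = list(embeddings)
    # One scalar relevance score per modality.
    scores = [
        sum(w * x for w, x in zip(score_weights[n], embeddings[n])) + bias
        for n in names
    ]
    # Softmax turns the scores into weights that sum to 1, so a
    # low-scoring modality is attenuated and a high-scoring one amplified.
    alphas = softmax(scores)
    dim = len(embeddings[names[0]])
    fused = [
        sum(a * embeddings[n][i] for a, n in zip(alphas, names))
        for i in range(dim)
    ]
    return dict(zip(names, alphas)), fused


# Toy example: the image modality receives the highest score,
# so it dominates the fused representation.
token = {"word": [1.0, 0.0], "char": [0.0, 1.0], "image": [1.0, 1.0]}
params = {"word": [0.0, 0.0], "char": [0.0, 0.0], "image": [5.0, 5.0]}
weights, fused_vector = modality_attention(token, params)
```

In a full model the fused vector for each token would feed the Bi-LSTM tagger, and the scoring parameters would be trained jointly with the rest of the network.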
