Who are you referring to? Coreference resolution in image narrations
Arushi Goel, Basura Fernando, Frank Keller, Hakan Bilen

TL;DR
This paper introduces a new dataset and a weakly supervised model for coreference resolution in long image narrations, demonstrating improved performance and enhanced scene grounding capabilities.
Contribution
It presents a novel dataset with annotated coreference chains in image narrations and a weak supervision method leveraging linguistic priors for coreference resolution.
Findings
Model outperforms strong baselines in coreference resolution
Coreference resolution improves image grounding accuracy
New dataset enables better training and evaluation
Abstract
Coreference resolution aims to identify words and phrases which refer to the same entity in a text, a core task in natural language processing. In this paper, we extend this task to resolving coreferences in long-form narrations of visual scenes. First, we introduce a new dataset with annotated coreference chains and their bounding boxes, as most existing image-text datasets only contain short sentences without coreferring expressions or labeled chains. We propose a new technique that learns to identify coreference chains using weak supervision, only from image-text pairs and a regularization using prior linguistic knowledge. Our model yields large performance gains over several strong baselines in resolving coreferences. We also show that coreference resolution helps improve the grounding of narratives in images.
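To make the notion of a coreference chain concrete, the following minimal sketch groups mentions into chains from pairwise "refers to the same entity" links using union-find. It is only an illustration of the task's output format, not the paper's model: the function name `build_chains`, the example mentions, and the links are all hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def build_chains(mentions, links):
    """Group mentions into coreference chains given pairwise coreference links.

    mentions: list of mention strings (hypothetical example data).
    links: list of (i, j) index pairs asserting that mentions i and j corefer.
    """
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Attach the larger root to the smaller one so roots are stable.
            parent[max(ra, rb)] = min(ra, rb)

    groups = defaultdict(list)
    for i, mention in enumerate(mentions):
        groups[find(i)].append(mention)
    return list(groups.values())


# Hypothetical narration: "A woman holds a red umbrella. She opens it near a dog."
mentions = ["a woman", "a red umbrella", "she", "it", "a dog"]
links = [(0, 2), (1, 3)]  # "she" -> "a woman", "it" -> "a red umbrella"
chains = build_chains(mentions, links)
print(chains)
```

In a dataset like the one described, each chain would additionally be associated with a bounding box in the image; that association is omitted here.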
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
