Modeling Coreference Relations in Visual Dialog
Mingxiao Li, Marie-Francine Moens

TL;DR
This paper introduces a novel approach to improve coreference resolution in visual dialog by integrating linguistically inspired soft constraints into a deep transformer model, achieving state-of-the-art results without pretraining.
Contribution
It proposes two unsupervised soft constraints based on linguistic and discourse features to enhance coreference understanding in visual dialog models.
Findings
Achieved new state-of-the-art performance on the VisDial v1.0 dataset.
Improved coreference resolution without pretraining on additional datasets.
Demonstrated effectiveness through qualitative analysis.
Abstract
Visual dialog is a vision-language task where an agent needs to answer a series of questions grounded in an image, based on an understanding of the dialog history and the image. The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering. Most previous works have focused on learning better multi-modal representations or on exploring different ways of fusing visual and language features, while the coreferences in the dialog are mainly ignored. In this paper, based on linguistic knowledge and discourse features of human dialog, we propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way. Experimental results on the VisDial v1.0 dataset show that our model, which integrates two novel and linguistically inspired soft constraints in a deep transformer neural…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
