TL;DR
This paper introduces UniMM-UL, a multimodal model for visual dialog that combines answer discrimination and generation, employing unlikelihood training on negative answers to improve response quality and reduce dull outputs.
Contribution
The paper proposes a unified model extending ViLBERT with novel attention masks for answer generation and discrimination, and incorporates unlikelihood training to suppress incorrect answers.
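To make the unlikelihood idea concrete, here is a minimal sketch of how a likelihood term on a correct answer can be combined with an unlikelihood term on an incorrect one. It assumes the per-token probabilities have already been computed by some model. The function names, the `alpha` weight, and the `eps` clamp are illustrative assumptions and are not taken from the paper.

```python
import math

def likelihood_loss(token_probs, eps=1e-12):
    # Standard negative log-likelihood over the tokens of a correct answer.
    return -sum(math.log(max(p, eps)) for p in token_probs)

def unlikelihood_loss(token_probs, eps=1e-12):
    # Penalize probability mass on the tokens of an incorrect answer:
    # the loss grows as p approaches 1 and is zero when p is 0.
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs)

def combined_loss(pos_probs, neg_probs, alpha=1.0):
    # alpha is a hypothetical weight balancing the two terms.
    return likelihood_loss(pos_probs) + alpha * unlikelihood_loss(neg_probs)

# Probabilities the model assigns to each token of one correct
# and one incorrect answer (made-up values for illustration).
loss = combined_loss([0.7, 0.9], [0.6, 0.2])
```

The unlikelihood term only lowers the probability of tokens in negative answers; how negatives are selected and weighted in UniMM-UL is not specified in this summary.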
Findings
UniMM-UL achieves an NDCG of 69.23 on VisDial in the generative setting.
In the discriminative setting, it attains NDCG scores of 75.92 with a single model and 76.17 with an ensemble.
It outperforms prior models in the generative setting of the visual dialog benchmark.
Abstract
The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work performs the standard likelihood training for answer generation on the positive instances (involving correct answers). However, the likelihood objective often leads to frequent and dull outputs and fails to exploit the useful knowledge from negative instances (involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation by multi-task learning, our model extends ViLBERT from only supporting answer discrimination to holding both answer discrimination and answer generation seamlessly by different attention masks. Specifically, in order to make the original discriminative model compatible with answer generation,…
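The abstract is cut off before it describes the attention masks, so the following is only a sketch of one common way to let a single Transformer serve both discrimination and generation: a fully bidirectional mask for the former, and a sequence-to-sequence mask for the latter in which answer tokens see the context and earlier answer tokens only. This layout is an assumption borrowed from unified language-model pretraining in general, not a confirmed description of UniMM-UL, and the function names are hypothetical.

```python
def bidirectional_mask(n):
    # Every position may attend to every other position (1 = visible).
    return [[1] * n for _ in range(n)]

def seq2seq_mask(n_ctx, n_ans):
    # Context positions (image regions, dialog history, question) come first,
    # followed by the answer positions.
    n = n_ctx + n_ans
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < n_ctx:
                # The context is visible to all positions.
                mask[i][j] = 1
            elif i >= n_ctx and j <= i:
                # Answer tokens see themselves and earlier answer tokens only.
                mask[i][j] = 1
    return mask

for row in seq2seq_mask(2, 2):
    print(row)
```

With two context positions and two answer positions, the context rows cannot see the answer, and the first answer position cannot see the second, which is what allows left-to-right answer generation with the same weights used for discrimination.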
Taxonomy
Methods: Vision-and-Language BERT
