Unified Multimodal Model with Unlikelihood Training for Visual Dialog

Zihao Wang, Junli Wang, and Changjun Jiang

arXiv:2211.13235 · cs.CL · November 28, 2022

TL;DR

This paper introduces UniMM-UL, a multimodal model for visual dialog that combines answer discrimination and generation, employing unlikelihood training on negative answers to improve response quality and reduce dull outputs.

Contribution

The paper proposes a unified model extending ViLBERT with novel attention masks for answer generation and discrimination, and incorporates unlikelihood training to suppress incorrect answers.
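To make the idea of suppressing incorrect answers concrete, the following is a minimal sketch of a token-level unlikelihood term next to the usual likelihood term. It assumes the common formulation in which a token that should not be produced is penalized with -log(1 - p). The function names, the toy logits, and the weight `alpha` in the usage note are illustrative assumptions, not taken from the paper or its repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def likelihood_loss(step_logits, target_ids):
    """Mean -log p(token) over the tokens of a correct (positive) answer."""
    losses = [-math.log(softmax(logits)[t])
              for logits, t in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)

def unlikelihood_loss(step_logits, negative_ids, eps=1e-8):
    """Mean -log(1 - p(token)) over the tokens of an incorrect (negative) answer.

    The loss grows as the model puts more probability on a negative token,
    so minimizing it pushes probability mass away from that token.
    eps guards against log(0) when p is numerically 1.
    """
    losses = [-math.log(max(1.0 - softmax(logits)[t], eps))
              for logits, t in zip(step_logits, negative_ids)]
    return sum(losses) / len(losses)
```

A combined objective could then be written as `likelihood_loss(pos_logits, pos_ids) + alpha * unlikelihood_loss(neg_logits, neg_ids)`, where `alpha` is a hypothetical weighting hyperparameter; how the paper balances or schedules the two terms is not stated in this summary.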

Findings

1. Achieves an NDCG score of 69.23 on VisDial for the generative task.

2. Attains NDCG scores of 75.92 (single model) and 76.17 (ensemble) on the discriminative task.

3. Outperforms prior models on generative visual dialog benchmarks.

Abstract

The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work performs the standard likelihood training for answer generation on the positive instances (involving correct answers). However, the likelihood objective often leads to frequent and dull outputs and fails to exploit the useful knowledge from negative instances (involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation by multi-task learning, our model extends ViLBERT from only supporting answer discrimination to holding both answer discrimination and answer generation seamlessly by different attention masks. Specifically, in order to make the original discriminative model compatible with answer generation,…
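The abstract describes one model serving both answer discrimination and answer generation by switching attention masks. As a rough illustration of that general idea only, the sketch below builds two masks over a single flattened sequence of context positions followed by answer positions: a fully bidirectional mask, and a sequence-to-sequence mask in which answer positions see the context and earlier answer positions. This single-sequence layout and the function names are assumptions made for illustration; the paper's actual masks over ViLBERT's visual and textual streams are not given in this summary.

```python
def bidirectional_mask(n_ctx, n_ans):
    """Every position may attend to every other position (1 = allowed).

    This is the usual setting for scoring a candidate answer.
    """
    n = n_ctx + n_ans
    return [[1] * n for _ in range(n)]

def seq2seq_mask(n_ctx, n_ans):
    """Mask for left-to-right answer generation.

    Rows are query positions, columns are key positions.
    - Context positions attend only to the context.
    - Answer positions attend to the whole context, themselves,
      and earlier answer positions, never to later answer positions.
    """
    n = n_ctx + n_ans
    mask = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < n_ctx:
                mask[q][k] = 1          # context is visible to everyone
            elif q >= n_ctx and k <= q:
                mask[q][k] = 1          # causal visibility inside the answer
    return mask

# Example with 2 context positions and 2 answer positions.
for row in seq2seq_mask(2, 2):
    print(row)
```

With such a design, the same weights can be trained on both tasks, and only the mask passed to the attention layers changes between them.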

Code & Models

Repositories

zihaow123/unimm (PyTorch, official)

Taxonomy

Methods: Vision-and-Language BERT