VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue

Yunshui Li; Binyuan Hui; Zhaochao Yin; Wanwei He; Run Luo; Yuxing Long; Min Yang; Fei Huang; Yongbin Li

arXiv:2309.07387 · cs.CL · September 15, 2023

TL;DR

VDialogUE is a unified benchmark for visually-grounded dialogue systems. It spans five core tasks and six datasets and adds a dedicated evaluation metric, VDscore, to standardize assessment and foster progress in the field.

Contribution

The paper proposes a unified evaluation benchmark with five core tasks, the AHP-based VDscore metric, and the VISIT baseline model to advance multi-modal dialogue research.

Findings

01 VDialogUE covers six datasets and five core multi-modal dialogue tasks.

02 VDscore, based on the Analytic Hierarchy Process (AHP), assesses a model's performance across all tasks.

03 The baseline model VISIT demonstrates effective multi-modal dialogue capabilities.

Abstract

Visually-grounded dialog systems, which integrate multiple modes of communication such as text and visual inputs, have become an increasingly popular area of investigation. However, the absence of a standardized evaluation framework poses a challenge in assessing the development of this field. To this end, we propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation. It defines five core multi-modal dialogue tasks and covers six datasets. Furthermore, in order to provide a comprehensive assessment of the model's performance across all tasks, we developed a novel evaluation metric called VDscore, which is based on the Analytic Hierarchy Process (AHP) method. Additionally, we present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to…
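
The abstract says VDscore is based on the Analytic Hierarchy Process but does not give the construction. As a rough illustration of the general AHP idea only, the sketch below derives priority weights from a pairwise comparison matrix using the row geometric-mean approximation and then takes a weighted sum of per-task scores. The three-task matrix, the score values, and the function names ahp_weights and weighted_score are all hypothetical; none of them come from the paper, and the actual VDscore definition may differ.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights via the row geometric-mean method.

    pairwise[i][j] states how much more important criterion i is than j,
    with pairwise[j][i] == 1 / pairwise[i][j] and ones on the diagonal.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def weighted_score(scores, weights):
    """Combine per-task scores into one number using the given weights."""
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical comparison matrix for three tasks (assumption, not from the paper).
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(pairwise)

# Hypothetical per-task scores, already normalized to [0, 1].
scores = [0.60, 0.40, 0.80]
overall = weighted_score(scores, weights)
print(weights, overall)
```

For a perfectly consistent matrix such as this one, the geometric-mean weights coincide with the principal-eigenvector weights that AHP is usually described with; for inconsistent judgments the two can differ slightly.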

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Speech and dialogue systems