TL;DR
This paper extends the HRED model to incorporate both visual and textual context in multimodal dialogue, improving generated responses on text-based similarity metrics for fashion-domain conversations.
Contribution
It introduces a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and shows that it outperforms strong baselines on text-based similarity metrics for response generation.
Findings
The new model improves text-based similarity metrics
Error analysis exposes shortcomings of current vision and language models
The multimodal extension outperforms strong baselines
Abstract
In this work, we investigate the task of textual response generation in a multimodal task-oriented dialogue system. Our work is based on the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion domain. We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of text-based similarity metrics. We also showcase the shortcomings of current vision and language models by performing an error analysis on our system's output.
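As a rough illustration of what a multimodal hierarchical encoder can look like, the sketch below encodes each turn's tokens with an utterance-level RNN, appends that turn's image features, and feeds the per-turn vectors to a context-level RNN. The abstract does not say how the paper fuses the two modalities, so the concatenation step is an assumption, as are all names (`SimpleRNN`, `encode_dialogue`), the dimensions, and the plain tanh cell standing in for a gated cell. Weights are random and untrained, and the decoder is omitted.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SimpleRNN:
    """Plain tanh RNN; a real HRED would typically use GRU or LSTM cells."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        self.w_in = make_matrix(hidden_size, input_size, rng)
        self.w_rec = make_matrix(hidden_size, hidden_size, rng)

    def encode(self, inputs):
        """Run over a sequence of vectors and return the final hidden state."""
        h = [0.0] * self.hidden_size
        for x in inputs:
            a = matvec(self.w_in, x)
            b = matvec(self.w_rec, h)
            h = [math.tanh(p + q) for p, q in zip(a, b)]
        return h


def encode_dialogue(turns, utterance_rnn, context_rnn):
    """turns: list of (token_embeddings, image_features) pairs, one per turn."""
    turn_vectors = []
    for token_embeddings, image_features in turns:
        text_state = utterance_rnn.encode(token_embeddings)
        # Assumed fusion: concatenate the text state with the image features.
        turn_vectors.append(text_state + image_features)
    # The context-level RNN summarises the whole dialogue history.
    return context_rnn.encode(turn_vectors)


rng = random.Random(0)
EMB, IMG, UTT_H, CTX_H = 4, 3, 5, 6  # hypothetical toy dimensions
utterance_rnn = SimpleRNN(EMB, UTT_H, rng)
context_rnn = SimpleRNN(UTT_H + IMG, CTX_H, rng)

text_only_turn = ([[rng.random() for _ in range(EMB)] for _ in range(3)],
                  [0.0] * IMG)  # zero vector when a turn has no image
image_turn = ([[rng.random() for _ in range(EMB)] for _ in range(2)],
              [rng.random() for _ in range(IMG)])

context = encode_dialogue([text_only_turn, image_turn], utterance_rnn, context_rnn)
print(len(context))
```

In a full system the resulting context vector would condition a decoder that generates the textual response token by token.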
