Are Current Decoding Strategies Capable of Facing the Challenges of Visual Dialogue?
Amit Kumar Chaudhary, Alex J. Lucassen, Ioanna Tsani, Alberto Testoni

TL;DR
This paper evaluates various decoding strategies in visual dialogue systems, revealing their limitations in balancing lexical richness, task accuracy, and visual grounding, and offers insights for developing improved algorithms.
Contribution
It provides a comprehensive comparison of decoding strategies in visual dialogue, highlighting their strengths and weaknesses to guide future improvements.
Findings
None of the strategies successfully balances lexical richness, task accuracy, and visual grounding
Decoding strategies vary significantly in handling visual grounding
The analysis of each strategy's strengths and weaknesses suggests directions for designing more effective decoding algorithms
Abstract
Decoding strategies play a crucial role in natural language generation systems. They are usually designed and evaluated in open-ended text-only tasks, and it is not clear how different strategies handle the numerous challenges that goal-oriented multimodal systems face (such as grounding and informativeness). To answer this question, we compare a wide variety of different decoding strategies and hyper-parameter configurations in a Visual Dialogue referential game. Although none of them successfully balance lexical richness, accuracy in the task, and visual grounding, our in-depth analysis allows us to highlight the strengths and weaknesses of each decoding strategy. We believe our findings and suggestions may serve as a starting point for designing more effective decoding algorithms that handle the challenges of Visual Dialogue tasks.
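To make the term "decoding strategy" concrete, the following is a minimal Python sketch of three widely used strategies: greedy selection, top-k sampling, and nucleus (top-p) sampling. The abstract does not list which strategies or hyper-parameters the paper compares, so these three are chosen purely for illustration. The toy vocabulary, the logits, and all function names are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every token not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def greedy(probs):
    """Greedy decoding: always pick the single most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_filter(probs, k):
    """Top-k sampling: restrict sampling to the k most probable tokens."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return _renormalize(probs, order[:k])


def nucleus_filter(probs, p):
    """Nucleus sampling: keep the smallest set of top tokens with mass >= p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(probs, rng):
    """Draw one token index according to the given distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical next-token scores for a question generator (illustration only).
vocab = ["is", "it", "a", "person", "dog"]
logits = [2.0, 1.0, 0.5, 0.2, -1.0]
probs = softmax(logits)
rng = random.Random(0)  # fixed seed so the sampled choices are repeatable

print("greedy :", vocab[greedy(probs)])
print("top-k  :", vocab[sample(top_k_filter(probs, k=2), rng)])
print("nucleus:", vocab[sample(nucleus_filter(probs, p=0.9), rng)])
```

In this sketch, greedy selection is deterministic, while the two sampling variants trade some probability mass for variety. That trade-off is one plausible reading of the tension between lexical richness and task accuracy that the abstract describes.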
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
