Visual Dialogue without Vision or Dialogue
Daniela Massiceti, Puneet K. Dokania, N. Siddharth, Philip H.S. Torr

TL;DR
This paper introduces a simple CCA-based method for Visual Dialogue that ignores both the visual input and the dialogue sequence, yet achieves near state-of-the-art mean rank with minimal computation, exposing issues in current complex models.
Contribution
The paper presents a minimalistic, gradient-free CCA approach that challenges existing complex models in visual dialogue by highlighting dataset biases and evaluation limitations.
Findings
The CCA method achieves near state-of-the-art mean rank (MR) on the standard dataset
Current models may rely on dataset biases rather than true understanding
Simpler models can perform competitively, questioning the necessity of complex architectures
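Mean rank, the metric named in the first finding, can be illustrated with a small sketch. It assumes each question comes with a scored list of candidate answers and a known ground-truth index; the function name `mean_rank` and the toy scores are hypothetical and not taken from the paper.

```python
import numpy as np

def mean_rank(scores, gt_index):
    """Average 1-based rank of the ground-truth answer (lower is better).

    scores:   (n_questions, n_candidates) array, higher score = better match
    gt_index: (n_questions,) index of the ground-truth candidate per question
    """
    scores = np.asarray(scores, dtype=float)
    gt_index = np.asarray(gt_index)
    # Score assigned to the ground-truth candidate of each question.
    gt_scores = scores[np.arange(len(scores)), gt_index]
    # Rank = 1 + number of candidates scored strictly higher than the truth.
    ranks = 1 + (scores > gt_scores[:, None]).sum(axis=1)
    return ranks.mean()

# Toy data (made up): two questions, three candidate answers each.
toy_scores = [[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.3]]
toy_gt = [0, 2]
mr = mean_rank(toy_scores, toy_gt)
print(mr)  # 1.5
```

In the toy data the ground truth is ranked first for one question and second for the other, so the mean rank is 1.5.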
Abstract
We characterise some of the quirks and shortcomings in the exploration of Visual Dialogue - a sequential question-answering task where the questions and corresponding answers are related through given visual stimuli. To do so, we develop an embarrassingly simple method based on Canonical Correlation Analysis (CCA) that, on the standard dataset, achieves near state-of-the-art performance on mean rank (MR). In direct contrast to current complex and over-parametrised architectures that are both compute and time intensive, our method ignores the visual stimuli, ignores the sequencing of dialogue, does not need gradients, uses off-the-shelf feature extractors, has at least an order of magnitude fewer parameters, and learns in practically no time. We argue that these results are indicative of issues in current approaches to Visual Dialogue and conduct analyses to highlight implicit dataset…
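As a rough illustration of the kind of pipeline the abstract describes, the sketch below fits linear CCA in closed form (no gradients) between two sets of feature vectors and ranks candidate answers by similarity in the shared space. It is not the authors' code. It assumes the two views are question and answer features; the synthetic vectors stand in for off-the-shelf embeddings, and the names `fit_cca` and `rank_candidates`, the regulariser `reg`, and the cosine scoring are choices made here for illustration.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-3):
    """Closed-form linear CCA: returns means, projections, and correlations."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularised covariance and cross-covariance matrices.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root via eigendecomposition (C is SPD).
        w, V = np.linalg.eigh(C)
        return (V * w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, returned in descending order.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return mx, my, Kx @ U[:, :k], Ky @ Vt.T[:, :k], s[:k]

def rank_candidates(q, candidates, model):
    """Sort candidate indices by cosine similarity to q in CCA space."""
    mx, my, Wx, Wy, _ = model
    zq = (q - mx) @ Wx
    zc = (candidates - my) @ Wy
    zq = zq / np.linalg.norm(zq)
    zc = zc / np.linalg.norm(zc, axis=1, keepdims=True)
    return np.argsort(-(zc @ zq))

# Synthetic stand-ins for embeddings: answers are a noisy linear
# function of questions, so the two views are strongly correlated.
rng = np.random.default_rng(0)
Q = rng.normal(size=(520, 8))
A = Q @ rng.normal(size=(8, 6)) + 0.05 * rng.normal(size=(520, 6))
Q_train, A_train, Q_test, A_test = Q[:500], A[:500], Q[500:], A[500:]

model = fit_cca(Q_train, A_train, k=6)
corrs = model[4]
# Held-out question 0; its true answer is candidate 0 among ten.
order = rank_candidates(Q_test[0], A_test[:10], model)
print(order[0])
```

Everything here reduces to a few matrix decompositions, which is why a method of this kind needs no gradient-based training loop.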
Videos
Why Multimodal Machine Learning models do not work. Part 2/2 – The CAUSES (YouTube)
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Speech and dialogue systems
