SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space
Liu Yang

TL;DR
This paper introduces SeqDialN, a sequential visual dialog network that models a dialog as an ordered sequence of joint visual-linguistic representations, one per dialog round. It pairs a lightweight Dense Symmetric Co-Attention fusion module with LSTM-based or Transformer-based inference to achieve state-of-the-art results on VisDial v1.0.
Contribution
The paper proposes a sequence-based approach to visual dialog built on joint visual-linguistic representations, and introduces two inference models: an LSTM-based information propagation (IP) network and a Transformer-based multi-step reasoning (MR) network.
Findings
Achieved a new state of the art on VisDial v1.0 with 63.78% NDCG
Demonstrated effectiveness of Transformer-based reasoning over LSTM
Fine-tuning with dense annotations significantly boosts performance
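For readers unfamiliar with the NDCG figure quoted above, the following is a minimal sketch of how normalized discounted cumulative gain can be computed for one ranked list of candidate answers. The function names `dcg` and `ndcg` are illustrative, not taken from the paper's code. The default cutoff (the number of candidates with non-zero relevance) is an assumption about the evaluation protocol and should be checked against the official VisDial evaluation code.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    # Rank 0 is the top position; the discount is log2(position + 1) with 1-based positions.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(scores, relevances, k=None):
    """NDCG of candidates ranked by descending model score.

    scores:     model score per candidate answer (higher is better).
    relevances: dense relevance annotation per candidate, same order as scores.
    k:          cutoff; defaults to the number of candidates with non-zero
                relevance (an assumption, see the lead-in).
    """
    if len(scores) != len(relevances):
        raise ValueError("scores and relevances must have the same length")
    if k is None:
        k = sum(1 for rel in relevances if rel > 0)
    if k == 0:
        return 0.0  # no relevant candidate, so the ideal DCG is zero
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order[:k]]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Toy example with three candidates; the most relevant one is ranked last.
    print(round(ndcg([0.1, 0.9, 0.5], [1.0, 0.0, 0.5]), 4))
```

A model that ranks every relevant candidate in ideal order scores 1.0; pushing relevant candidates down the list lowers the score.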
Abstract
In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we consider the visual dialog task as a sequence problem consisting of ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computation and data efficiencies. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses LSTM for information propagation (IP) and the second uses a modified Transformer for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows simpler design of the inference engine. IP…
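The sequence formulation in the abstract can be illustrated with a toy sketch: each dialog round is assumed to be already fused into one joint vector, and a recurrent state carries information from earlier rounds to later ones. This is not the authors' implementation. A plain tanh recurrence stands in for the LSTM, random vectors stand in for the co-attention outputs, and all names (`make_matrix`, `propagate`, the dimensions) are hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def propagate(round_vectors, w_in, w_rec):
    """Run a simple tanh recurrence over the ordered per-round vectors.

    round_vectors: one joint visual-linguistic vector per dialog round
                   (assumed to be produced by a separate fusion module).
    Returns one hidden state per round; the state at round t depends only
    on rounds 0..t, mirroring the ordered flow of a dialog.
    """
    hidden = [0.0] * len(w_rec)
    states = []
    for vec in round_vectors:
        from_input = matvec(w_in, vec)
        from_history = matvec(w_rec, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_history)]
        states.append(hidden)
    return states


if __name__ == "__main__":
    rng = random.Random(0)
    feat_dim, hidden_dim, num_rounds = 4, 3, 5  # illustrative sizes only
    w_in = make_matrix(hidden_dim, feat_dim, rng)
    w_rec = make_matrix(hidden_dim, hidden_dim, rng)
    dialog = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(num_rounds)]
    states = propagate(dialog, w_in, w_rec)
    print(len(states), len(states[0]))
```

Because the fusion step is assumed to happen before `propagate` is called, the inference part only ever sees a sequence of fixed-size vectors, which is the separation of concerns the abstract describes.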
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Attention Is All You Need · Label Smoothing · Adam · Dropout · Multi-Head Attention · Softmax
