SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space

Liu Yang

arXiv:2008.00397 · cs.CV · April 28, 2022 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces SeqDialN, a sequential visual dialog network that models dialog as a sequence of joint visual-linguistic representations, utilizing lightweight fusion and advanced reasoning modules to achieve state-of-the-art results.

Contribution

The paper proposes a novel sequence-based approach for visual dialog using joint representations and introduces two inference models, including a Transformer-based multi-step reasoning network.

Findings

01 Achieved new state-of-the-art on VisDial v1.0 with 63.78% NDCG

02 Demonstrated effectiveness of Transformer-based reasoning over LSTM

03 Fine-tuning with dense annotations significantly boosts performance

Abstract

In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we consider the visual dialog task as a sequence problem consisting of ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computation and data efficiencies. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses LSTM for information propagation (IP) and the second uses a modified Transformer for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows simpler design of the inference engine. IP…
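
The sequence formulation in the abstract can be illustrated with a minimal sketch. It assumes each dialog round has already been fused into one joint visual-linguistic vector (here replaced by random toy vectors) and runs a small hand-written LSTM cell over the ordered rounds, in the spirit of the information propagation (IP) variant. The names `TinyLSTMCell` and `propagate`, the dimensions, and the weight initialization are all hypothetical and chosen for illustration; this is not the authors' implementation, and the Dense Symmetric Co-Attention fusion step is not modeled.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_matrix(rows, cols, rng):
    # Small random weights; a real model would learn these.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyLSTMCell:
    """Hypothetical minimal LSTM cell (illustration only)."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.weights = {
            gate: make_matrix(hidden_dim, input_dim + hidden_dim, rng)
            for gate in self.GATES
        }

    def step(self, x, h, c):
        xh = x + h  # list concatenation: [x; h]
        pre = {gate: matvec(self.weights[gate], xh) for gate in self.GATES}
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        o = [sigmoid(v) for v in pre["output"]]
        cand = [math.tanh(v) for v in pre["candidate"]]
        # Standard LSTM update: blend the old cell state with the new candidate.
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, cand)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def propagate(round_vectors, cell):
    """Run the cell over the ordered per-round vectors; return one state per round."""
    h = [0.0] * cell.hidden_dim
    c = [0.0] * cell.hidden_dim
    states = []
    for x in round_vectors:
        h, c = cell.step(x, h, c)
        states.append(h)
    return states


# Toy dialog: 10 rounds, each an 8-dimensional stand-in for a joint
# visual-linguistic vector (assumed sizes, not taken from the paper).
rng = random.Random(1)
dialog = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(10)]
cell = TinyLSTMCell(input_dim=8, hidden_dim=4)
states = propagate(dialog, cell)
print(len(states), len(states[0]))  # 10 4
```

Because the recurrence only looks backward, the state for round t depends on rounds 1 through t alone, which is one way to read the "information flow" framing: fusion happens once per round, and the inference module only has to handle an ordered list of vectors.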

Code & Models

Repositories

xiaoxiaoheimei/SeqDialN
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Attention Is All You Need · Label Smoothing · Adam · Dropout · Multi-Head Attention · Softmax