Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue

Shoya Matsumori, Kosuke Shingyouchi, Yuki Abe, Yosuke Fukuchi, Komei Sugiura, and Michita Imai

arXiv:2106.15550 · cs.CV · June 30, 2021

TL;DR

This paper introduces UniQer, a transformer-based model for generating descriptive questions in goal-oriented visual dialogue, and presents a new dataset, CLEVR Ask, to evaluate complex scene understanding.

Contribution

The paper proposes a novel Unified Questioner Transformer architecture and a new dataset for complex, descriptive question generation in visual dialogue.

Findings

01. UniQer outperforms baseline models in quantitative evaluations.

02. The CLEVR Ask dataset enables testing of complex scene understanding.

03. Descriptive questions improve object differentiation in visual dialogue.

Abstract

Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems. In particular, goal-oriented visual dialogue, where the aim of the agent is to seek information by asking questions during a turn-taking dialogue, has been gaining scholarly attention recently. While several existing models based on the GuessWhat?! dataset have been proposed, the Questioner typically asks simple category-based questions or absolute spatial questions. This might be problematic for complex scenes where the objects share attributes or in cases where descriptive questions are required to distinguish objects. In this paper, we propose a novel Questioner architecture, called Unified Questioner Transformer (UniQer), for descriptive question generation with referring expressions. In addition, we build a goal-oriented…

Code & Models

Repositories

smatsumori/uniqer (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dropout · Label Smoothing