Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue
Shoya Matsumori, Kosuke Shingyouchi, Yuki Abe, Yosuke Fukuchi, Komei Sugiura, and Michita Imai

TL;DR
This paper introduces UniQer, a transformer-based model for generating descriptive questions in goal-oriented visual dialogue, and presents a new dataset, CLEVR Ask, to evaluate complex scene understanding.
Contribution
The paper proposes a novel Unified Questioner Transformer architecture and a new dataset for complex, descriptive question generation in visual dialogue.
Findings
UniQer outperforms baseline models in quantitative evaluations.
The CLEVR Ask dataset enables testing of complex scene understanding.
Descriptive questions improve object differentiation in visual dialogue.
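The last finding can be illustrated with a toy example. The sketch below is not UniQer and does not use CLEVR Ask. It assumes a hypothetical scene of objects with `color`, `shape`, and `size` attributes, and models a yes/no question as a predicate. It only shows why a category-based question can leave several candidates when objects share attributes, while a descriptive question can isolate the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    # Hypothetical attribute set, chosen for illustration only.
    color: str
    shape: str
    size: str


def candidates_after(scene, target, question):
    """Keep the objects whose yes/no answer matches the target's answer."""
    answer = question(target)
    return [obj for obj in scene if question(obj) == answer]


def is_cube(obj):
    # Category-based question: "Is it a cube?"
    return obj.shape == "cube"


def is_small_red_cube(obj):
    # Descriptive question: "Is it the small red cube?"
    return (obj.size, obj.color, obj.shape) == ("small", "red", "cube")


scene = [
    SceneObject("red", "cube", "large"),
    SceneObject("red", "cube", "small"),
    SceneObject("blue", "cube", "large"),
    SceneObject("red", "sphere", "large"),
]
target = scene[1]

after_category = candidates_after(scene, target, is_cube)
after_descriptive = candidates_after(scene, target, is_small_red_cube)

print(len(after_category), len(after_descriptive))  # prints "3 1"
```

In this toy scene the category question still leaves three cubes as candidates, whereas the descriptive question narrows the set to the target in a single turn.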
Abstract
Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems. In particular, goal-oriented visual dialogue, where the aim of the agent is to seek information by asking questions during a turn-taking dialogue, has been gaining scholarly attention recently. While several existing models based on the GuessWhat?! dataset have been proposed, the Questioner typically asks simple category-based questions or absolute spatial questions. This might be problematic for complex scenes where the objects share attributes or in cases where descriptive questions are required to distinguish objects. In this paper, we propose a novel Questioner architecture, called Unified Questioner Transformer (UniQer), for descriptive question generation with referring expressions. In addition, we build a goal-oriented…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dropout · Label Smoothing
