SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented Dialogue with Symbolic Scene Representation

Bhathiya Hemanthage; Christian Dondrup; Phil Bartie; Oliver Lemon

arXiv:2307.04907·cs.CL·July 12, 2023

Open Access

TL;DR

SimpleMTOD is a transformer-based multimodal dialogue model that effectively integrates visual scene semantics using symbolic tokens, achieving state-of-the-art results with a minimalist design and transfer learning from GPT-2.

Contribution

It introduces symbolic scene representation with local and de-localized tokens in a simple, transfer learning-based architecture for multimodal task-oriented dialogue.

Findings

01  Achieves a state-of-the-art BLEU score of 0.327 in response generation on the SIMMC 2.0 test-std set.

02  Performs on par with the state of the art in disambiguation, coreference resolution, and dialog state tracking.

03  Uses a minimalist approach without task-specific architectural modifications.

Abstract

SimpleMTOD is a simple language model that recasts several sub-tasks in multimodal task-oriented dialogues as sequence prediction tasks. SimpleMTOD is built on a large-scale transformer-based auto-regressive architecture, which has already proven successful in uni-modal task-oriented dialogues, and effectively leverages transfer learning from pre-trained GPT-2. In order to capture the semantics of visual scenes, we introduce both local and de-localized tokens for objects within a scene. De-localized tokens represent the type of an object rather than the specific object itself and so possess a consistent meaning across the dataset. SimpleMTOD achieves a state-of-the-art BLEU score (0.327) in the Response Generation sub-task of the SIMMC 2.0 test-std dataset while performing on par in other multimodal sub-tasks: Disambiguation, Coreference Resolution, and Dialog State Tracking.…
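
The idea of pairing local and de-localized tokens inside a single flat sequence can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the token spellings (`<OBJ_i>`, `<TYPE_k>`, `<CONTEXT>`, `<SCENE>`, `<TARGET>`, `<EOS>`), the `SceneObject` fields, and the order of the segments. The sketch only shows how a scene could be flattened into symbolic tokens so that an auto-regressive language model sees dialogue and scene as one string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    local_id: int  # index of the object within one scene (meaning is scene-specific)
    type_id: int   # hypothetical object-type index shared across the whole dataset


def scene_tokens(objects: List[SceneObject]) -> List[str]:
    """Emit one local token followed by one de-localized token per object."""
    tokens = []
    for obj in objects:
        tokens.append(f"<OBJ_{obj.local_id}>")   # local: refers to this object in this scene
        tokens.append(f"<TYPE_{obj.type_id}>")  # de-localized: refers to the object's type
    return tokens


def build_sequence(history: List[str], objects: List[SceneObject], target: str) -> str:
    """Flatten dialogue history, scene, and target into one training string.

    The segment markers are hypothetical; the point is only that every
    sub-task becomes next-token prediction over a single sequence.
    """
    parts = ["<CONTEXT>", *history, "<SCENE>", *scene_tokens(objects), "<TARGET>", target, "<EOS>"]
    return " ".join(parts)


# Two scenes containing the same type of object at different local positions.
scene_a = [SceneObject(local_id=0, type_id=17), SceneObject(local_id=1, type_id=42)]
scene_b = [SceneObject(local_id=0, type_id=42)]

sequence = build_sequence(
    ["USER: show me a jacket"], scene_a, "SYSTEM: how about this one"
)
print(sequence)
```

In this sketch `<OBJ_0>` points at different items in `scene_a` and `scene_b`, while `<TYPE_42>` appears in both scenes with the same meaning, which is the property the abstract attributes to de-localized tokens.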

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Layer Normalization · Attention Dropout · Weight Decay · Adam · Dense Connections