TCT: A Cross-supervised Learning Method for Multimodal Sequence Representation

Wubo Li; Wei Zou; Xiangang Li

arXiv:1911.05186·cs.CV·November 14, 2019

TL;DR

This paper introduces TCT, a cross-supervised learning method using transformers to improve multimodal sequence representations, achieving state-of-the-art results in video-grounded dialogue tasks.

Contribution

The paper proposes TCT, a novel transformer-based cross-supervised learning approach for multimodal sequences, enhancing semantic representation over traditional unimodal methods.

Findings

01. TCT improves semantic quality of multimodal representations.

02. MTN-TCT achieves new state-of-the-art in video-grounded dialogue.

03. Learned representations outperform direct unimodal approaches.

Abstract

Multimodal approaches generally perform better than unimodal ones in most tasks. However, efficiently learning the semantics of multimodal representations is extremely challenging. To tackle this, we propose the Transformer-based Cross-modal Translator (TCT), which learns unimodal sequence representations by translating from other related multimodal sequences in a supervised manner. Combining TCT with the Multimodal Transformer Network (MTN), we evaluate MTN-TCT on video-grounded dialogue, a task that uses multiple modalities. The proposed method achieves new state-of-the-art performance on video-grounded dialogue, which indicates that the representations learned by TCT are more semantically meaningful than those obtained by using unimodal inputs directly.
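
As a rough illustration of what "translating" between modality sequences with a transformer can involve, the sketch below implements plain scaled dot-product cross-attention, the standard mechanism by which a transformer decoder reads another sequence. It is not the authors' TCT architecture, which this page does not specify. The function names, the choice of text as the target side and video as the source side, and the toy feature values are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: vectors from the target modality (assumed here: text positions)
    keys, values: vectors from the source modality (assumed here: video frames)
    Returns (outputs, weights), one row per query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every source position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted average of the source values.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy, made-up features: 2 target-side positions attend over 3 source-side frames.
text_queries = [[1.0, 0.0], [0.0, 1.0]]
video_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = cross_attention(text_queries, video_keys, video_values)
```

Each output row is a weighted average of the source-modality values, so the target-side representation is built from the other modality. In a supervised translation setup, outputs of this kind would be passed through further layers and trained against the target sequence.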

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Speech and dialogue systems

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax