# Multimodal Transformer for Unaligned Multimodal Language Sequences

**Authors:** Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov

arXiv: 1906.00295 · 2019-06-04

## TL;DR

The paper introduces MulT, a multimodal transformer that models unaligned multimodal language sequences by capturing crossmodal interactions without explicit data alignment. It outperforms prior state-of-the-art methods on both aligned and non-aligned multimodal time series.

## Contribution

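The Multimodal Transformer (MulT) is built around directional pairwise crossmodal attention, which attends to interactions between modalities across distinct time steps and latently adapts streams from one modality to another. This lets the model process unaligned multimodal sequences end to end, without an explicit alignment step.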
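To make the idea of directional crossmodal attention concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `crossmodal_attention` and the toy `language` and `audio` sequences are hypothetical, the learned query/key/value projections are left out, and only a single attention head is shown. Queries come from the target modality while keys and values come from the source modality, so the two sequences may have different lengths.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target, source):
    """Attend from `target` (queries) to `source` (keys and values).

    target: list of T_target vectors, each of dimension d
    source: list of T_source vectors, each of dimension d
    Returns T_target output vectors of dimension d and the attention weights.
    Assumption: learned projections are omitted (identity) for brevity.
    """
    d = len(target[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in target:
        # Scaled dot-product score of this target step against every source step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in source]
        w = softmax(scores)
        # Weighted sum of source vectors, expressed at this target time step.
        out = [sum(w[j] * source[j][i] for j in range(len(source))) for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data: 3 "language" steps attend to 5 "audio" steps (no alignment needed).
language = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.1], [0.2, 0.9], [1.0, 1.0], [0.0, 0.3], [0.7, 0.7]]
adapted, attn = crossmodal_attention(language, audio)
print(len(adapted), len(adapted[0]), len(attn[0]))  # prints: 3 2 5
```

The output has one vector per target step, and each row of `attn` is a distribution over all source steps. Swapping the arguments gives the opposite direction, which is why the attention is described as directional and pairwise.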

## Key findings

- Outperforms state-of-the-art methods by a large margin on both aligned and non-aligned multimodal time series
- Empirical analysis suggests that the crossmodal attention mechanism captures correlated crossmodal signals
- Demonstrates robustness across different sampling rates

## Abstract

Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that correlated crossmodal signals are able to be captured by the proposed crossmodal attention mechanism in MulT.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.00295/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/1906.00295/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/1906.00295/full.md

---
Source: https://tomesphere.com/paper/1906.00295