# MUTAN: Multimodal Tucker Fusion for Visual Question Answering

**Authors:** Hedi Ben-younes, Rémi Cadene, Matthieu Cord, Nicolas Thome

arXiv: 1705.06676 · 2017-05-23

## TL;DR

MUTAN uses a tensor-based Tucker decomposition to parametrize bilinear fusion of visual and textual representations in Visual Question Answering. This keeps model complexity under control while preserving interpretable fusion relations, and the authors report state-of-the-art results.

## Contribution

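The paper proposes a Tucker decomposition-based fusion method for VQA, complemented by a low-rank matrix constraint on the interaction rank. The method trades off complexity against interpretability, generalizes some recent VQA architectures, and is reported to give state-of-the-art results.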

## Key findings

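- Reports state-of-the-art results on VQA
- Keeps high-dimensional bilinear interactions tractable by factorizing them with a Tucker decomposition
- Explicitly constrains the interaction rank with an additional low-rank matrix-based decomposition
- Preserves interpretable fusion relations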
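
To make the dimensionality point concrete, the sketch below compares the parameter count of a full bilinear interaction tensor with that of a Tucker factorization (three factor matrices plus a core tensor). The function names and all sizes are illustrative assumptions chosen for this example; they are not taken from this summary.

```python
def full_bilinear_params(d_q, d_v, d_out):
    """Parameters of an unfactorized bilinear tensor of shape (d_q, d_v, d_out)."""
    return d_q * d_v * d_out


def tucker_params(d_q, d_v, d_out, t_q, t_v, t_o):
    """Parameters of a Tucker factorization of the same tensor.

    Factor matrices: (d_q, t_q), (d_v, t_v), (t_o, d_out).
    Core tensor: (t_q, t_v, t_o).
    """
    return d_q * t_q + d_v * t_v + t_q * t_v * t_o + t_o * d_out


# Illustrative sizes only (assumed): question dim, image dim, number of answers,
# followed by the projected dimensions of the Tucker factorization.
full = full_bilinear_params(2400, 2048, 2000)
tucker = tucker_params(2400, 2048, 2000, 310, 310, 510)

print(f"full bilinear tensor: {full:,} parameters")
print(f"Tucker factorization: {tucker:,} parameters")
print(f"reduction factor:     {full / tucker:.1f}x")
```

Under these assumed sizes the core tensor dominates the factorized count, which is why a further constraint on the core is a natural next step.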

## Abstract

Bilinear models provide an appealing framework for mixing and merging information in Visual Question Answering (VQA) tasks. They help to learn high-level associations between question meaning and visual concepts in the image, but they suffer from huge dimensionality issues. We introduce MUTAN, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations. In addition to the Tucker framework, we design a low-rank matrix-based decomposition to explicitly constrain the interaction rank. With MUTAN, we control the complexity of the merging scheme while keeping nice interpretable fusion relations. We show how our MUTAN model generalizes some of the latest VQA architectures, providing state-of-the-art results.
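
The following is a minimal NumPy sketch of Tucker-style bilinear fusion as described in the abstract, not the authors' implementation. It assumes random weights, tiny dimensions, and no nonlinearities or dropout; all variable names are hypothetical. It checks that projecting each modality first and then contracting with a small core tensor gives the same output as applying the reconstructed full bilinear tensor. The last part shows one assumed way to constrain the interaction rank: building each output slice of the core as a sum of `R` rank-one matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) sizes: question, image, and output dimensions,
# followed by the smaller projected dimensions.
d_q, d_v, d_out = 8, 6, 5
t_q, t_v, t_o = 4, 3, 2

W_q = rng.standard_normal((d_q, t_q))        # question factor matrix
W_v = rng.standard_normal((d_v, t_v))        # image factor matrix
W_o = rng.standard_normal((t_o, d_out))      # output factor matrix
core = rng.standard_normal((t_q, t_v, t_o))  # core tensor


def tucker_fusion(q, v, core_tensor):
    """Fuse q and v without ever forming the full (d_q, d_v, d_out) tensor."""
    q_t = q @ W_q                                        # project question
    v_t = v @ W_v                                        # project image
    z = np.einsum("i,j,ijk->k", q_t, v_t, core_tensor)   # bilinear interaction
    return z @ W_o                                       # map to output space


def full_tensor(core_tensor):
    """Reconstruct the full bilinear tensor from its Tucker factors."""
    return np.einsum("ai,bj,ijk,kc->abc", W_q, W_v, core_tensor, W_o)


q = rng.standard_normal(d_q)
v = rng.standard_normal(d_v)

y_tucker = tucker_fusion(q, v, core)
y_full = np.einsum("a,b,abc->c", q, v, full_tensor(core))
print(np.allclose(y_tucker, y_full))

# Assumed low-rank variant: each slice core_lr[:, :, k] is a sum of R
# rank-one matrices, so its matrix rank is at most R.
R = 2
M = rng.standard_normal((R, t_q, t_o))
N = rng.standard_normal((R, t_v, t_o))
core_lr = np.einsum("rik,rjk->ijk", M, N)
y_low_rank = tucker_fusion(q, v, core_lr)
```

The factorized path only multiplies by small matrices and a small core, which is where the savings come from when `d_q`, `d_v`, and `d_out` are large.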

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1705.06676/full.md

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/1705.06676/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/1705.06676/full.md

---
Source: https://tomesphere.com/paper/1705.06676