Efficient Large-Scale Multi-Modal Classification

D. Kiela; E. Grave; A. Joulin; T. Mikolov

arXiv:1802.02892 · cs.CL · February 9, 2018 · 36 cites

TL;DR

This paper explores efficient methods for large-scale multi-modal classification that combine text with visual features. It shows that adding visual information improves accuracy over text-only models, and that discretizing the visual features speeds up and simplifies fusion while improving interpretability.

Contribution

The paper introduces a simple, computationally efficient fusion approach based on discretized visual features. The approach outperforms text-only classification and enhances interpretability.

Findings

01. Fusion with discretized visual features improves classification accuracy over text-only models.

02. Discretizing the continuous features speeds up and simplifies the fusion process.

03. Fusion with discretized features enhances interpretability.

Abstract

While the incipient internet was largely text-based, the modern digital world is becoming increasingly multi-modal. Here, we examine multi-modal classification where one modality is discrete, e.g. text, and the other is continuous, e.g. visual representations transferred from a convolutional neural network. In particular, we focus on scenarios where we have to be able to classify large quantities of data quickly. We investigate various methods for performing multi-modal fusion and analyze their trade-offs in terms of classification accuracy and computational efficiency. Our findings indicate that the inclusion of continuous information improves performance over text-only on a range of multi-modal classification tasks, even with simple fusion methods. In addition, we experiment with discretizing the continuous features in order to speed up and simplify the fusion process even further.…
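
The abstract does not say how the continuous features are discretized, so the following is only a minimal sketch of one possible scheme: split the feature vector into equal-width sub-vectors, map each sub-vector to its nearest centroid in a per-position codebook, and append the resulting pseudo-word tokens to the text. The function names, the `__img_` token prefix, and the hand-written codebooks are all hypothetical. In practice the codebooks would be learned from data, for example with k-means.

```python
def nearest(chunk, centroids):
    """Return the index of the centroid closest to chunk (squared Euclidean)."""
    def sqdist(centroid):
        return sum((a - b) ** 2 for a, b in zip(chunk, centroid))
    return min(range(len(centroids)), key=lambda k: sqdist(centroids[k]))


def discretize(vector, codebooks):
    """Map a continuous vector to one discrete token per sub-vector.

    Assumption: `codebooks` holds one list of centroids per sub-vector,
    and the vector length is divisible by the number of codebooks.
    """
    n = len(codebooks)
    if n == 0 or len(vector) % n != 0:
        raise ValueError("vector length must be divisible by number of codebooks")
    width = len(vector) // n
    tokens = []
    for i, centroids in enumerate(codebooks):
        chunk = vector[i * width:(i + 1) * width]
        # Hypothetical token format: position index plus centroid index.
        tokens.append(f"__img_{i}_{nearest(chunk, centroids)}")
    return tokens


def fuse(text, vector, codebooks):
    """Append the visual tokens to the whitespace-tokenised text."""
    return text.split() + discretize(vector, codebooks)


# Toy codebooks (assumed, not learned): two sub-vectors, two centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
tokens = fuse("red shoes", [0.9, 1.1, 0.1, -0.1], codebooks)
print(tokens)  # ['red', 'shoes', '__img_0_1', '__img_1_0']
```

Once the visual information is expressed as tokens, the fused sequence can be passed to any bag-of-words text classifier without changing the model, which is one way to read the abstract's claim that discretization simplifies fusion.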

Taxonomy

Topics: Text and Document Classification Technologies · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications