Efficient Large-Scale Multi-Modal Classification
D. Kiela, E. Grave, A. Joulin, T. Mikolov

TL;DR
This paper studies efficient large-scale multi-modal classification that combines text with visual features. It reports that fusing discretized visual features improves accuracy and interpretability over text-only models, while speeding up and simplifying the fusion itself.
Contribution
It introduces a simple, computationally efficient fusion approach using discretized visual features that outperforms text-only classification and enhances interpretability.
Findings
Fusing discretized visual features with text improves classification accuracy over text-only models.
Discretizing the continuous features speeds up and simplifies fusion, reducing its computational cost.
Fusion with discretized features enhances interpretability.
Abstract
While the incipient internet was largely text-based, the modern digital world is becoming increasingly multi-modal. Here, we examine multi-modal classification where one modality is discrete, e.g. text, and the other is continuous, e.g. visual representations transferred from a convolutional neural network. In particular, we focus on scenarios where we have to be able to classify large quantities of data quickly. We investigate various methods for performing multi-modal fusion and analyze their trade-offs in terms of classification accuracy and computational efficiency. Our findings indicate that the inclusion of continuous information improves performance over text-only on a range of multi-modal classification tasks, even with simple fusion methods. In addition, we experiment with discretizing the continuous features in order to speed up and simplify the fusion process even further.…
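The discretization idea in the abstract can be illustrated with a minimal sketch. It assumes a product-quantization-style scheme: the continuous feature vector is split into sub-vectors, each sub-vector is mapped to its nearest centroid, and the resulting centroid IDs are appended to the text as extra tokens. The text above does not say which discretization method the paper uses, so this scheme, the helper names (`nearest`, `discretize`, `fuse`), the `__img_` token format, and the toy codebook values are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(sub: Vector, centroids: Sequence[Vector]) -> int:
    """Return the index of the centroid closest to `sub` (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in centroids]
    return dists.index(min(dists))


def discretize(features: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[str]:
    """Split `features` into one sub-vector per codebook and emit one token each."""
    n_sub = len(codebooks)
    if len(features) % n_sub != 0:
        raise ValueError("feature length must be divisible by the number of codebooks")
    width = len(features) // n_sub
    tokens = []
    for i, centroids in enumerate(codebooks):
        sub = features[i * width:(i + 1) * width]
        # Token encodes (sub-space index, centroid index); the format is an assumption.
        tokens.append(f"__img_{i}_{nearest(sub, centroids)}")
    return tokens


def fuse(text_tokens: Sequence[str], image_features: Vector,
         codebooks: Sequence[Sequence[Vector]]) -> List[str]:
    """Append discretized image tokens to the text tokens of one example."""
    return list(text_tokens) + discretize(image_features, codebooks)


# Toy codebooks: 2 sub-spaces with 2 centroids each. These values are made up;
# in practice they would be learned from data, for example with k-means.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
]

example = fuse(["red", "shoes"], (0.9, 0.8, 0.1, 0.9), CODEBOOKS)
print(example)  # ['red', 'shoes', '__img_0_1', '__img_1_0']
```

Once the visual features are tokens, a standard bag-of-words text classifier can consume both modalities without a separate fusion layer, which is one plausible reading of how discretization simplifies fusion. Because each image token corresponds to a specific centroid, the classifier's weights on those tokens can also be inspected directly.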
Taxonomy
Topics: Text and Document Classification Technologies · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
