Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

Hassan Akbari; Dan Kondratyuk; Yin Cui; Rachel Hornung; Huisheng Wang; Hartwig Adam

arXiv:2305.06324 · cs.CV · December 12, 2023 · 6 cites

Open Access

TL;DR

This paper introduces IMP, a scalable multimodal perception model that combines alternating gradient descent and mixture-of-experts to efficiently handle diverse modalities and tasks, achieving state-of-the-art results in zero-shot video classification.

Contribution

The paper proposes a novel integrated multimodal perception framework using AGD and MoE, enabling efficient scaling and improved performance across multiple modalities and tasks.

Findings

1) AGD on diverse modalities enhances model performance.

2) MoE significantly improves accuracy and mitigates modality conflicts.

3) Sparse IMP-MoE-L achieves state-of-the-art zero-shot video classification with reduced training cost.
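
To make the sparse MoE idea in the findings above concrete, here is a minimal pure-Python sketch of a sparsely gated layer. It is a toy illustration only, not the paper's implementation. The top-1 token-choice routing, the softmax gate, the hand-picked `router_weights`, and the two toy `experts` are all assumptions chosen for brevity; this page does not specify IMP's routing scheme.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top1(token, router_weights):
    """Return (expert_index, gate_probability) for a single token."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def moe_layer(tokens, router_weights, experts):
    """Send each token to exactly one expert and scale the output by its gate."""
    outputs, assignments = [], []
    for token in tokens:
        idx, gate = route_top1(token, router_weights)
        expert_out = experts[idx](token)  # only the chosen expert runs
        outputs.append([gate * v for v in expert_out])
        assignments.append(idx)
    return outputs, assignments

# Hypothetical toy setup: 2 experts, 2-dimensional tokens.
router_weights = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    lambda t: [2.0 * v for v in t],   # assumed "expert 0"
    lambda t: [-1.0 * v for v in t],  # assumed "expert 1"
]
tokens = [[3.0, 0.0], [0.0, 3.0], [1.0, 2.0]]
outputs, assignments = moe_layer(tokens, router_weights, experts)
print(assignments)
```

Because each token activates a single expert, compute per token stays roughly constant as the number of experts grows, which is the general sense in which sparsification can add capacity without a matching increase in cost.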

Abstract

We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves the performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigates the…
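
The alternating update pattern described in the abstract can be sketched with a toy example: one shared parameter vector, several task losses, and a single task's gradient applied at each step. This is a hedged illustration rather than the authors' training loop. The quadratic losses, the task names, the round-robin schedule, and the learning rate are assumptions made for the example; the abstract also mentions alternating over loss functions and input resolutions, which this sketch omits.

```python
def make_quadratic_task(target):
    """Return (loss, grad) for L(w) = 0.5 * sum((w_i - target_i)^2)."""
    def loss(w):
        return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))

    def grad(w):
        return [wi - ti for wi, ti in zip(w, target)]

    return loss, grad

def alternating_gradient_descent(w, tasks, steps, lr):
    """At step t, update the shared weights using task (t mod number of tasks) only."""
    names = list(tasks)
    for t in range(steps):
        _, grad = tasks[names[t % len(names)]]  # assumed round-robin schedule
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical tasks standing in for different modality pairs.
tasks = {
    "image_text": make_quadratic_task([1.0, 0.0]),
    "video_text": make_quadratic_task([0.0, 1.0]),
    "audio_text": make_quadratic_task([1.0, 1.0]),
}

w0 = [0.0, 0.0]
w_final = alternating_gradient_descent(w0, tasks, steps=300, lr=0.1)

total_before = sum(loss(w0) for loss, _ in tasks.values())
total_after = sum(loss(w_final) for loss, _ in tasks.values())
print(total_before, total_after)
```

In this toy setting the shared weights settle near a compromise between the three task targets, so the summed loss drops even though no single step ever sees more than one task.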

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing