Scaling Law Hypothesis for Multimodal Model

Qingyun Sun; Zhen Guo; PIN AI Team

arXiv:2409.06754 · cs.LG · November 12, 2024

TL;DR

This paper proposes a scaling law hypothesis for multimodal models that predicts performance from modality-specific compression and tokenization efficiency, and asks whether training on more data across modalities can permit a smaller model for efficient deployment.

Contribution

It extends established scaling laws from text-based decoder models to mixed-modality systems and explores whether more multimodal training data can reduce model size for deployment on resource-constrained devices.

Findings

01 Performance is hypothesized to be predictable from modality-specific compression and tokenization efficiency.

02 More training data across multiple modalities may allow a smaller model.

03 Smaller models could then be deployed on resource-constrained devices.

Abstract

We propose a scaling law hypothesis for multimodal models processing text, audio, images, and video within a shared token and embedding space. Our framework predicts model performance based on modality-specific compression and tokenization efficiency, extending established scaling laws from text-based decoder models to mixed-modality systems. We explore whether leveraging more training data in multiple modalities can reduce the size of the multimodal model, enabling efficient deployment on resource-constrained devices.
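The trade-off described in the abstract can be sketched numerically. The following is a minimal illustration, not the paper's formula: it assumes the Chinchilla-style loss form L(N, D) = E + A / N^alpha + B / D^beta with the text-only constants reported by Hoffmann et al. (2022), and it models "compression and tokenization efficiency" as a hypothetical `bytes_per_token` factor per modality. The names `Modality`, `effective_tokens`, `predicted_loss`, and `params_for_target`, and all example data sizes, are assumptions introduced here.

```python
from dataclasses import dataclass

# Constants fitted by Hoffmann et al. (2022) for text-only models.
# Reusing them for mixed-modality data is an assumption for illustration.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


@dataclass
class Modality:
    name: str
    raw_bytes: float        # amount of raw training data
    bytes_per_token: float  # hypothetical tokenizer compression factor


def effective_tokens(modalities):
    """Total token count after each modality is compressed by its tokenizer."""
    return sum(m.raw_bytes / m.bytes_per_token for m in modalities)


def predicted_loss(n_params, n_tokens):
    """Chinchilla-form loss: E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def params_for_target(target_loss, n_tokens):
    """Smallest parameter count reaching target_loss, or None if unreachable."""
    gap = target_loss - E - B / n_tokens**BETA
    if gap <= 0:
        return None  # the data term alone already exceeds the loss budget
    return (A / gap) ** (1 / ALPHA)


# Hypothetical corpora: text only, then text plus images.
text = Modality("text", raw_bytes=4e12, bytes_per_token=4.0)
image = Modality("image", raw_bytes=5e14, bytes_per_token=500.0)

n_text_only = params_for_target(2.0, effective_tokens([text]))
n_multimodal = params_for_target(2.0, effective_tokens([text, image]))
print(f"text only:  {n_text_only:.3e} parameters")
print(f"multimodal: {n_multimodal:.3e} parameters")
```

Under these assumptions, adding image tokens raises the effective token count, so the same target loss is reached with fewer parameters. Whether tokens from different modalities contribute equally to the data term is exactly the kind of question the hypothesis leaves open; this sketch treats them as interchangeable only for simplicity.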

Taxonomy

Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation