Scaling Law Hypothesis for Multimodal Model
Qingyun Sun, Zhen Guo, PIN AI Team

TL;DR
This paper introduces a scaling law hypothesis for multimodal models that predicts performance from modality-specific compression and tokenization efficiency, and investigates whether additional multimodal training data can reduce model size for efficient deployment.
Contribution
It extends established scaling laws to multimodal systems and explores data-driven model size reduction for resource-efficient deployment.
Findings
Performance is hypothesized to be predictable from modality-specific compression and tokenization efficiency.
Training on more data across multiple modalities may allow a smaller model to reach the same performance.
Smaller models could then be deployed on resource-constrained devices.
Abstract
We propose a scaling law hypothesis for multimodal models processing text, audio, images, and video within a shared token and embedding space. Our framework predicts model performance based on modality-specific compression and tokenization efficiency, extending established scaling laws from text-based decoder models to mixed-modality systems. We explore whether leveraging more training data in multiple modalities can reduce the size of the multimodal model, enabling efficient deployment on resource-constrained devices.
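To make the data-for-size trade-off in the abstract concrete, here is a minimal sketch. It is not the paper's formula. It assumes a loss of the form L(N, D) = E + A / N^alpha + B / D^beta, a shape commonly used for text-only decoder models, and it assumes each modality contributes "effective" tokens equal to its raw token count times a hypothetical efficiency weight standing in for compression and tokenization efficiency. The constants, the `Modality` class, and all function names are illustrative placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modality:
    name: str
    raw_tokens: float   # tokens emitted by this modality's tokenizer
    efficiency: float   # hypothetical weight in (0, 1]: useful information per token


# Placeholder constants for L(N, D) = E + A / N**ALPHA + B / D**BETA.
# They are illustrative only and are not fitted to any multimodal model.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def effective_tokens(modalities):
    """Assumed aggregation: efficiency-weighted sum of raw tokens over modalities."""
    return sum(m.raw_tokens * m.efficiency for m in modalities)


def predicted_loss(n_params, modalities):
    """Loss predicted by the assumed form for a model with n_params parameters."""
    d_eff = effective_tokens(modalities)
    return E + A / n_params**ALPHA + B / d_eff**BETA


def min_params_for_loss(target_loss, modalities):
    """Smallest parameter count that reaches target_loss, or None if unreachable."""
    d_eff = effective_tokens(modalities)
    # Loss budget left for the parameter term after the irreducible and data terms.
    budget = target_loss - E - B / d_eff**BETA
    if budget <= 0:
        return None
    return (A / budget) ** (1.0 / ALPHA)


if __name__ == "__main__":
    text_only = [Modality("text", 2e11, 1.0)]
    mixed = text_only + [Modality("image", 4e11, 0.5)]
    for label, mix in (("text only", text_only), ("text + image", mixed)):
        print(label, min_params_for_loss(2.2, mix))
```

Under these assumptions, adding image tokens raises the effective data term, so the same target loss is reached with fewer parameters. Whether real multimodal models behave this way is the open question the hypothesis poses.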
Taxonomy
Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation
