Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model

Ben Koska; Mojmír Horváth

arXiv:2411.05903 · cs.LG · November 12, 2024

Open Access

TL;DR

This paper introduces a compact 4.5B parameter multi-modal language model capable of processing text, images, videos, and audio, achieving near state-of-the-art results across various tasks and benchmarks.

Contribution

It presents a novel multi-modal model that combines recent language modeling and multi-task learning techniques in a small size suitable for edge deployment.

Findings

01 Achieves near state-of-the-art performance on multiple benchmarks

02 Demonstrates versatility across diverse input modalities

03 Supports deployment for edge inference

Abstract

We present a novel 4.5B parameter small language model that can handle multiple input and output modalities, including text, images, videos, and audio. Despite its small size, the model achieves near state-of-the-art performance on a variety of tasks, demonstrating the potential of multi-modal models to tackle complex real-world problems. Our approach leverages recent advancements in language modeling and multi-task learning to create a versatile and high-performing model that can even be deployed for edge inference. Experimental results show the model's strong performance across multiple benchmarks, paving the way for further progress in multi-modal artificial intelligence.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling