LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
Jainaveen Sundaram, Ravi Iyer

TL;DR
This paper introduces LLaVaOLMoBitnet1B, a fully open-source ternary multimodal large language model capable of processing images and text, aiming to democratize AI with efficient, accessible multimodal capabilities.
Contribution
It presents the first ternary multimodal LLM that accepts image and text inputs, along with open-source training scripts to foster further research.
Findings
Successfully trained a ternary multimodal LLM with competitive performance.
Demonstrated the model's ability to handle image and text inputs coherently.
Provided open-source resources to accelerate multimodal AI research.
Abstract
Multimodal Large Language Models (MM-LLMs) have seen significant advancements in the last year, demonstrating impressive performance across tasks. However, to truly democratize AI, models must exhibit strong capabilities and be able to run efficiently on small compute footprints accessible to most. As part of this quest, we introduce LLaVaOLMoBitnet1B, the first ternary multimodal LLM capable of accepting image(s) + text inputs to produce coherent textual responses. The model is fully open-sourced along with training scripts to encourage further research in this space. This accompanying technical report highlights the training process, evaluation details, challenges associated with ternary models, and future opportunities. Link to the model: https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B
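The abstract does not say how weights are made ternary. As a rough illustration only, the sketch below assumes an absmean scheme of the kind described for BitNet b1.58-style models: divide the weights by their mean absolute value, round, and clip to {-1, 0, +1}. The helper names `ternarize` and `dequantize` are hypothetical and are not taken from the released training scripts.

```python
def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one shared scale (assumed absmean scheme)."""
    # Shared scale: mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, eps)  # guard against division by zero for all-zero weights
    # Normalize, round to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Approximate the original weights from ternary values and the shared scale."""
    return [t * scale for t in ternary]


weights = [0.9, -0.05, 0.4, -1.2]
ternary, scale = ternarize(weights)
print(ternary)  # [1, 0, 1, -1]
print(dequantize(ternary, scale))
```

Each weight then needs only one of three values plus a single shared scale, which is where the small compute and memory footprint mentioned in the abstract would come from under this assumption.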
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
