MoAI: Mixture of All Intelligence for Large Language and Vision Models
Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

TL;DR
MoAI integrates the outputs of external computer vision models into a large language and vision model, improving real-world scene understanding without increasing model size or requiring additional visual instruction tuning datasets.
Contribution
The paper presents MoAI, a new large language and vision model (LLVM) that leverages auxiliary visual information from external computer vision (CV) models through two modules, improving scene understanding in vision language (VL) tasks.
Findings
MoAI outperforms existing LLVMs in zero-shot VL tasks.
MoAI enhances real-world scene understanding without enlarging models.
MoAI does not require additional visual instruction tuning datasets.
Abstract
The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
