Megrez-Omni Technical Report
Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Shengen Yan, Guohao Dai, Yu Wang

TL;DR
This paper introduces the Megrez models, a language model and a multimodal model, optimized for fast, accurate, and robust edge-side AI applications across text, image, and audio modalities.
Contribution
The paper presents the Megrez-3B-Omni multimodal model, achieving state-of-the-art accuracy and robustness for on-device AI across multiple modalities, with a focus on software-hardware co-design.
Findings
Megrez-3B-Omni achieves state-of-the-art multimodal accuracy.
The models are optimized for fast inference and edge deployment.
Megrez-3B-Omni demonstrates versatility across text, image, and audio analysis.
Abstract
In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni). These models are designed to deliver fast inference, compactness, and robust edge-side intelligence through a software-hardware co-design approach. Megrez-3B-Instruct offers several advantages, including high accuracy, high speed, ease of use, and a wide range of applications. Building on Megrez-3B-Instruct, Megrez-3B-Omni is an on-device multimodal understanding LLM that supports image, text, and audio analysis. It achieves state-of-the-art accuracy across all three modalities and demonstrates strong versatility and robustness, setting a new benchmark for multimodal AI models.
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech Recognition and Synthesis · Natural Language Processing Techniques
