Baichuan-Omni-1.5 Technical Report
Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li

TL;DR
Baichuan-Omni-1.5 is an omni-modal model that combines omni-modal understanding with end-to-end audio generation. It is trained on about 500B high-quality multimodal data (text, audio, and vision) and is reported to outperform many existing models on omni-modal tasks.
Contribution
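The paper introduces a comprehensive data cleaning and synthesis pipeline for multimodal data, an audio tokenizer (Baichuan-Audio-Tokenizer) that captures both semantic and acoustic information, and a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning.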
Findings
Achieves comparable performance to leading models on multimodal benchmarks.
Outperforms contemporary models in omni-modal understanding and generation.
Trains on about 500B high-quality multimodal data spanning text, audio, and vision.
Abstract
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads…
Taxonomy
Topics: Advanced Sensor and Control Systems · Advanced Algorithms and Applications
