Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Zhifei Xie; Changqiao Wu

arXiv:2410.11190·eess.AS·November 6, 2024·3 cites

Open Access · 1 Repo · 4 Models

TL;DR

Mini-Omni2 is a multi-modal assistant model that integrates visual, auditory, and textual understanding, providing real-time voice responses and flexible interaction, advancing open-source multi-modal AI capabilities.

Contribution

The paper introduces Mini-Omni2, a unified multi-modal model trained with a three-stage process and supporting command-based interaction, which aims to reproduce GPT-4o's functionality using open-source components.

Findings

01 Maintains performance across visual and auditory modalities

02 Enables real-time voice responses to multi-modal queries

03 Supports flexible user interaction through command-based interruption

Abstract

GPT-4o, an all-encompassing model, represents a milestone in the development of large multi-modal language models. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. Models from the open-source community often achieve some functionalities of GPT-4o, such as visual understanding and voice chat. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains performance in individual modalities. We propose a three-stage training process to align modalities, allowing the…

Code & Models

Repositories

gpt-omni/mini-omni2 (PyTorch, official)

Taxonomy

Topics: Algorithms and Data Compression · Computational Physics and Python Applications

Methods: ALIGN