Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Alan Dao (Gia Tuan Dao); Dinh Bach Vu; Huy Hoang Ha

arXiv:2410.15316 · cs.CL · April 7, 2025

TL;DR

Ichigo is a mixed-modal model that integrates speech and text processing through early fusion, enabling low-latency, real-time reasoning and generation across both modalities.

Contribution

The paper introduces a tokenized early-fusion approach in which a single transformer architecture handles both speech and text, enabling joint reasoning without separate adapters.
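
As a rough illustration of what a tokenized early-fusion input could look like, the sketch below maps speech codes and text tokens into one shared id space, so a single model sees one interleaved sequence. It assumes speech has already been quantized into integer codes. The vocabulary sizes, the marker tokens, and the `build_sequence` helper are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical sizes, chosen only for illustration (not from the paper).
TEXT_VOCAB_SIZE = 1000    # ids 0..999 are ordinary text tokens
NUM_SPEECH_CODES = 512    # a quantizer is assumed to emit codes 0..511

# Hypothetical marker tokens placed after the speech-code id range.
SOUND_START = TEXT_VOCAB_SIZE + NUM_SPEECH_CODES
SOUND_END = SOUND_START + 1
UNIFIED_VOCAB_SIZE = SOUND_END + 1

Segment = Tuple[str, List[int]]  # ("text" | "speech", token list)


def speech_code_to_id(code: int) -> int:
    """Shift a speech code into the id range that follows the text vocabulary."""
    if not 0 <= code < NUM_SPEECH_CODES:
        raise ValueError(f"speech code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def build_sequence(segments: List[Segment]) -> List[int]:
    """Flatten interleaved text and speech segments into one id sequence."""
    ids: List[int] = []
    for modality, tokens in segments:
        if modality == "text":
            for tok in tokens:
                if not 0 <= tok < TEXT_VOCAB_SIZE:
                    raise ValueError(f"text token out of range: {tok}")
            ids.extend(tokens)
        elif modality == "speech":
            # Wrap each speech span in markers so the boundaries are explicit.
            ids.append(SOUND_START)
            ids.extend(speech_code_to_id(c) for c in tokens)
            ids.append(SOUND_END)
        else:
            raise ValueError(f"unknown modality: {modality}")
    return ids


example = build_sequence([
    ("text", [5, 17]),
    ("speech", [0, 511]),
    ("text", [42]),
])
print(example)
```

Because every id falls inside one vocabulary, a single embedding table and output head can cover both modalities, which is one way to read "without separate adapters".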

Findings

01 Achieves state-of-the-art performance on speech question-answering benchmarks.

02 Demonstrates a latency of 111 ms to first-token generation.

03 Outperforms existing open-source speech language models.

Abstract

Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing…
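
The abstract does not say which quantizer is used, so the following is only a generic sketch of how continuous speech frames can be turned into discrete tokens: each frame is replaced by the index of its nearest entry in a codebook. The toy codebook, the two-dimensional frames, and the `quantize` function are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector."""
    codes: List[int] = []
    for frame in frames:
        nearest = min(
            range(len(codebook)),
            key=lambda i: squared_distance(frame, codebook[i]),
        )
        codes.append(nearest)
    return codes


# Toy data: a 3-entry codebook and four 2-dimensional "frames".
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
toy_frames = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7), (0.4, 0.4)]

codes = quantize(toy_frames, toy_codebook)
print(codes)
```

The resulting integer codes are the kind of discrete speech tokens that a transformer can consume alongside text tokens.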

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems