MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
Jingyao Gong

TL;DR
MiniMind-O is an open small-scale omni model capable of processing text, speech, and images, providing both text and speech outputs, with detailed datasets and design insights for small multimodal models.
Contribution
The paper introduces MiniMind-O, a novel 0.1B-scale omni model with multimodal capabilities, including a new sequence format and design choices for small-scale models.
Findings
Achieves a character error rate (CER) of around 0.09 in the Thinker-Talker consistency evaluation.
Provides a comprehensive dataset and code for multimodal training.
Identifies key scale-critical design choices for small omni models.
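For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between a reference and a hypothesis, divided by the reference length. The sketch below shows that standard definition only. How the report pairs references and hypotheses (for example, Thinker text against a transcript of the Talker's speech) is an assumption here, not something stated above, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One wrong character out of ten gives a CER of 0.1.
example = cer("abcdefghij", "abcdefghix")
```

Under this definition, a CER of around 0.09 corresponds to roughly nine character errors per hundred reference characters.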
Abstract
MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio training, making the complete interaction loop directly inspectable. The model uses a full MiniMind backbone as the Thinker and an independent four-layer Talker made from MiniMind blocks. Frozen SenseVoice-Small and SigLIP2 encoders provide speech and image features, which are mapped by lightweight MLP projectors and injected at modality-placeholder positions. The Talker reads a middle-layer Thinker state together with an autoregressive eight-layer Mimi-code buffer. Speaker control is handled by a dedicated speaker token, right-aligned reference codec prompts, and precomputed CAM++…
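The projector-and-placeholder step described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes a two-layer ReLU MLP and a single reserved placeholder token id; the names `mlp_project`, `inject_features`, and `PLACEHOLDER_ID`, and all sizes and weights, are hypothetical and are not taken from the MiniMind-O code.

```python
PLACEHOLDER_ID = 99  # hypothetical id marking a modality slot in the token sequence


def mlp_project(feat, w1, b1, w2, b2):
    """Two-layer MLP with ReLU; each weight matrix is a list of output rows."""
    hidden = [max(0.0, sum(x * w for x, w in zip(feat, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]


def inject_features(token_ids, embeddings, projected, placeholder_id=PLACEHOLDER_ID):
    """Replace the embedding at each placeholder position with a projected feature."""
    positions = [i for i, t in enumerate(token_ids) if t == placeholder_id]
    if len(positions) != len(projected):
        raise ValueError("number of placeholders must match number of features")
    out = [row[:] for row in embeddings]  # copy so the text embeddings stay intact
    for pos, vec in zip(positions, projected):
        out[pos] = list(vec)
    return out


# Toy setup: encoder features of size 2, hidden size 3, model embedding size 2.
w1, b1 = [[1, 0], [0, 1], [1, 1]], [0, 0, -1]
w2, b2 = [[1, 1, 1], [1, -1, 0]], [0, 0.5]

encoder_feats = [[1.0, -2.0], [0.0, 3.0]]  # stand-in for frozen encoder outputs
projected = [mlp_project(f, w1, b1, w2, b2) for f in encoder_feats]

token_ids = [5, PLACEHOLDER_ID, PLACEHOLDER_ID, 7]
embeddings = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.2, 0.2]]
mixed = inject_features(token_ids, embeddings, projected)
```

Only the placeholder rows change, so the text positions keep their original embeddings and the backbone sees one sequence containing both modalities. A real implementation would do the same thing with batched tensors rather than Python lists.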
