TL;DR
This paper introduces an end-to-end neural framework for decoding speech from neural activity, achieving state-of-the-art results and enabling cross-task generalization in brain-computer interfaces.
Contribution
The paper presents a cross-task, cross-species pretrained neural encoder integrated into an end-to-end speech decoding model, surpassing previous results on the Brain-to-Text benchmarks and substantially reducing word error rate.
Findings
Achieved a new state of the art on the Brain-to-Text '24 and '25 benchmarks.
Reduced word error rate from 24.69% to 10.22%.
Small-scale audio LLMs improve decoding performance.
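For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between the decoded sentence and the reference, divided by the number of reference words. The summary does not say how the paper scored its outputs, so the following is only a minimal sketch of that standard definition. The function names and example sentences are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed (hypothetical helper)."""
    return (before - after) / before


# Made-up example sentences: one dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# The reported drop from 24.69% to 10.22% expressed as a relative reduction.
reported_drop = relative_reduction(24.69, 10.22)
```

Under this definition, the reported change from 24.69% to 10.22% is an absolute drop of 14.47 points, which works out to a relative reduction of roughly 59%.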
Abstract
Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), which prevents joint optimization of all stages. Here, we introduce an end-to-end BraIn-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for…