Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals
Hui Zheng, Hai-Teng Wang, Wei-Bang Jiang, Zhong-Tao Chen, Li He, Pei-Yang Lin, Peng-Hu Wei, Guo-Guang Zhao, Yun-Zhe Liu

TL;DR
This paper introduces Du-IN, a novel brain decoding model that leverages region-level tokens and mask modeling to improve speech decoding from intracranial neural signals, achieving state-of-the-art results.
Contribution
The paper presents a new region-level token-based model with discrete codex-guided mask modeling for speech decoding from sEEG data, outperforming existing methods.
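As a rough illustration of what codex-guided mask modeling can look like, the sketch below assigns each token embedding to its nearest entry in a discrete codebook, hides a random subset of tokens, and keeps the code indices of the hidden tokens as prediction targets. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`nearest_code`, `make_masked_example`), the Euclidean nearest-neighbor lookup, the mask ratio, and the toy data are all hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def make_masked_example(
    tokens: Sequence[Vector],
    codebook: Sequence[Vector],
    mask_ratio: float,
    rng: random.Random,
) -> Tuple[List[Optional[Vector]], List[int], List[int]]:
    """Build one mask-modeling example (hypothetical helper).

    Returns the token sequence with masked positions set to None,
    the masked positions, and the discrete code index at each masked
    position, which a model would be trained to predict.
    """
    targets = [nearest_code(t, codebook) for t in tokens]
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    hidden = set(positions)
    inputs = [None if i in hidden else t for i, t in enumerate(tokens)]
    labels = [targets[i] for i in positions]
    return inputs, positions, labels


# Toy data: four 2-D token embeddings and a two-entry codebook (assumed values).
tokens = [[0.1, 0.0], [0.9, 1.0], [0.0, 0.2], [1.0, 0.8]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
inputs, positions, labels = make_masked_example(
    tokens, codebook, mask_ratio=0.5, rng=random.Random(0)
)
```

In a full pipeline the codebook would itself be learned and the masked positions would be filled with a learnable mask embedding rather than `None`; both are left out here to keep the idea visible.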
Findings
Achieved state-of-the-art 61-word classification accuracy.
Region-level temporal modeling with 1D depthwise convolution improves performance.
Self-supervised mask modeling significantly enhances speech decoding accuracy.
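For readers unfamiliar with the 1D depthwise convolution mentioned in the findings, the following is a minimal pure-Python sketch of the general operation: each channel is filtered along time with its own kernel, and channels are never mixed. It illustrates the generic technique only and makes no claim about the layer sizes, padding, or stride used in Du-IN; the function name `depthwise_conv1d` and the toy inputs are assumptions.

```python
from typing import List


def depthwise_conv1d(
    x: List[List[float]], kernels: List[List[float]]
) -> List[List[float]]:
    """Apply one 1D kernel per channel ("valid" padding, stride 1).

    x:       [channels][time] input signal
    kernels: [channels][kernel_size], exactly one filter per channel
    Unlike a standard convolution, no information flows between channels.
    """
    if len(x) != len(kernels):
        raise ValueError("need exactly one kernel per channel")
    out = []
    for signal, kernel in zip(x, kernels):
        k = len(kernel)
        # Cross-correlation (no kernel flip), as in most deep learning frameworks.
        row = [
            sum(signal[t + i] * kernel[i] for i in range(k))
            for t in range(len(signal) - k + 1)
        ]
        out.append(row)
    return out


# Two hypothetical channels, each with a different temporal filter.
x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
kernels = [[1.0, 1.0], [1.0, -1.0]]
y = depthwise_conv1d(x, kernels)
# y == [[3.0, 5.0, 7.0], [-1.0, 1.0, -1.0]]
```

Because every channel keeps its own filter, the parameter count grows with channels times kernel size rather than channels squared times kernel size, which is the usual motivation for choosing a depthwise layer.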
Abstract
Invasive brain-computer interfaces with Electrocorticography (ECoG) have shown promise for high-performance speech decoding in medical applications, but less damaging methods like intracranial stereo-electroencephalography (sEEG) remain underexplored. With rapid advances in representation learning, leveraging abundant recordings to enhance speech decoding is increasingly attractive. However, popular methods often pre-train temporal models based on brain-level tokens, overlooking that brain activities in different regions are highly desynchronized during tasks. Alternatively, they pre-train spatial-temporal models based on channel-level tokens but fail to evaluate them on challenging tasks like speech decoding, which requires intricate processing in specific language-related areas. To address this issue, we collected a well-annotated Chinese word-reading sEEG dataset targeting…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Robotics and Automated Systems
Methods: Attention Is All You Need · Depthwise Convolution · Convolution · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding
