TL;DR
AuriStream is a biologically inspired two-stage speech model that transforms raw audio into cochlear tokens and applies autoregressive prediction, achieving strong performance and interpretability in speech tasks.
Contribution
It introduces a novel cochlear token-based representation and autoregressive modeling framework inspired by human auditory processing.
Findings
Achieves competitive performance on diverse downstream SUPERB speech tasks and state-of-the-art lexical semantics.
Generates interpretable audio continuations visualized in spectrograms.
Learns meaningful phoneme and word representations.
Abstract
We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations and achieves state-of-the-art lexical semantics. AuriStream shows competitive performance on diverse downstream SUPERB speech tasks. Complementing its strong representational capabilities, AuriStream generates continuations of audio which can be visualized in a spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to…
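The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the DFT-based band energies stand in for a cochlear filterbank, the hand-made nearest-neighbor codebook stands in for whatever tokenizer the paper uses, and a bigram count table stands in for the autoregressive sequence model. All function names and parameters below are hypothetical.

```python
import math
from collections import Counter, defaultdict


def band_energies(frame, n_bands):
    """Toy time-frequency analysis: DFT magnitude at n_bands evenly spaced bins.
    ASSUMPTION: a stand-in for a cochlear filterbank, which the summary does not specify."""
    n = len(frame)
    energies = []
    for b in range(1, n_bands + 1):
        k = b * (n // 2) // (n_bands + 1)  # DFT bin index used for this band
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        energies.append(math.hypot(re, im) / n)
    return energies


def tokenize(waveform, frame_len, codebook, n_bands):
    """Stage 1 (toy): map each non-overlapping frame to its nearest codebook index."""
    tokens = []
    for start in range(0, len(waveform) - frame_len + 1, frame_len):
        feats = band_energies(waveform[start:start + frame_len], n_bands)
        dists = [sum((f - c) ** 2 for f, c in zip(feats, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def fit_bigram(tokens):
    """Stage 2 (toy): count next-token frequencies, a minimal autoregressive model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent successor of prev, or None if prev was never seen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Demo signal: frames alternate between two pure tones (DFT bins 6 and 19 of a 64-sample frame).
FRAME_LEN, N_BANDS = 64, 4
tone_a = [math.sin(2 * math.pi * 6 * t / FRAME_LEN) for t in range(FRAME_LEN)]
tone_b = [math.sin(2 * math.pi * 19 * t / FRAME_LEN) for t in range(FRAME_LEN)]
waveform = (tone_a + tone_b) * 3

# Hand-made codebook (hypothetical): one entry per tone, in band-energy space.
codebook = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.0]]

tokens = tokenize(waveform, FRAME_LEN, codebook, N_BANDS)
counts = fit_bigram(tokens)
print(tokens, predict_next(counts, tokens[-1]))
```

In this sketch the predicted token could be mapped back to its codebook vector and inspected in the time-frequency domain, which mirrors, in a very reduced form, the abstract's point that continuations can be visualized in a spectrogram space.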
Peer Reviews
Decision·Submitted to ICLR 2025
1. The model's biologically-inspired cochlear token framework is well-aligned with human auditory processing, making it a promising approach for more human-like and interpretable speech representations.
2. The authors validate the model’s versatility across various tasks (phoneme/word decoding, lexical semantics) and benchmarks (e.g., SUPERB), demonstrating its competitive performance and interpretability advantages over existing models.
1. The paper lacks formulas and a clear explanation of the cochlear representation, as well as a model architecture diagram. This makes it difficult to understand how the cochlear representation is converted into audio and how it compares to or provides advantages over the mel representation. A more thorough theoretical or visual explanation of the cochlear encoding process would enhance clarity.
2. The experimental results on linear probing performance for phonemes and words on the TIMIT data
- CochStream explores an audio representation generation method inspired by the human cochlea.
- CochStream was evaluated on the SUPERB benchmark for tasks such as speech recognition, intent classification, and speech separation.
- The model can visualize predictions in cochleagram form, offering interpretable insights into its speech representations.
- The baseline models seem somewhat weak; the authors should include newer and more powerful models such as Whisper. Additionally, CochStream claims to retain acoustic information, so it should be compared with models that aim at audio reconstruction, such as Encodec, DAC, and Soundstream.
- Based on the results in Tables 1 and 3, the performance improvement of CochStream over the baseline models appears limited. I would like the authors to further clarify the main advantages of Coch
The paper is well-written, with a clear and interesting motivation rooted in mimicking the human auditory system. The authors’ use of cochleograms as intermediate acoustic representations is compelling, and the performance of these biologically inspired features on downstream tasks is promising. The results show that hand-crafted, biologically motivated features can indeed achieve competitive performance, which could have significant implications for the design of speech processing systems.
While the paper achieves competitive performance on benchmark tasks, it relies on general-purpose, objective ML measures to validate the proposed features. This focus shifts away from the original biological motivation to a more standard ML performance evaluation. For researchers focused on ML, data-driven features generally remain more attractive due to better performance across tasks and the lack of a need for hand-crafting input features. To realign with the biological motivation, the paper w
