Audio Outperforms Text for Visual Decoding
Zhengdi Zhang, Hao Zhang, Wenjun Xia

TL;DR
This study demonstrates that auditory semantic representations outperform textual ones in zero-shot decoding of visual brain activity, highlighting the importance of the auditory modality in neural decoding and brain-computer interface development.
Contribution
Introduces the first framework comparing auditory and textual semantics in neural decoding, with a novel multimodal alignment model utilizing auditory representations.
Findings
The auditory modality yields higher decoding accuracy than the textual modality.
Auditory representations are more closely aligned with neural activity during visual processing than textual representations.
Auditory-based decoding is more computationally efficient than text-based decoding.
Abstract
Decoding visual semantic representations from human brain activity is a significant challenge. While recent zero-shot decoding approaches have improved performance by leveraging aligned image-text datasets, they overlook a fundamental aspect of human cognition: semantic understanding is inherently anchored in the auditory modality of speech, not text. To address this, our study introduces the first comparative framework for evaluating auditory versus textual semantic modalities in zero-shot visual neural decoding. We propose a novel brain-visual-auditory multimodal alignment model that directly utilizes auditory representations to encapsulate semantics, serving as a substitute for traditional textual descriptors. Our experimental results demonstrate that the auditory modality not only surpasses the textual modality in decoding accuracy but also achieves higher computational efficiency.…
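The zero-shot decoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' model. It assumes that brain activity and per-class auditory representations have already been projected into a shared embedding space, and it decodes each trial by picking the class whose auditory embedding has the highest cosine similarity. The function names (`l2_normalize`, `zero_shot_decode`), the array shapes, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_decode(brain_emb, class_emb):
    """Assign each brain embedding to its most similar class embedding.

    brain_emb: (n_trials, d) embeddings from a hypothetical brain encoder.
    class_emb: (n_classes, d) semantic embeddings, here standing in for
               auditory representations of each class label.
    Returns the predicted class indices and the full similarity matrix.
    """
    sims = l2_normalize(brain_emb) @ l2_normalize(class_emb).T
    return sims.argmax(axis=1), sims


# Synthetic stand-in data (assumption: no real recordings or audio features).
rng = np.random.default_rng(0)
n_classes, dim = 5, 16
audio_class_emb = rng.normal(size=(n_classes, dim))

# Simulate well-aligned brain embeddings: class prototype plus small noise.
true_labels = np.array([0, 3, 1, 4, 2, 3])
brain_emb = audio_class_emb[true_labels] + 0.05 * rng.normal(size=(6, dim))

pred, sims = zero_shot_decode(brain_emb, audio_class_emb)
accuracy = float((pred == true_labels).mean())
print("predicted:", pred.tolist(), "accuracy:", accuracy)
```

In this setup, comparing modalities would amount to swapping `audio_class_emb` for text-derived class embeddings and comparing the resulting accuracies. How the paper actually trains the alignment and extracts auditory features is not specified in this summary.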
Taxonomy
Topics: Face Recognition and Perception · Multimodal Machine Learning Applications · EEG and Brain-Computer Interfaces
