Audio Outperforms Text for Visual Decoding

Zhengdi Zhang; Hao Zhang; Wenjun Xia

arXiv:2601.13866·q-bio.NC·January 21, 2026

Open Access

TL;DR

This study demonstrates that auditory semantic representations outperform textual ones in zero-shot decoding of visual brain activity, highlighting the importance of the auditory modality for neural decoding and brain-computer interface development.

Contribution

Introduces the first framework for comparing auditory and textual semantics in zero-shot visual neural decoding, together with a novel brain-visual-auditory multimodal alignment model that uses auditory representations in place of textual descriptors.

Findings

01. Auditory modality yields higher decoding accuracy than textual modality.

02. Auditory representations are more aligned with neural activity during visual processing.

03. Auditory-based decoding is more computationally efficient.

Abstract

Decoding visual semantic representations from human brain activity is a significant challenge. While recent zero-shot decoding approaches have improved performance by leveraging aligned image-text datasets, they overlook a fundamental aspect of human cognition: semantic understanding is inherently anchored in the auditory modality of speech, not text. To address this, our study introduces the first comparative framework for evaluating auditory versus textual semantic modalities in zero-shot visual neural decoding. We propose a novel brain-visual-auditory multimodal alignment model that directly utilizes auditory representations to encapsulate semantics, serving as a substitute for traditional textual descriptors. Our experimental results demonstrate that the auditory modality not only surpasses the textual modality in decoding accuracy but also achieves higher computational efficiency.…
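The zero-shot decoding idea in the abstract can be illustrated with a minimal sketch. It assumes that brain activity has already been projected into a shared embedding space and that each candidate class has a semantic embedding (an audio embedding of the spoken class name in the auditory case, a text embedding in the textual case). Decoding then reduces to picking the class whose embedding is most similar to the projected brain vector. The function names, the use of cosine similarity, and all vector values are illustrative assumptions, not the authors' model or data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_decode(brain_embedding, class_embeddings):
    """Return the label whose semantic embedding best matches the brain embedding.

    class_embeddings maps a label to its semantic vector. In this sketch the
    vectors stand in for audio embeddings of spoken class names (an assumption).
    """
    return max(
        class_embeddings,
        key=lambda label: cosine(brain_embedding, class_embeddings[label]),
    )

# Toy, made-up vectors; none of these values come from the paper.
audio_embeddings = {
    "dog": [0.9, 0.1, 0.0],
    "car": [0.0, 0.8, 0.6],
    "tree": [0.1, 0.0, 0.95],
}

# Hypothetical output of a brain encoder for one visual trial.
projected_brain = [0.7, 0.2, 0.1]

predicted = zero_shot_decode(projected_brain, audio_embeddings)
print(predicted)  # prints "dog"
```

Because the candidate classes enter only through their embeddings, swapping the auditory dictionary for a textual one leaves the decoding step unchanged, which is what makes a like-for-like comparison of the two modalities possible in a setup of this kind.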

Taxonomy

Topics: Face Recognition and Perception · Multimodal Machine Learning Applications · EEG and Brain-Computer Interfaces