Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers
Francisco Portillo López

TL;DR
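This paper compares AV-HuBERT's responses to incongruent audiovisual speech (McGurk stimuli) with those of 44 human observers. The model matches human auditory dominance rates but fuses more often and shows none of the response variability seen in humans, which points to both strengths and limitations of current self-supervised models.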
Contribution
It benchmarks AV-HuBERT's responses to incongruent audiovisual stimuli against those of human observers (N=44), revealing both similarities to and differences from human perception.
Findings
AV-HuBERT and human observers show nearly identical auditory dominance rates (32.0% vs. 31.8%).
AV-HuBERT exhibits a deterministic bias toward phonetic fusion (68.0%, vs. 47.7% in humans).
Humans show perceptual stochasticity and diverse error profiles.
Abstract
This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.
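One rough way to put a number on the contrast between "strictly categorical" and "diverse error profiles" is the Shannon entropy of each group's aggregate response distribution. The sketch below is illustrative only and is not the paper's analysis. It uses the four rates reported in the abstract and assumes (this is not stated in the source) that whatever share is neither auditory nor fusion forms a single "other" category; the names `shannon_entropy_bits` and `with_other` are hypothetical. It captures only aggregate spread, not trial-to-trial variability.

```python
import math


def shannon_entropy_bits(probs):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def with_other(rates):
    """Return a copy of `rates` with the unassigned share added as 'other'.

    ASSUMPTION: the remaining share is treated as one residual category.
    The abstract does not break down non-auditory, non-fusion responses.
    """
    filled = dict(rates)
    # Round to suppress floating-point dust, clamp at zero for safety.
    filled["other"] = max(0.0, round(1.0 - sum(rates.values()), 6))
    return filled


# Rates reported in the abstract, expressed as proportions.
model = with_other({"auditory": 0.320, "fusion": 0.680})
human = with_other({"auditory": 0.318, "fusion": 0.477})

h_model = shannon_entropy_bits(model.values())
h_human = shannon_entropy_bits(human.values())

print(f"model: {model}  entropy = {h_model:.3f} bits")
print(f"human: {human}  entropy = {h_human:.3f} bits")
```

Under these assumptions the human distribution spreads over three categories and the model's over two, so the human entropy comes out higher. Splitting "other" into finer response types would only widen that gap.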
Taxonomy
Topics: Multisensory perception and integration · Tactile and Sensory Interactions · Neuroscience and Music Perception
