Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms
Tyler Loakman, Joseph James, Chenghua Lin

TL;DR
This paper benchmarks vision-language models (VLMs) on interpreting spectrograms and waveforms of speech, finding that they perform poorly and that the task demands specialized phonetic knowledge.
Contribution
It introduces a novel dataset and evaluation framework to assess VLMs' capacity to interpret speech spectrograms and waveforms, a task they perform poorly on without specific training.
Findings
VLMs rarely outperform chance in spectrogram interpretation
Zero-shot and finetuned models struggle without specialized knowledge
Results highlight a gap between general VLM capabilities and the phonetic expertise needed to read spectrograms
Abstract
With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models…
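The distractor selection described in the abstract can be illustrated with a minimal sketch. It assumes that the three distractors are the words closest to the target in phonemic edit distance; the abstract says only that distractors are "selected based on" that distance, so the nearest-neighbour rule, the toy lexicon, the ARPAbet-style phoneme symbols, and the function names below are all hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance computed over phoneme symbols rather than characters."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_distractors(target: str, lexicon: Dict[str, List[str]], k: int = 3) -> List[str]:
    """Return the k words nearest to `target` by phonemic edit distance.

    Assumption: "nearest" is our reading; the paper may use another rule.
    Ties are broken alphabetically so the result is deterministic.
    """
    target_phones = lexicon[target]
    candidates = [word for word in lexicon if word != target]
    candidates.sort(key=lambda w: (phoneme_edit_distance(target_phones, lexicon[w]), w))
    return candidates[:k]


# Hypothetical toy lexicon with ARPAbet-style transcriptions (illustration only).
TOY_LEXICON: Dict[str, List[str]] = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "bat": ["B", "AE", "T"],
    "cut": ["K", "AH", "T"],
    "dog": ["D", "AO", "G"],
}

options = ["cat"] + pick_distractors("cat", TOY_LEXICON)
chance_accuracy = 1 / len(options)  # four options give a 25% chance baseline
```

With one correct answer and three distractors, random guessing yields 25% accuracy, which is the baseline against which the "rarely outperform chance" finding should be read.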
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Language and cultural evolution
