A Preliminary Exploration with GPT-4o Voice Mode
Yu-Xiang Lin, Chih-Kai Yang, Wei-Chih Chen, Chen-An Li, Chien-yu Huang, Xuanjun Chen, Hung-yi Lee

TL;DR
This paper evaluates GPT-4o's audio processing and reasoning capabilities, highlighting its strengths in speech and music understanding, its robustness against hallucinations, its weaknesses on tasks such as audio duration prediction and instrument classification, and its safety-related refusals.
Contribution
It provides a preliminary assessment of GPT-4o's multimodal audio processing abilities and safety mechanisms, revealing its strengths and weaknesses across various tasks.
Findings
Strong performance in speech, music, and multilingual tasks
Greater robustness against hallucinations compared to other models
Struggles with audio duration prediction and instrument classification
Abstract
With the rise of multimodal large language models, GPT-4o stands out as a pioneering model, driving us to evaluate its capabilities. This report assesses GPT-4o across various tasks to analyze its audio processing and reasoning abilities. We find that GPT-4o exhibits strong knowledge in audio, speech, and music understanding, performing well in tasks like intent classification, spoken command classification, semantic and grammatical reasoning, multilingual speech recognition, and singing analysis. It also shows greater robustness against hallucinations than other large audio-language models (LALMs). However, it struggles with tasks such as audio duration prediction and instrument classification. Additionally, GPT-4o's safety mechanisms cause it to decline tasks like speaker identification, age classification, MOS prediction, and audio deepfake detection. Notably, the model exhibits a…
Taxonomy
Topics: Computational Physics and Python Applications
