A Preliminary Exploration with GPT-4o Voice Mode

Yu-Xiang Lin; Chih-Kai Yang; Wei-Chih Chen; Chen-An Li; Chien-yu Huang; Xuanjun Chen; Hung-yi Lee

arXiv:2502.09940·cs.CL·February 17, 2025

TL;DR

This paper evaluates GPT-4o's audio and reasoning capabilities, highlighting its strengths in understanding speech and music, robustness against hallucinations, and limitations in certain tasks and safety-related refusals.

Contribution

It provides a preliminary assessment of GPT-4o's multimodal audio processing abilities and safety mechanisms, revealing its strengths and weaknesses across various tasks.

Findings

1. Strong performance in speech, music, and multilingual tasks

2. Greater robustness against hallucinations compared to other models

3. Struggles with audio duration prediction and instrument classification

Abstract

With the rise of multimodal large language models, GPT-4o stands out as a pioneering model, driving us to evaluate its capabilities. This report assesses GPT-4o across various tasks to analyze its audio processing and reasoning abilities. We find that GPT-4o exhibits strong knowledge in audio, speech, and music understanding, performing well in tasks like intent classification, spoken command classification, semantic and grammatical reasoning, multilingual speech recognition, and singing analysis. It also shows greater robustness against hallucinations than other large audio-language models (LALMs). However, it struggles with tasks such as audio duration prediction and instrument classification. Additionally, GPT-4o's safety mechanisms cause it to decline tasks like speaker identification, age classification, MOS prediction, and audio deepfake detection. Notably, the model exhibits a…

Taxonomy

Topics: Computational Physics and Python Applications