JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
Leying Zhang, Bowen Shi, Haibin Wu, Bach Viet Do, Yanmin Qian

TL;DR
JASTIN is a versatile, instruction-based audio evaluation framework that leverages a frozen audio encoder and a fine-tuned LLM to achieve state-of-the-art correlation with human ratings across diverse audio domains in a zero-shot setting.
Contribution
The paper introduces JASTIN, a novel instruction-driven audio evaluation method that generalizes well across domains without task-specific retraining.
Findings
JASTIN achieves state-of-the-art correlation with human ratings.
It outperforms general multimodal LLMs across various audio tasks.
JASTIN maintains zero-shot generalization across out-of-domain data.
Abstract
The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional flexibility. To address these bottlenecks, we propose JASTIN, a generalizable, instruction-driven audio evaluation framework that formulates audio assessment as a self-instructed reasoning task. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone via a trainable audio adapter. To ensure robust zero-shot generalization, we introduce a comprehensive instruction-following data preparation pipeline, incorporating Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. Experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman…
