Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

Guojian Li; Zhixian Zhao; Zhennan Lin; Jingbin Hu; Qirui Zhan; Yuang Cao; Pengyuan Xie; Chuan Xie; Jie Liu; Qiang Zhang; Zhonghua Fu; Lei Xie

arXiv:2605.12036 · eess.AS · May 13, 2026

TL;DR

This paper introduces a new data pipeline, benchmark, and model to enhance fine-grained, multi-dimensional speech understanding, addressing current limitations in perception and modeling of complex acoustic features.

Contribution

It presents a high-quality spontaneous speech corpus, a comprehensive benchmark for 14 speech attributes, and a novel model with curriculum fine-tuning to improve multi-dimensional speech perception.

Findings

01. Current speech LLMs need significant improvement in multi-dimensional understanding.

02. FM-Speech outperforms existing open-source models in fine-grained perception.

03. The new benchmark reveals gaps in current models' capabilities.

Abstract

While speech Large Language Models (LLMs) excel at conventional tasks like basic speech recognition, they lack fine-grained, multi-dimensional perception. This deficiency is evident in their struggle to disentangle complex features like micro-acoustic cues, acoustic scenes, and paralinguistic signals. The resulting incomplete comprehension of real-world speech fundamentally bottlenecks the development of perceptive and empathetic next-generation speech systems. This perceptual limitation stems primarily from three interacting factors: scarce high-quality expressive data, absent fine-grained modeling of multi-dimensional attributes, and reliance on coarse-grained benchmarks with restricted coverage. We address these challenges through three pillars. First, our robust data curation pipeline resolves complex acoustic environments and long-audio timestamp alignment…
