Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

Yifan Yang; Bing Han; Hui Wang; Wei Wang; Ziyang Ma; Long Zhou; Zengrui Jin; Guanrou Yang; Tianrui Wang; Xu Tan; Xie Chen

arXiv:2601.03065 · eess.AS · April 21, 2026

TL;DR

This paper introduces FCaps, a dataset of 47k hours of speech with 19M fine-grained style captions, and CLSP, a contrastive pre-trained model that learns multi-granular speech-text representations for tasks such as speech-text retrieval and classification.

Contribution

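The paper contributes a dataset whose detailed style annotations come from an end-to-end pipeline that grounds captions directly in audio rather than relying on LLM-based rewriting, and a contrastive pre-training model that captures fine-grained and multi-granular speech-text relationships.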

Findings

01. CLSP achieves reliable performance in speech-text retrieval and classification.

02. FCaps annotations surpass existing cascaded annotations in correctness, coverage, and naturalness under LLM-as-a-judge evaluation.

03. CLSP aligns well with human judgments across multiple tasks.

Abstract

Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We present FCaps, a large-scale dataset with fine-grained free-text style descriptions, encompassing 47k hours of speech and 19M fine-grained captions annotated via a novel end-to-end pipeline that directly grounds detailed captions in audio, thereby avoiding the error propagation caused by LLM-based rewriting in existing cascaded pipelines. Evaluations using LLM-as-a-judge demonstrate that our annotations surpass existing cascaded annotations in terms of correctness, coverage, and naturalness. Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained…

Code & Models

Repositories

yfyeung/CLSP (GitHub)

Models

yfyeung/CLSP (Hugging Face model, 2.2k downloads, 3 likes)

Datasets

yfyeung/FCaps (dataset, 48 downloads)
