AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder

Samir Sadok; Simon Leglaive; Laurent Girin; Gaël Richard; Xavier Alameda-Pineda

arXiv:2501.05332·cs.SD·January 10, 2025

TL;DR

AnCoGen is a unified masked autoencoder model that analyzes, controls, and generates speech by estimating key attributes and enabling precise modifications, demonstrated through various speech processing tasks.

Contribution

It introduces a novel masked autoencoder framework that unifies speech analysis, control, and generation in a single model, enabling versatile speech processing capabilities.

Findings

01 Effective in speech analysis-resynthesis

02 Accurate pitch estimation and modification

03 Improves speech enhancement tasks

Abstract

This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allow precise control of the synthesized speech by modifying them. Extensive experiments demonstrated the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement.
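One way to picture how a single masked autoencoder could serve analysis, control, and generation is to treat the speech representation and the attributes as two groups of tokens in one sequence and to change which group is masked. The toy sketch below illustrates that idea only; it is not the authors' implementation. The token layout, the mode names, and the helpers `build_mask` and `apply_mask` are all hypothetical.

```python
from typing import List

def build_mask(n_speech: int, n_attr: int, mode: str) -> List[bool]:
    """Per-token mask over the sequence [speech tokens | attribute tokens].

    True means the token is hidden and must be predicted by the model.
    The two mode names are assumptions made for this illustration.
    """
    if mode == "analysis":       # observe speech, predict attributes
        return [False] * n_speech + [True] * n_attr
    if mode == "generation":     # observe attributes, predict speech
        return [True] * n_speech + [False] * n_attr
    raise ValueError(f"unknown mode: {mode!r}")

def apply_mask(tokens: List[str], mask: List[bool],
               mask_token: str = "<MASK>") -> List[str]:
    """Replace every masked position with a placeholder token."""
    return [mask_token if hidden else tok for tok, hidden in zip(tokens, mask)]

# Toy sequence: two speech tokens followed by two attribute tokens.
tokens = ["s0", "s1", "pitch", "loudness"]

analysis_input = apply_mask(tokens, build_mask(2, 2, "analysis"))
generation_input = apply_mask(tokens, build_mask(2, 2, "generation"))

print(analysis_input)    # ['s0', 's1', '<MASK>', '<MASK>']
print(generation_input)  # ['<MASK>', '<MASK>', 'pitch', 'loudness']
```

Under this reading, control would amount to running the analysis direction, editing one of the estimated attributes (for example the pitch), and then running the generation direction on the edited attributes.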

Code & Models

Repositories

samsad35/code-ancogen
PyTorch · Official
