SongTrans: An unified song transcription and alignment method for lyrics and notes
Siwei Wu, Jinzheng He, Ruibin Yuan, Haojie Wei, Xipin Wei, Chenghua Lin, Jin Xu, Junyang Lin

TL;DR
SongTrans is a unified model that simultaneously transcribes lyrics and notes from songs and aligns them, eliminating the need for pre-processing and improving efficiency in singing voice synthesis tasks.
Contribution
The paper introduces SongTrans, the first model capable of joint lyric and note transcription with alignment, trained on annotated data and optimized for real-world song diversity.
Findings
Achieves state-of-the-art results in lyric and note transcription.
First model to effectively align lyrics with notes.
Demonstrates versatility across different song types.
Abstract
The quantity of processed data is crucial for advancing the field of singing voice synthesis. While tools are available for lyric or note transcription, they all require pre-processed data, and the pre-processing itself (e.g., vocal and accompaniment separation) is relatively time-consuming. In addition, most of these tools are designed to address a single task and struggle with aligning lyrics and notes (i.e., identifying the notes that correspond to each word in the lyrics). To address these challenges, we first design a pipeline by optimizing existing tools and annotating numerous lyric-note pairs of songs. Then, based on the annotated data, we train a unified SongTrans model that can directly transcribe lyrics and notes while aligning them simultaneously, without requiring the songs to be pre-processed. Our SongTrans model consists of two modules: (1) the Autoregressive module predicts the lyrics, along with…
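To make the alignment task in the abstract concrete (identifying the notes that correspond to each word), here is a minimal Python sketch of what a word-to-notes alignment could look like as data. It is not the SongTrans method: the `Note` and `AlignedWord` types, the MIDI pitch representation, and the onset-containment heuristic in `align_notes_to_words` are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Note:
    pitch: int      # MIDI pitch number (assumed representation)
    onset: float    # note start time in seconds
    offset: float   # note end time in seconds


@dataclass
class AlignedWord:
    word: str
    notes: List[Note] = field(default_factory=list)


def align_notes_to_words(
    words: List[Tuple[str, float, float]],  # (text, start, end) in seconds
    notes: List[Note],
) -> List[AlignedWord]:
    """Naive heuristic: attach each note to the word whose time span
    contains the note's onset. Notes outside every word span are dropped."""
    aligned = [AlignedWord(text) for text, _, _ in words]
    for note in notes:
        for slot, (_, start, end) in zip(aligned, words):
            if start <= note.onset < end:
                slot.notes.append(note)
                break
    return aligned


# Hypothetical example: the word "shine" is sung over two notes, "on" over one.
words = [("shine", 0.0, 1.0), ("on", 1.0, 1.5)]
notes = [Note(60, 0.0, 0.5), Note(62, 0.5, 1.0), Note(64, 1.0, 1.5)]
result = align_notes_to_words(words, notes)
```

A single word can map to several notes (a melisma), which is why each `AlignedWord` holds a list of notes and not just one.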
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis
