A Survey on Audio Synthesis and Audio-Visual Multimodal Processing
Zhaofeng Shi

TL;DR
This survey reviews recent advances in audio synthesis and audio-visual multimodal processing, covering techniques like TTS and music generation, and discusses future research directions in these rapidly evolving fields.
Contribution
It provides a comprehensive classification and analysis of current methods in audio synthesis and multimodal processing, highlighting future development trends.
Findings
Classification of technical methods in audio synthesis and multimodal processing
Analysis of current research trends and future directions
Guidance for researchers in related fields
Abstract
With the development of deep learning and artificial intelligence, audio synthesis has come to play a pivotal role in machine learning and shows strong applicability in industry. Meanwhile, researchers have dedicated significant effort to multimodal tasks such as audio-visual multimodal processing. In this paper, we survey audio synthesis and audio-visual multimodal processing to help readers understand current research and future trends. This review focuses on text-to-speech (TTS), music generation, and some tasks that combine visual and acoustic information. The corresponding technical methods are comprehensively classified and introduced, and their future development trends are discussed. This survey can provide some guidance for researchers interested in areas such as audio synthesis and audio-visual multimodal processing.
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
