Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective
Chen Chen, Xiaolou Li, Zehua Liu, Lantian Li, Dong Wang

TL;DR
This paper applies information theory to quantitatively analyze audio-visual speech tasks, revealing insights into the challenges and benefits of modality integration in spoken language processing.
Contribution
The paper introduces a novel information-theoretic framework for analyzing audio-visual tasks, addressing the lack of theoretical analysis in this area.
Findings
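The information intersection between modalities helps explain why audio-visual processing tasks are difficult.
The same analysis helps in understanding the benefits that modality integration could bring.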
Analysis guides future audio-visual research.
Abstract
In the field of spoken language processing, audio-visual speech processing is receiving increasing research attention. Key components of this research include tasks such as lip reading, audio-visual speech recognition, and visual-to-speech synthesis. Although significant success has been achieved, theoretical analysis is still insufficient for audio-visual tasks. This paper presents a quantitative analysis based on information theory, focusing on information intersection between different modalities. Our results show that this analysis is valuable for understanding the difficulties of audio-visual processing tasks as well as the benefits that could be obtained by modality integration.
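As an illustration of what "information intersection" between modalities could mean, the sketch below treats it as the mutual information I(A;V) = H(A) + H(V) - H(A,V) between an audio variable A and a visual variable V. This reading is an assumption; the abstract does not give the paper's exact definitions. The toy joint distribution is hypothetical and invented purely for illustration: four equally likely audio classes, where each pair of classes looks identical on the lips and so maps to the same visual class.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """Mutual information I(A;V) in bits from a joint table joint[a][v]."""
    p_a = [sum(row) for row in joint]        # marginal over audio classes
    p_v = [sum(col) for col in zip(*joint)]  # marginal over visual classes
    h_joint = entropy([p for row in joint for p in row])
    return entropy(p_a) + entropy(p_v) - h_joint


# Hypothetical toy data (not from the paper): rows are audio classes,
# columns are visual classes. Audio classes 0 and 1 share visual class 0,
# audio classes 2 and 3 share visual class 1.
toy_joint = [
    [0.25, 0.00],
    [0.25, 0.00],
    [0.00, 0.25],
    [0.00, 0.25],
]

h_audio = entropy([sum(row) for row in toy_joint])
shared = mutual_information(toy_joint)
print(f"H(A) = {h_audio:.2f} bits, I(A;V) = {shared:.2f} bits")
```

In this toy setting the visual stream carries only half of the information in the audio stream, which is one way to picture why a visual-only task such as lip reading is harder than audio-based recognition. Real measurements would need estimates from data rather than a hand-written table.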
Taxonomy
Topics: Music Technology and Sound Studies
