TL;DR
This paper introduces a method that combines an audio track with an extremely low-resolution face image (e.g., 8x8 pixels) to perform 16x face super-resolution, recovering facial details and attributes such as gender and age without human annotation.
Contribution
The approach builds separate latent representations of a face from the audio track and from the low-resolution image, then trains a network to fuse them. This improves extreme face super-resolution and enables attribute recovery without manual annotations.
Findings
Audio aids in recovering facial attributes like gender and age.
The model can generate realistic faces when the audio track and the low-resolution image come from different videos.
The approach does not require human annotation and can be trained on existing video datasets.
Abstract
We propose a novel method to use both audio and a low-resolution image to perform extreme face super-resolution (a 16x increase of the input size). When the resolution of the input image is very low (e.g., 8x8 pixels), the loss of information is so dire that important details of the original identity have been lost and audio can aid the recovery of a plausible high-resolution image. In fact, audio carries information about facial attributes, such as gender and age. To combine the aural and visual modalities, we propose a method to first build the latent representations of a face from the lone audio track and then from the lone low-resolution image. We then train a network to fuse these two representations. We show experimentally that audio can assist in recovering attributes such as the gender, the age and the identity, and thus improve the correctness of the high-resolution image…
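The pipeline described in the abstract (encode the audio, encode the low-resolution image, then fuse the two latent codes) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the random linear maps stand in for trained networks, and `LATENT_DIM`, `AUDIO_FEATURES`, `make_linear`, and `fused_latent` are hypothetical names and sizes chosen only for the example. It also assumes the 16x factor applies to each side of the image.

```python
import random

LATENT_DIM = 8        # hypothetical latent size, for illustration only
AUDIO_FEATURES = 16   # hypothetical number of audio features
LOW_RES_SIDE = 8      # 8x8 input, as in the abstract's example


def upscaled_side(low_res_side: int, factor: int) -> int:
    """Side length after upscaling, assuming the factor applies per side."""
    return low_res_side * factor


def make_linear(in_dim: int, out_dim: int, rng: random.Random):
    """Return a random linear map that stands in for a trained network."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def apply(x):
        if len(x) != in_dim:
            raise ValueError(f"expected {in_dim} values, got {len(x)}")
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


rng = random.Random(0)
# Stand-ins for the two encoders and the fusion network (all assumptions).
encode_image = make_linear(LOW_RES_SIDE * LOW_RES_SIDE, LATENT_DIM, rng)
encode_audio = make_linear(AUDIO_FEATURES, LATENT_DIM, rng)
fuse = make_linear(2 * LATENT_DIM, LATENT_DIM, rng)


def fused_latent(pixels, audio_features):
    """Encode each modality separately, then fuse the two latent codes."""
    z_image = encode_image(pixels)
    z_audio = encode_audio(audio_features)
    # Concatenate the two codes and map them to a single fused code.
    return fuse(z_image + z_audio)


# The image and the audio may come from different videos.
pixels_from_video_a = [0.5] * (LOW_RES_SIDE * LOW_RES_SIDE)
audio_from_video_b = [0.1] * AUDIO_FEATURES
z = fused_latent(pixels_from_video_a, audio_from_video_b)
```

In a real system the fused code would be passed to a generator that produces the high-resolution face; that stage is omitted here because the abstract does not describe it.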
Videos
Learning to Have an Ear for Face Super-Resolution (YouTube)
