TL;DR
This paper introduces an audio super-resolution network that combines convolutional layers with self-attention. Self-attention captures the long-range dependencies that local convolutions miss, and using it in place of recurrent layers allows more parallelization and therefore faster training.
Contribution
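It proposes an architecture that integrates self-attention with a convolutional network, and introduces Attention-based Feature-Wise Linear Modulation (AFiLM), which uses self-attention rather than a recurrent neural network to modulate the activations of the convolutional model.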
Findings
Outperforms existing methods on standard benchmarks
Enables more parallelization and faster training
Effectively models long-range dependencies in audio sequences
Abstract
Convolutions operate only locally, thus failing to model global interactions. Self-attention is, however, able to learn representations that capture long-range dependencies in sequences. We propose a network architecture for audio super-resolution that combines convolution and self-attention. Attention-based Feature-Wise Linear Modulation (AFiLM) uses a self-attention mechanism instead of recurrent neural networks to modulate the activations of the convolutional model. Extensive experiments show that our model outperforms existing approaches on standard benchmarks. Moreover, it allows for more parallelization, resulting in significantly faster training.
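The modulation idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The block-wise max pooling, the single attention head, the purely multiplicative modulation, and all names (afilm_sketch, block_size, w_q, w_k, w_v) are assumptions made here for illustration; the abstract only states that self-attention, rather than a recurrent network, modulates the convolutional activations.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x (n, c)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v


def afilm_sketch(features, block_size, w_q, w_k, w_v):
    """Modulate conv activations of shape (t, c) using attention over block summaries.

    Assumptions (not taken from the source): one max-pooled summary per block,
    a single attention head, and multiplicative modulation only.
    """
    t, c = features.shape
    if t % block_size != 0:
        raise ValueError("sequence length must be divisible by block_size")
    n_blocks = t // block_size
    blocks = features.reshape(n_blocks, block_size, c)
    pooled = blocks.max(axis=1)                      # (n_blocks, c) block summaries
    weights = self_attention(pooled, w_q, w_k, w_v)  # every block attends to every block
    modulated = blocks * weights[:, None, :]         # broadcast one weight row per block
    return modulated.reshape(t, c)


# Toy input standing in for the output of a convolutional layer.
rng = np.random.default_rng(0)
t, c, block_size = 16, 4, 4
features = rng.standard_normal((t, c))
w_q, w_k, w_v = (rng.standard_normal((c, c)) * 0.1 for _ in range(3))

out = afilm_sketch(features, block_size, w_q, w_k, w_v)
print(out.shape)  # (16, 4)
```

Because the attention step handles all block summaries in a single matrix product, no block has to wait for the previous one to be processed, which is the kind of parallelism a step-by-step recurrent network does not offer.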
Taxonomy
Methods: 1x1 Convolution
