When Vision Models Meet Parameter Efficient Look-Aside Adapters Without Large-Scale Audio Pretraining
Juan Yeo, Jinkwan Jang, Kyubyung Chae, Seongkyu Mun, Taesup Kim

TL;DR
This paper introduces Look Aside Adapters (LoAA) that enable vision models to perform well on audio tasks without needing large-scale audio pretraining, by facilitating interactions across time and frequency dimensions.
Contribution
The paper proposes a novel adapter design, LoAA, allowing direct fine-tuning of vision models for audio understanding without pretraining on audio data.
Findings
LoAA enables vision models to match or outperform pretrained audio models.
Efficient adaptation reduces the need for extensive audio pretraining.
Adapters facilitate cross-dimensional interactions in audio spectrum data.
Abstract
Recent studies show that pretrained vision models can boost performance in audio downstream tasks. To enhance the performance further, an additional pretraining stage with large-scale audio data is typically required to infuse audio-specific knowledge into the vision model. However, such approaches require extensive audio data and a carefully designed objective function. In this work, we propose bypassing the pretraining stage by directly fine-tuning the vision model with our Look Aside Adapter (LoAA), designed for efficient audio understanding. Audio spectrum data is represented across two heterogeneous dimensions, time and frequency, and we refine adapters to facilitate interactions between tokens across these dimensions. Our experiments demonstrate that our adapters allow vision models to reach or surpass the performance of pretrained audio models in various audio and speech tasks,…
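To make the idea of token interaction along the time and frequency dimensions concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the adapter's internals, so the function name `look_aside_mix`, the fixed mixing kernel, the zero padding, and the omission of the learned bottleneck projections found in typical adapters are all assumptions made for illustration. The sketch only shows how each spectrogram token can be updated from its neighbours along one chosen axis, with a residual connection.

```python
from typing import List

# Hypothetical layout: tokens[freq][time][channel], one token per spectrogram patch.
Grid = List[List[List[float]]]


def look_aside_mix(tokens: Grid, axis: str, kernel: List[float]) -> Grid:
    """Residual 1D mixing of each token with its neighbours along one axis.

    axis="time" mixes tokens that share a frequency row;
    axis="freq" mixes tokens that share a time column.
    Neighbours outside the grid contribute nothing (zero padding).
    The kernel is fixed here; in a trained adapter it would be learned.
    """
    if axis not in ("time", "freq"):
        raise ValueError("axis must be 'time' or 'freq'")
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")

    n_freq, n_time, dim = len(tokens), len(tokens[0]), len(tokens[0][0])
    half = len(kernel) // 2
    out = [[[0.0] * dim for _ in range(n_time)] for _ in range(n_freq)]

    for f in range(n_freq):
        for t in range(n_time):
            # Weighted sum over neighbours along the chosen axis.
            for k, w in enumerate(kernel):
                off = k - half
                nf, nt = (f, t + off) if axis == "time" else (f + off, t)
                if 0 <= nf < n_freq and 0 <= nt < n_time:
                    for c in range(dim):
                        out[f][t][c] += w * tokens[nf][nt][c]
            # Residual connection: keep the original token.
            for c in range(dim):
                out[f][t][c] += tokens[f][t][c]
    return out


# Toy grid: 2 frequency bins x 3 time steps, 1 channel per token.
tokens = [[[1.0], [2.0], [3.0]],
          [[10.0], [20.0], [30.0]]]
kernel = [0.5, 0.0, 0.5]  # average of the two neighbours, centre weight 0

along_time = look_aside_mix(tokens, "time", kernel)
along_freq = look_aside_mix(tokens, "freq", kernel)
print(along_time)
print(along_freq)
```

Mixing along "time" leaves the two frequency rows independent of each other, while mixing along "freq" couples tokens across rows at the same time step. That separation is one simple way to treat the two heterogeneous axes differently.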
Taxonomy
Topics: Color Science and Applications · Advanced Vision and Imaging · Image Enhancement Techniques
Methods: Adapter
