When Vision Models Meet Parameter Efficient Look-Aside Adapters Without Large-Scale Audio Pretraining

Juan Yeo; Jinkwan Jang; Kyubyung Chae; Seongkyu Mun; Taesup Kim

arXiv:2412.05951·cs.SD·December 10, 2024

TL;DR

This paper introduces the Look Aside Adapter (LoAA), which lets vision models perform well on audio tasks without large-scale audio pretraining by facilitating interactions between tokens across the time and frequency dimensions.

Contribution

The paper proposes a novel adapter design, LoAA, that allows vision models to be fine-tuned directly for audio understanding, bypassing the additional pretraining stage on large-scale audio data.

Findings

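01. With LoAA, vision models reach or surpass the performance of pretrained audio models in various audio and speech tasks.

02. Directly fine-tuning the vision model with LoAA bypasses the large-scale audio pretraining stage.

03. The adapters facilitate interactions between tokens across the time and frequency dimensions of audio spectrum data.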

Abstract

Recent studies show that pretrained vision models can boost performance in audio downstream tasks. To enhance the performance further, an additional pretraining stage with large-scale audio data is typically required to infuse audio-specific knowledge into the vision model. However, such approaches require extensive audio data and a carefully designed objective function. In this work, we propose bypassing the pretraining stage by directly fine-tuning the vision model with our Look Aside Adapter (LoAA), designed for efficient audio understanding. Audio spectrum data is represented across two heterogeneous dimensions, time and frequency, and we refine adapters to facilitate interactions between tokens across these dimensions. Our experiments demonstrate that our adapters allow vision models to reach or surpass the performance of pretrained audio models in various audio and speech tasks,…
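
The abstract says the adapters let tokens interact across the time and frequency dimensions, but it does not give the layer design. The NumPy sketch below is one possible reading, not the authors' implementation: a bottleneck adapter whose hidden tokens are mixed with their neighbours along a single spectrogram axis. The function names (`look_aside_mix`, `adapter`), the row-major (frequency, time) token order, the zero padding, the shared 1D kernel, and the ReLU bottleneck are all assumptions made for illustration.

```python
import numpy as np


def look_aside_mix(tokens, n_freq, n_time, axis, kernel):
    """Mix each patch token with its neighbours along one spectrogram axis.

    Hypothetical helper, not taken from the paper.
    tokens: (n_freq * n_time, dim) array, row-major (frequency, time) order.
    axis:   "time" or "frequency".
    kernel: odd-length 1D weights, shared across all channels.
    """
    dim = tokens.shape[1]
    grid = np.asarray(tokens, dtype=float).reshape(n_freq, n_time, dim)
    ax = 1 if axis == "time" else 0

    # Zero-pad only the axis being mixed so the output keeps its shape.
    pad = len(kernel) // 2
    pad_width = [(0, 0), (0, 0), (0, 0)]
    pad_width[ax] = (pad, pad)
    padded = np.pad(grid, pad_width)

    # Weighted sum of shifted copies, i.e. a 1D sliding window along `ax`.
    length = grid.shape[ax]
    mixed = np.zeros(grid.shape, dtype=float)
    for k, weight in enumerate(kernel):
        mixed += weight * np.take(padded, np.arange(k, k + length), axis=ax)
    return mixed.reshape(n_freq * n_time, dim)


def adapter(tokens, w_down, w_up, n_freq, n_time, axis, kernel):
    """Bottleneck adapter with axis-wise token mixing (assumed structure)."""
    hidden = tokens @ w_down                                  # down-project
    hidden = look_aside_mix(hidden, n_freq, n_time, axis, kernel)
    hidden = np.maximum(hidden, 0.0)                          # ReLU
    return tokens + hidden @ w_up                             # residual
```

In this sketch, mixing along "time" never combines tokens from different frequency rows, and mixing along "frequency" never combines tokens from different time columns, which is what separates it from a plain 1D convolution over the flattened token sequence. Initialising `w_up` to zeros makes the adapter an identity map at the start of fine-tuning, a common choice for adapters that is assumed here rather than taken from the paper.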

Taxonomy

Topics: Color Science and Applications · Advanced Vision and Imaging · Image Enhancement Techniques

Methods: Adapter