Parameter-Efficient Transfer Learning for Audio-Visual-Language Tasks
Hongye Liu, Xianhai Xie, Yang Gao, Size Li, Zhou YU

TL;DR
This paper introduces LSTTA, a novel adapter-based method for efficient trimodal video understanding that models dense interactions across audio, visual, and language modalities, improving performance while reducing fine-tuning complexity.
Contribution
The paper proposes a new Long Short-Term Trimodal Adapter (LSTTA) architecture for trimodal tasks, enabling flexible, parameter-efficient learning across three modalities with novel modules for temporal and local interactions.
Findings
LSTTA outperforms existing trimodal methods on four benchmark tasks.
LSTTA effectively models dense interactions across audio, visual, and language modalities.
LSTTA is compatible with various pre-trained unimodal or bimodal models.
Abstract
The pretrain-then-finetune paradigm has been widely used in various unimodal and multimodal tasks. However, finetuning all the parameters of a pre-trained model becomes prohibitive as the model size grows exponentially. To address this issue, the adapter mechanism, which freezes the pre-trained model and finetunes only a few extra parameters, was introduced and has delivered promising results. Most studies on adapter architectures are dedicated to unimodal or bimodal tasks, while adapter architectures for trimodal tasks have not yet been investigated. This paper introduces a novel Long Short-Term Trimodal Adapter (LSTTA) approach for video understanding tasks involving audio, visual, and language modalities. Based on the pre-trained models from the three modalities, the designed adapter module is inserted between the sequential blocks to model the dense interactions across the three modalities.…
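The adapter mechanism mentioned in the abstract (freeze the pre-trained model, train only a few extra parameters) can be illustrated with a minimal sketch. This is a generic residual bottleneck adapter, not the LSTTA module itself, whose internals are not described in this summary. The class name `BottleneckAdapter`, the helper `frozen_block`, and the sizes `d_model=768` and `d_bottleneck=64` are all hypothetical choices made for illustration.

```python
import numpy as np


class BottleneckAdapter:
    """Generic residual bottleneck adapter (illustrative; NOT the LSTTA module)."""

    def __init__(self, d_model, d_bottleneck, rng):
        # Down-projection to a small bottleneck dimension.
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        self.b_down = np.zeros(d_bottleneck)
        # Zero-initialised up-projection: the adapter starts as an identity map,
        # so inserting it does not change the frozen model's output at first.
        self.w_up = np.zeros((d_bottleneck, d_model))
        self.b_up = np.zeros(d_model)

    def __call__(self, h):
        z = np.maximum(h @ self.w_down + self.b_down, 0.0)  # ReLU bottleneck
        return h + z @ self.w_up + self.b_up                # residual connection

    def num_params(self):
        parts = (self.w_down, self.b_down, self.w_up, self.b_up)
        return sum(p.size for p in parts)


def frozen_block(h, w):
    """Stand-in for one pre-trained block; its weights are never updated."""
    return np.tanh(h @ w)


rng = np.random.default_rng(0)
d_model, d_bottleneck = 768, 64          # hypothetical sizes
w_frozen = rng.normal(0.0, 0.02, size=(d_model, d_model))
adapter = BottleneckAdapter(d_model, d_bottleneck, rng)

h = rng.normal(size=(4, d_model))        # 4 tokens of hidden features
block_out = frozen_block(h, w_frozen)
out = adapter(block_out)                 # adapter sits after the frozen block

trainable = adapter.num_params()
frozen = w_frozen.size
print(f"trainable adapter params: {trainable}, frozen block params: {frozen}")
```

Under these assumed sizes the adapter adds far fewer parameters than the single frozen weight matrix it follows, which is the sense in which adapter tuning is parameter-efficient. How LSTTA couples the three modalities inside its adapter is beyond what this sketch shows.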
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media
