MultiWay-Adapater: Adapting large-scale multi-modal models for scalable image-text retrieval

Zijun Long; George Killick; Richard McCreadie; Gerardo Aragon Camarasa

arXiv:2309.01516·cs.CV·February 7, 2024

TL;DR

This paper introduces MultiWay-Adapter, a lightweight framework that enhances inter-modal alignment in large-scale multimodal models, enabling efficient adaptation for image-text retrieval with minimal training costs and high effectiveness.

Contribution

The paper proposes MultiWay-Adapter, which uses an 'Alignment Enhancer' to deepen inter-modal alignment and improve transferability, while requiring less training time and a smaller parameter increase than prior methods.

Findings

01. Reduces training time by up to 57%.
02. Increases model size by only 2-3%.
03. Maintains the effectiveness of large multimodal models.
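
To give a feel for how an adapter can add only a few percent to model size, the following is a minimal sketch of the parameter arithmetic for a generic bottleneck adapter (a down-projection followed by an up-projection). It is not the paper's Alignment Enhancer, whose internals are not described on this page. The function names and every dimension used (hidden size, bottleneck size, layer count, backbone size) are hypothetical values chosen for illustration only.

```python
def bottleneck_adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameter count of one generic bottleneck adapter.

    Assumes a down-projection and an up-projection, each a linear
    layer with a bias term. This is an illustrative design, not the
    architecture used in the paper.
    """
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # weights + bias
    up = bottleneck_dim * hidden_dim + hidden_dim        # weights + bias
    return down + up


def relative_overhead(
    backbone_params: int,
    num_layers: int,
    adapters_per_layer: int,
    hidden_dim: int,
    bottleneck_dim: int,
) -> float:
    """Added adapter parameters as a fraction of the frozen backbone."""
    per_adapter = bottleneck_adapter_params(hidden_dim, bottleneck_dim)
    added = num_layers * adapters_per_layer * per_adapter
    return added / backbone_params


if __name__ == "__main__":
    # Hypothetical configuration, not taken from the paper.
    overhead = relative_overhead(
        backbone_params=300_000_000,
        num_layers=24,
        adapters_per_layer=2,
        hidden_dim=1024,
        bottleneck_dim=64,
    )
    print(f"Adapter overhead: {overhead:.2%}")
```

Under these assumed dimensions the added parameters amount to roughly 2% of the backbone. Because the overhead scales linearly with the bottleneck width, that width is the main knob for trading capacity against size in this kind of design.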

Abstract

As Multimodal Large Language Models (MLLMs) grow in size, adapting them to specialized tasks becomes increasingly challenging due to high computational and memory demands. Traditional fine-tuning methods are costly because they require extensive, task-specific training. While efficient adaptation methods exist that aim to reduce these costs, in practice they suffer from shallow inter-modal alignment, which severely hurts model effectiveness. To tackle these computational challenges and improve inter-modal alignment, we introduce the MultiWay-Adapter (MWA), a novel framework featuring an 'Alignment Enhancer'. This enhancer deepens inter-modal alignment, enabling high transferability with minimal tuning effort. Our experiments show that, unlike prior efficient tuning approaches, MWA maintains model effectiveness while reducing training time by up to 57%. MWA is also lightweight,…

Code & Models

Repositories

longkukuhi/multiway-adapter
PyTorch · Official

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques