OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities

Bilal Faye; Hanane Azzag; Mustapha Lebbah

arXiv:2409.11059·cs.CV·September 19, 2024

TL;DR

OneEncoder introduces a lightweight, progressive framework for aligning multiple modalities such as image, text, audio, and video, reducing reliance on large datasets and extensive retraining.

Contribution

The paper proposes a progressive alignment method built on a universal projection module, enabling efficient multimodal integration with small datasets.

Findings

01. Outperforms existing methods on classification and visual question answering tasks.

02. Operates efficiently with small paired datasets.

03. Reduces the need for large-scale, modality-specific encoders.

Abstract

Cross-modal alignment learning integrates information from different modalities, such as text, image, audio, and video, to create unified models. This approach develops shared representations and learns correlations between modalities, enabling applications such as visual question answering and audiovisual content analysis. Current techniques rely on large modality-specific encoders, which must be fine-tuned or trained from scratch on vast aligned datasets (e.g., text-image, text-audio, image-audio). This reliance has three limitations: (i) it is very expensive, because large encoders must be trained on extensive datasets; (ii) large aligned paired datasets are hard to acquire; and (iii) adding a new modality requires retraining the entire framework. To address these issues, we propose OneEncoder, a lightweight framework that progressively represents and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Video Analysis and Summarization

Methods: ALIGN