Lightweight Cross-Modal Representation Learning

Bilal Faye; Hanane Azzag; Mustapha Lebbah; Djamel Bouchaffra

arXiv:2403.04650·cs.LG·September 10, 2024·1 cites

TL;DR

LightCRL introduces a single neural network approach for cross-modal representation learning, reducing resource requirements while maintaining high performance across diverse data modalities.

Contribution

The paper proposes LightCRL, a novel lightweight method using a Deep Fusion Encoder to efficiently learn shared representations across multiple modalities.

Findings

01 Achieves comparable performance to complex models

02 Reduces parameter count significantly

03 Demonstrates robustness across modalities

Abstract

Low-cost cross-modal representation learning is crucial for deriving semantic representations across diverse modalities such as text, audio, images, and video. Traditional approaches typically depend on large specialized models trained from scratch, requiring extensive datasets and resulting in high resource and time costs. To overcome these challenges, we introduce a novel approach named Lightweight Cross-Modal Representation Learning (LightCRL). This method uses a single neural network, called the Deep Fusion Encoder (DFE), which projects data from multiple modalities into a shared latent representation space. This reduces the overall parameter count while still delivering robust performance comparable to more complex systems.
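
The idea of one network projecting several modalities into a shared latent space can be illustrated with a toy sketch. This is not the authors' DFE: the class name `SharedEncoder`, the layer sizes, the assumption that each modality arrives as pre-extracted features of the same dimension, and the small per-modality vector added to the input are all hypothetical choices made for illustration.

```python
import numpy as np


class SharedEncoder:
    """Toy stand-in for a single network shared by every modality (hypothetical)."""

    def __init__(self, feature_dim, latent_dim, modalities, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix and one bias, reused for all modalities.
        self.weight = rng.normal(0.0, 1.0 / np.sqrt(feature_dim),
                                 size=(feature_dim, latent_dim))
        self.bias = np.zeros(latent_dim)
        # Assumption: a small per-modality vector tells the shared
        # network which modality it is currently encoding.
        self.modality_vectors = {
            name: rng.normal(0.0, 0.1, size=feature_dim) for name in modalities
        }

    def encode(self, features, modality):
        """Map a batch of features (batch, feature_dim) into the shared space."""
        x = np.asarray(features, dtype=float) + self.modality_vectors[modality]
        z = np.tanh(x @ self.weight + self.bias)
        # L2-normalise so that dot products are cosine similarities.
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

    def num_parameters(self):
        per_modality = sum(v.size for v in self.modality_vectors.values())
        return self.weight.size + self.bias.size + per_modality


# Random stand-ins for features produced by upstream extractors.
data_rng = np.random.default_rng(1)
text_features = data_rng.normal(size=(4, 16))
image_features = data_rng.normal(size=(4, 16))

encoder = SharedEncoder(feature_dim=16, latent_dim=8, modalities=["text", "image"])
z_text = encoder.encode(text_features, "text")
z_image = encoder.encode(image_features, "image")

# Pairwise cosine similarity between text and image embeddings, shape (4, 4).
similarity = z_text @ z_image.T
```

Because the projection weights are shared, adding a modality in this sketch costs only one extra `feature_dim`-sized vector rather than a whole new encoder, which is the kind of parameter saving the abstract describes. A real system would train these weights, for example with a contrastive objective over `similarity`; that step is omitted here.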

Code & Models

Repositories

b-faye/lightweightcrl (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems