Lightweight Cross-Modal Representation Learning
Bilal Faye, Hanane Azzag, Mustapha Lebbah, Djamel Bouchaffra

TL;DR
LightCRL uses a single neural network for cross-modal representation learning, reducing resource requirements while maintaining high performance across diverse data modalities.
Contribution
The paper proposes LightCRL, a novel lightweight method using a Deep Fusion Encoder to efficiently learn shared representations across multiple modalities.
Findings
Achieves performance comparable to more complex models
Reduces parameter count significantly
Demonstrates robustness across modalities
Abstract
Low-cost cross-modal representation learning is crucial for deriving semantic representations across diverse modalities such as text, audio, images, and video. Traditional approaches typically depend on large specialized models trained from scratch, which require extensive datasets and incur high resource and time costs. To overcome these challenges, we introduce a novel approach named Lightweight Cross-Modal Representation Learning (LightCRL). This method uses a single neural network, called the Deep Fusion Encoder (DFE), which projects data from multiple modalities into a shared latent representation space. Sharing one network across modalities reduces the overall parameter count while still delivering robust performance comparable to more complex systems.
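The core idea in the abstract, one network mapping several modalities into a shared latent space, can be illustrated with a small sketch. This is not the paper's DFE: the `SharedEncoder` class, the feature widths, the layer sizes, and the zero-padding of inputs to a common width are all assumptions made for illustration. The sketch only shows why a single shared encoder needs fewer parameters than one encoder per modality.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply a dense layer to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def count_params(layers):
    """Count weights and biases over a list of dense layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


class SharedEncoder:
    """Hypothetical stand-in for a single shared encoder (not the paper's DFE)."""

    def __init__(self, input_dim, hidden_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.input_dim = input_dim
        self.layers = [
            init_linear(input_dim, hidden_dim, rng),
            init_linear(hidden_dim, latent_dim, rng),
        ]

    def encode(self, features):
        # Assumption: inputs narrower than input_dim are zero-padded so that
        # every modality can pass through the same network.
        x = list(features) + [0.0] * (self.input_dim - len(features))
        hidden = relu(linear(self.layers[0], x))
        return linear(self.layers[1], hidden)


# Hypothetical per-modality feature widths (illustrative values only).
FEATURE_DIMS = {"text": 6, "audio": 4, "image": 8}
HIDDEN_DIM, LATENT_DIM = 16, 5

# One encoder shared by all modalities.
shared = SharedEncoder(max(FEATURE_DIMS.values()), HIDDEN_DIM, LATENT_DIM)
latents = {name: shared.encode([0.1] * dim)
           for name, dim in FEATURE_DIMS.items()}

# Baseline for comparison: a separate encoder of the same shape per modality.
shared_params = count_params(shared.layers)
separate_params = sum(
    count_params(SharedEncoder(dim, HIDDEN_DIM, LATENT_DIM).layers)
    for dim in FEATURE_DIMS.values()
)

print("latent size per modality:", {k: len(v) for k, v in latents.items()})
print("shared encoder parameters:", shared_params)
print("per-modality encoders parameters:", separate_params)
```

Every modality ends up as a vector of the same length, so the outputs can be compared directly in one space. In this toy setup the saving comes purely from reusing the same weights for every modality; the size of any saving in the actual method depends on details not given in this summary.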
Taxonomy
Topics: Speech and dialogue systems
