EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami

TL;DR
EMMA introduces a lightweight, efficient fusion module for multi-modal large language models, enhancing task performance and robustness with minimal additional parameters.
Contribution
The paper presents EMMA, a novel cross-modality fusion module that improves visual and textual integration in MLLMs with minimal complexity increase.
Findings
Up to 9.3% performance improvement on benchmarks
Significant robustness gains against hallucinations
Minimal parameter increase (<0.2%) in model size
Abstract
Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with instructions and processed by the language model to generate high-quality responses. Despite significant progress in enhancing the language component, challenges persist in optimally fusing visual encodings within the language model for task-specific adaptability. Recent research has focused on improving this fusion through modality adaptation modules but at the cost of significantly increased model complexity and training data needs. In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the…
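The abstract describes a lightweight module that fuses visual and textual encodings into instruction-aware visual representations, but it does not specify the design. The sketch below illustrates one plausible reading, assuming a single learned linear map applied along the token axis of the stacked visual and text tokens. The function names (`init_fusion_weights`, `fuse`), the identity-plus-noise initialization, and the toy sizes are all hypothetical and are not taken from the paper.

```python
import random

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def init_fusion_weights(n_visual, n_text, seed=0):
    """Hypothetical init: identity on visual tokens, small noise on text tokens."""
    rng = random.Random(seed)
    weights = []
    for i in range(n_visual):
        row = [1.0 if j == i else 0.0 for j in range(n_visual)]
        row += [rng.gauss(0.0, 0.01) for _ in range(n_text)]
        weights.append(row)
    return weights

def fuse(visual_tokens, text_tokens, weights):
    """Early fusion (assumed form): mix stacked visual + text tokens along the token axis."""
    stacked = visual_tokens + text_tokens  # (n_visual + n_text) x dim
    return matmul(weights, stacked)        # n_visual x dim

# Toy sizes, chosen only for illustration (not the paper's settings).
n_visual, n_text, dim = 4, 3, 8
rng = random.Random(1)
visual = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_visual)]
text = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_text)]

weights = init_fusion_weights(n_visual, n_text)
fused = fuse(visual, text, weights)

# The parameter count of this sketch depends only on token counts, not on dim.
n_params = n_visual * (n_visual + n_text)
```

In this sketch the module adds `n_visual * (n_visual + n_text)` weights regardless of the embedding width, which is one way a fusion step could stay small relative to the language model. Whether EMMA's reported overhead of under 0.2% arises this way is not something the abstract confirms.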
Peer Reviews
Decision: Submitted to ICLR 2025
* Leveraging the inherent alignment properties of pre-trained CLIP text and vision encoders to implement an early fusion module is intuitive.
* The model's structural design is characterized by its simplicity and clarity.
* The analysis presents a novel perspective on modality adaptation.
* In line 251, it is mentioned that current methodologies depend on complex cross-modality modules, particularly highlighting mPLUG-Owl2. However, the paper does not provide a theoretical comparison with methods like LLaVA and Qwen-VL, which utilize straightforward projection layers that add only a minimal number of parameters to achieve visual-language alignment.
* The proposed EMMA introduces a Modality Adaptation module based on the correlation between word tokens in instructions and visual
1. Detailed analytical insights: The paper offers a thorough analysis of the modality alignment process through l1, l2, and mutual information comparisons. This analysis provides deeper insights into how visual and textual tokens are integrated within the alignment module, enhancing the interpretability of the model's decision-making process.
2. Robustness against hallucinations: The empirical results highlight EMMA's superior ability to avoid hallucinations in multi-modal tasks, as demonstrat
1. Inconsistent writing and presentation: The overall writing of the paper is somewhat disorganized, making it difficult to follow and reducing its readability. For example, Section 3.1 is intended to introduce the EMMA method, but the section ends with a discussion of experimental results (in the last paragraph on page 6). Similarly, in Section 3.2, which is supposed to focus on the Analysis on Modality Adaptation by EMMA, the section starts by explaining model details, which would be better su
1-Clear motivation: Improving efficiency is important for MLLMs.
2-Well-written paper: Easy to follow.
3-Substantial experiments and corresponding discussion.
1-The model is very similar to LLaVA: EMMA introduces a CLIP text encoder and an early fusion module.
2-The involvement of the CLIP text encoder is not very convincing: CLIP is trained for a classification task only, but EMMA is a general-purpose instruction-aware MLLM. I am doubtful about its effectiveness when the CLIP text encoder deals with long instructions. You may make some visualizations of the activations when CLIP encounters long instructions or tasks apart from classification.
3-The discussion
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques
