Phantom of Latent for Large Language and Vision Models

Byung-Kwan Lee; Sangyun Chung; Chae Won Kim; Beomchan Park; Yong Man Ro

arXiv:2409.14713·cs.CV·September 24, 2024

TL;DR

This paper introduces Phantom, an efficient large language and vision model family that enhances learning capacity by temporarily increasing latent dimensions, achieving high performance with fewer parameters.

Contribution

The paper proposes Phantom, an efficient LLVM family that temporarily increases the latent hidden dimension during multi-head self-attention (MHSA) and pairs this with an optimization method the authors call Phantom Optimization (PO). The resulting models outperform larger LLVMs.
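
The summary does not spell out how the latent dimension is widened, so the following is only a rough, dependency-free sketch of the general idea and not the authors' implementation. It assumes that an extra feature block of the same width is concatenated to the queries, keys, and values, that attention runs at twice the width, and that the output is folded back to the original width by a weighted average. The names `widened_attention`, `extra_q`, `extra_k`, `extra_v`, and `mix` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of feature vectors."""
    dim = len(q[0])
    width = len(v[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(width)])
    return out


def widened_attention(q, k, v, extra_q, extra_k, extra_v, mix=0.5):
    """Attend at twice the latent width, then fold back to the original width.

    Assumption: extra_* are additional features with the same shape as q, k, v.
    Assumption: mix is the weight used to average the two halves of the output.
    """
    d = len(v[0])
    # List concatenation widens every token from d to 2 * d features.
    wide_q = [a + b for a, b in zip(q, extra_q)]
    wide_k = [a + b for a, b in zip(k, extra_k)]
    wide_v = [a + b for a, b in zip(v, extra_v)]
    wide_out = attention(wide_q, wide_k, wide_v)
    # Weighted average of the two halves returns each token to width d.
    return [[mix * row[c] + (1 - mix) * row[d + c] for c in range(d)] for row in wide_out]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
extra = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
folded = widened_attention(tokens, tokens, tokens, extra, extra, extra)
print(len(folded), len(folded[0]))
```

In this sketch the widening is temporary in the sense that the input and output widths match, so the surrounding layers would not need to change; only the attention step sees the larger feature space.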

Findings

1. Phantom models outperform larger open- and closed-source LLVMs.
2. Temporary latent dimension increase enhances vision-language understanding.
3. Efficient models with fewer parameters achieve competitive performance.

Abstract

The success of visual instruction tuning has accelerated the development of large language and vision models (LLVMs). Following the scaling laws of instruction-tuned large language models (LLMs), LLVMs have further increased their sizes, reaching 26B, 34B, and even 80B parameters. While this increase in model size has yielded significant performance gains, it demands substantially more hardware resources for both training and inference. Consequently, there is a strong need for efficient LLVMs that achieve the performance of larger models while being smaller in size. To meet this need, we present Phantom, a new efficient LLVM family with model sizes of 0.5B, 1.8B, 3.8B, and 7B parameters, which significantly enhances learning capabilities within limited structures. By temporarily increasing the latent hidden dimension during multi-head self-attention (MHSA), we…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 3·Confidence 5

Strengths

The authors present an efficient LLVM family Phantom with enhanced learning capabilities within limited model sizes. They introduce Phantom Optimization (PO) that seems interesting. Phantom demonstrates good performance in their evaluations.

Weaknesses

1. The authors only compare with other open-source multimodal LLMs (MLLMs) using their released checkpoints. They lack comparisons with related baseline methods using the same pre-trained models, datasets, and training configurations. There are several baseline methods [1,2,3,4,5] that contribute to the training algorithms of MLLMs. In addition, there is a large body of work on modifying the attention mechanism in Transformer models that needs to be discussed.
2. It seems that the authors

Reviewer 02·Rating 6·Confidence 3

Strengths

**Comprehensive evaluation** The authors evaluate their VLM architecture + dataset + training strategy on a variety of vision-language benchmarks, and show good performance compared to various baselines (both open and closed models).

**Method ablations** I appreciated how the authors ablate several components of their proposed contribution and report performance over multiple tasks to study how these components add to the overall performance: (1) the weighted-average mechanism for combining

Weaknesses

**Method motivation / understanding** While the authors say that increased feature dimension size in the attention interactions is important for improved quality (“to make LLVMs embed much more vision-language knowledge”; L244-245), I still don’t see the motivation for concatenating the cross-attention and self-attention projections. To increase dimensionality while preserving parameter-count, we could also arbitrarily repeat or expand the features from `head_dimension` to `2 * head_dimension`

Reviewer 03·Rating 6·Confidence 2

Strengths

* Strong results on a range of benchmarks.
* A set of ablations that clearly demonstrate which of the introduced components is responsible for the improved performance.
* The proposed architectural modification to the MHSA module highlights the importance of further exploring VLLM architectures beyond the native transformer.

Weaknesses

**Major**
* "Phantom dimension", the main one of the paper's two contributions, is not described in an accessible manner, which significantly limits the ability to evaluate the paper and its potential impact. Consider adding pseudocode to describe exactly what this modification does.
* It appears that the authors (primarily) used open source components to develop their models; however, it's unclear from the write-up whether their models will similarly be made open source.

**Minor**

Code & Models

Repositories

byungkwanlee/phantom
PyTorch·Official

Taxonomy

Topics·Natural Language Processing Techniques