Phantom of Latent for Large Language and Vision Models
Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro

TL;DR
This paper introduces Phantom, an efficient large language and vision model family that enhances learning capacity by temporarily increasing latent dimensions, achieving high performance with fewer parameters.
Contribution
The paper proposes Phantom, a new LLVM architecture that improves efficiency by increasing latent dimensions during attention, combined with optimization techniques, outperforming larger models.
Findings
Phantom models outperform larger open- and closed-source LLVMs.
Temporary latent dimension increase enhances vision-language understanding.
Efficient models with fewer parameters achieve competitive performance.
Abstract
The success of visual instruction tuning has accelerated the development of large language and vision models (LLVMs). Following the scaling laws of instruction-tuned large language models (LLMs), LLVMs have further increased their sizes, reaching 26B, 34B, and even 80B parameters. While this increase in model size has yielded significant performance gains, it demands substantially more hardware resources for both training and inference. Consequently, there is a strong need for efficient LLVMs that achieve the performance of larger models while being smaller in size. To meet this need, we present Phantom, a new efficient LLVM family with model sizes of 0.5B, 1.8B, 3.8B, and 7B parameters, which significantly enhances learning capabilities within limited structures. By temporarily increasing the latent hidden dimension during multi-head self-attention (MHSA), we…
Peer Reviews
Decision·Submitted to ICLR 2025
The authors present an efficient LLVM family Phantom with enhanced learning capabilities within limited model sizes. They introduce Phantom Optimization (PO) that seems interesting. Phantom demonstrates good performance in their evaluations.
1. The authors only compare with other open-source multimodal LLMs (MLLMs) using their checkpoints. They lack comparisons with related baseline methods that use the same pre-trained models, datasets, and training configurations. Several baseline methods [1,2,3,4,5] contribute to the training algorithms of MLLMs. In addition, a large body of work on modifying the attention mechanism in Transformer models needs to be discussed. 2. It seems that the authors
**Comprehensive evaluation** The authors evaluate their VLM architecture + dataset + training strategy on a variety of vision-language benchmarks, and show good performance compared to various baselines (both open and closed models). **Method ablations** I appreciated how the authors ablate several components of their proposed contribution and report performance over multiple tasks to study how these components add to the overall performance: (1) the weighted-average mechanism for combining
**Method motivation / understanding** While the authors say that increased feature dimension size in the attention interactions is important for improved quality (“to make LLVMs embed much more vision-language knowledge”; L244-245), I still don’t see the motivation for concatenating the cross-attention and self-attention projections. To increase dimensionality while preserving parameter-count, we could also arbitrarily repeat or expand the features from `head_dimension` to `2 * head_dimension`
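The parameter-free alternative raised in this review can be made concrete with a small sketch. It is an illustration only, not Phantom's mechanism or the authors' code: the helper names (`repeat_features`, `scaled_dot`, `softmax`) and the toy vectors are assumptions introduced here. It tiles query and key features from `head_dimension` to `2 * head_dimension` and compares the resulting scaled dot-product logits with the originals.

```python
import math

def repeat_features(vec, times=2):
    """Tile a feature vector, e.g. [a, b] -> [a, b, a, b]. Adds no parameters."""
    return list(vec) * times

def scaled_dot(q, k):
    """Scaled dot-product attention logit for a single query/key pair."""
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values (hypothetical), with head_dimension = 4.
head_dimension = 4
query = [0.5, -1.0, 0.25, 2.0]
keys = [
    [1.0, 0.0, -0.5, 0.5],
    [0.2, 0.3, 0.1, -0.4],
    [-1.0, 1.0, 0.0, 1.0],
]

# Logits at the original width and at the tiled (2 * head_dimension) width.
base_logits = [scaled_dot(query, k) for k in keys]
wide_logits = [
    scaled_dot(repeat_features(query), repeat_features(k)) for k in keys
]

# Tiling doubles each raw dot product while the scale grows by sqrt(2),
# so every logit is multiplied by the same constant factor sqrt(2).
ratios = [w / b for w, b in zip(wide_logits, base_logits)]

base_weights = softmax(base_logits)
wide_weights = softmax(wide_logits)
```

In this toy setting, plain repetition rescales every logit by the same constant, so it acts like a softmax temperature change and leaves the ranking of keys unchanged. That is one way to read the reviewer's question: a baseline of this kind would help isolate what concatenating the cross-attention and self-attention projections contributes beyond the wider dimension itself.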
* Strong results on a range of benchmarks. * A set of ablations that clearly demonstrate which of the introduced components is responsible for the improved performance. * The proposed architectural modification to the MHSA module highlights the importance of further exploring VLLM architectures beyond the native transformer.
**Major** * "Phantom dimension", the main one of the paper's two contributions, is not described in an accessible manner, which significantly limits the ability to evaluate the paper and its potential impact. Consider adding pseudocode to describe exactly what this modification does. * It appears that the authors (primarily) used open-source components to develop their models; however, it is unclear from the write-up whether their models will similarly be made open source. **Minor**
Taxonomy
Topics: Natural Language Processing Techniques
