Rethinking Visual Information Processing in Multimodal LLMs
Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C

TL;DR
This paper introduces LLaViT, a novel approach where large language models are extended to serve as vision encoders, significantly improving multimodal vision-language task performance by integrating visual processing directly into the LLM.
Contribution
LLaViT enables LLMs to function as vision encoders through three key modifications, enhancing multimodal integration and outperforming existing methods like LLaVA.
Findings
LLaViT outperforms LLaVA on multiple benchmarks.
LLaViT surpasses larger models with fewer parameters.
Bidirectional attention improves visual feature integration.
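One way to picture bidirectional attention on visual tokens is as a change to the attention mask. The sketch below is illustrative only and is not taken from the paper's code: the function name `build_attention_mask`, the boolean `is_visual` flag, and the assumption that all visual tokens belong to a single image (so every visual token may attend to every other visual token) are all hypothetical.

```python
import numpy as np


def build_attention_mask(is_visual):
    """Return a boolean (n, n) mask; mask[i, j] is True when token i may attend to token j.

    Assumption (not from the paper): text tokens stay causal, while visual
    tokens may additionally attend to every other visual token.
    """
    is_visual = np.asarray(is_visual, dtype=bool)
    n = is_visual.shape[0]
    # Standard causal mask: each token sees itself and earlier positions.
    causal = np.tril(np.ones((n, n), dtype=bool))
    # Pairs where both the query and the key are visual tokens.
    visual_pair = is_visual[:, None] & is_visual[None, :]
    return causal | visual_pair


if __name__ == "__main__":
    # Three visual tokens followed by two text tokens.
    mask = build_attention_mask([True, True, True, False, False])
    print(mask.astype(int))
```

Under these assumptions the visual block of the mask is fully dense, while text tokens still cannot see later positions, so autoregressive text generation is left unchanged.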
Abstract
Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design struggles to integrate visual features effectively because of the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM serves not only as a language model but also as a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of…
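Modification (1), separate QKV projections for the vision modality, can be sketched as routing each token through one of two projection matrices depending on its modality. This is a minimal single-head NumPy illustration under stated assumptions, not the authors' implementation: the names `modality_qkv`, `w_text`, and `w_vision`, the fused `(d, 3*d)` weight layout, and the absence of biases are all hypothetical choices.

```python
import numpy as np


def modality_qkv(x, is_visual, w_text, w_vision):
    """Project tokens to Q, K, V using modality-specific weights.

    x:         (n, d) token hidden states
    is_visual: (n,) boolean flags, True for visual tokens
    w_text:    (d, 3*d) fused QKV weights for text tokens (hypothetical layout)
    w_vision:  (d, 3*d) fused QKV weights for visual tokens (hypothetical layout)
    Returns q, k, v, each of shape (n, d).
    """
    is_visual = np.asarray(is_visual, dtype=bool)
    # Compute both projections, then pick per token. A real implementation
    # would likely gather tokens by modality to avoid the duplicate matmul.
    qkv_text = x @ w_text
    qkv_vision = x @ w_vision
    qkv = np.where(is_visual[:, None], qkv_vision, qkv_text)
    q, k, v = np.split(qkv, 3, axis=-1)
    return q, k, v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    x = rng.normal(size=(5, d))
    w_text = rng.normal(size=(d, 3 * d))
    w_vision = rng.normal(size=(d, 3 * d))
    q, k, v = modality_qkv(x, [True, True, True, False, False], w_text, w_vision)
    print(q.shape, k.shape, v.shape)
```

Because only the projection weights differ, the resulting Q, K, and V still feed a single shared attention operation, so text and visual tokens continue to interact in the same layers.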
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper is clearly written, and the structure is easy to follow. 2. The overall computational pipeline is straightforward to implement. 3. The motivation of improving visual emphasis and token weighting is generally relevant.
1. The claimed contribution based on the feature re-weighting strategy reads more like an intuitive design choice than a meaningful research idea. 2. The approach resembles common practices in project work, adjusting feature emphasis and reporting slightly improved accuracy without introducing new perspectives on the problem. 3. The paper does not discuss several closely related token aggregation approaches (e.g., DeepStack) and decoupled transformers (e.g., MoT, EVEv2, Bagel, Mono-Inter
1. Originality: Reframing LLMs as vision encoders is genuinely novel. The three targeted modifications collectively close key visual-processing gaps, and the “visual token translation” mechanism (Sec. 2.2) offers fresh insight into MLLM internals. 2. Quality: Evaluation is comprehensive—17 benchmarks, multiple LLM families (Qwen2.5, Phi-3.5), and both Standard/Any-Res settings. Ablations (Tab. 3) and qualitative analyses (Figs. 3, 5–6) rigorously validate each component. 3. Clarity: The expositi
1. Baseline diversity: Comparisons stop at LLaVA-1.5; missing head-to-head results with recent MLLMs (e.g., Qwen-VL, InternVL) undercut claims of state-of-the-art performance. 2. Computational overhead: While parameter growth (5–12%, Tab. 4) and FLOPs (Tab. 5) are reported, end-to-end latency, peak memory, and throughput under realistic batch sizes/resolutions are not quantified—crucial for deployment. 3. Theoretical grounding: The benefit of semantic alignment is asserted but not explained; “vi
- Clear conceptual motivation: reinterprets the LLM as part of the vision pipeline rather than a downstream consumer of features. - Introduces three simple yet effective modifications, each empirically validated with clear ablations. - Achieves substantial gains across diverse multimodal benchmarks compared to LLaVA-1.5. - Solid ablations on multiple LLM scales, both standard and high-resolution settings, and consistent methodology. - Qualitative results provide intuitive evidence that vi
- The absolute performance remains relatively low compared to state-of-the-art models. While this is understandable given the limited dataset and compute budget, it raises questions about scalability. Can LLaViT be extended as a fine-tuning method for existing high-end MLLMs such as the Qwen-VL or InternVL series? Demonstrating compatibility with frontier models would substantially enhance the practical usability and relevance of this approach. - The paper provides little discussion on the text
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
