LM4LV: A Frozen Large Language Model for Low-level Vision Tasks
Boyang Zheng, Jinjin Gu, Shijun Li, Chao Dong

TL;DR
This paper introduces LM4LV, a framework that enables a frozen large language model to solve a range of low-level vision tasks without any multi-modal data or prior, bridging the gap between MLLMs and low-level vision.
Contribution
The work demonstrates that a frozen LLM can solve low-level vision tasks, something no earlier work had shown for MLLMs, without any multi-modal data or prior.
Findings
LM4LV enables a frozen LLM to solve a range of low-level vision tasks.
The approach requires no multi-modal data or prior; according to the reviews, only two linear layers are trained while the LLM stays frozen.
It bridges the gap between MLLMs and low-level vision tasks.
Abstract
The success of large language models (LLMs) has fostered a new research trend of multi-modality large language models (MLLMs), which changes the paradigm of various fields in computer vision. Though MLLMs have shown promising results in numerous high-level vision and vision-language tasks such as VQA and text-to-image, no works have demonstrated how low-level vision tasks can benefit from MLLMs. We find that most current MLLMs are blind to low-level features due to the design of their vision modules, and are thus inherently incapable of solving low-level vision tasks. In this work, we propose LM4LV, a framework that enables a FROZEN LLM to solve a range of low-level vision tasks without any multi-modal data or prior. This showcases the LLM's strong potential in low-level vision and bridges the gap between MLLMs and low-level vision tasks. We hope this work can inspire new…
Peer Reviews
Decision · Submitted to ICLR 2025
(1) It proves that a frozen LLM can solve low-level vision tasks. (2) The ablation experiments are sufficient to show that the processing of low-level information is not due to the trainable linear layer, and that the text pre-training plays a role.
The selection of vision encoders is relatively small; exploring more vision encoders would be more convincing.
+ The proposed method is simple and easy to understand. + This work is the first to use LLM for low-level vision tasks. + Some conclusions in the paper are interesting. E.g. MAE visual tokens are robust to rotation.
- It would be helpful if the authors could specify the number of visual tokens generated by MAE for each image. Moreover, a discussion of computation cost is missing. - In Sec. 4.1, the authors use the MAE-r model as a baseline. Since MAE-r is trained for image reconstruction only, the baseline is not very strong. To establish a stronger baseline, the authors could consider adding linear adapters to MAE-r and training them for image restoration tasks. Ideally, the linear
Pro: 1. The paper tackles an interesting research direction by investigating whether frozen LLMs can handle low-level vision tasks without multi-modal training, addressing a significant gap in current MLLM research. 2. The authors propose LM4LV, an efficient framework that achieves impressive results across multiple low-level vision tasks using only two trainable linear layers while keeping the LLM frozen. 3. The paper provides thorough empirical analysis through comprehensive ablation studies,
Cons: 1. The paper states, “Furthermore, we cancel the causal attention mask and the ROPE position embedding in the forward process, as they are not the common practice for vision modules.” However, ROPE and its variant 2D-ROPE are widely used in large vision transformers (e.g., EVA). This sentence needs revision, and additional experiments with position embeddings would strengthen the analysis. 2. The paper does not explore different LLM variants and sizes. While testing very large LLMs may b
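The reviews above mention two trainable linear layers around a frozen LLM and ask how many visual tokens MAE produces per image. As a rough illustration only, the following Python sketch counts patch tokens for a ViT-style encoder and the parameters of two linear adapters. Every value in `AdapterConfig` (image size, patch size, token width, LLM width) and the direction of each adapter are assumptions chosen for illustration; none are taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterConfig:
    # All values are hypothetical placeholders, not taken from the paper.
    image_size: int = 224   # square input resolution in pixels
    patch_size: int = 16    # side length of each square patch
    vision_dim: int = 1024  # width of one visual token
    llm_dim: int = 4096     # hidden width of the frozen LLM


def num_visual_tokens(cfg: AdapterConfig) -> int:
    """Patch tokens for a ViT-style encoder (class token not counted)."""
    if cfg.image_size % cfg.patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = cfg.image_size // cfg.patch_size
    return per_side * per_side


def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Parameter count of one dense layer y = Wx + b."""
    return in_dim * out_dim + (out_dim if bias else 0)


def trainable_params(cfg: AdapterConfig) -> int:
    """Assumed layout: one adapter maps visual tokens into the LLM width,
    the other maps LLM outputs back to the visual token width."""
    to_llm = linear_params(cfg.vision_dim, cfg.llm_dim)
    to_vision = linear_params(cfg.llm_dim, cfg.vision_dim)
    return to_llm + to_vision
```

Under these assumed values, `num_visual_tokens(AdapterConfig())` gives 196 and `trainable_params(AdapterConfig())` gives 8,393,728, a small number compared with the billions of parameters typical of an LLM. The real figures depend on the encoder and LLM the authors actually use.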
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Topic Modeling
