Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning

Xinyu Wei; Guoli Yang; Jialu Zhou; Mingyue Yang; Leqian Li; Kedi Zhang; Chunping Qiu

arXiv:2508.17638 · cs.CV · August 26, 2025

TL;DR

DEHVF introduces a dynamic hierarchical visual feature embedding method for vision-language models, reducing computational costs while enhancing fine-grained semantic alignment and achieving superior benchmark performance.

Contribution

The paper presents a hierarchical visual feature fusion approach that dynamically aligns visual and language representations without increasing the input sequence length.

Findings

01 Outperforms existing PEFT methods on VL benchmarks

02 Maintains efficient training and inference

03 Achieves higher accuracy in VQA and image captioning

Abstract

Large Vision-Language Models (LVLMs) commonly follow a paradigm that projects visual features and then concatenates them with text tokens to form a unified input sequence for Large Language Models (LLMs). However, this paradigm significantly increases the length of the input sequence, resulting in substantial computational overhead. Existing methods attempt to fuse visual information into the intermediate layers of LLMs, which alleviates the sequence length issue but often neglects the hierarchical semantic representations within the model and the fine-grained visual information available in the shallower visual encoding layers. To address this limitation, we propose DEHVF, an efficient vision-language fine-tuning method based on dynamic embedding and fusion of hierarchical visual features. Its core lies in leveraging the inherent hierarchical representation characteristics of…
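
To make the sequence-length concern in the abstract concrete, the following is a minimal sketch of how concatenating visual tokens with text tokens inflates self-attention cost. It is not the DEHVF method. The token counts (576 visual tokens, 64 text tokens) and the function names are illustrative assumptions, not figures from the paper, and the cost model simply counts query-key pairs in one full self-attention layer.

```python
# Illustrative sketch only: token counts and names are assumed, not from the paper.

def concat_seq_len(num_visual_tokens: int, num_text_tokens: int) -> int:
    """Length of the unified sequence when projected visual tokens
    are concatenated with text tokens."""
    return num_visual_tokens + num_text_tokens


def attention_pairs(seq_len: int) -> int:
    """Query-key pairs in one full self-attention layer (seq_len squared)."""
    return seq_len * seq_len


NUM_VISUAL = 576  # assumed: a 24 x 24 patch grid from a vision encoder
NUM_TEXT = 64     # assumed prompt length in tokens

# Concatenation paradigm: visual and text tokens share one sequence.
concat_len = concat_seq_len(NUM_VISUAL, NUM_TEXT)

# If visual information is fused inside the layers instead,
# the sequence keeps only the text tokens.
text_only_len = NUM_TEXT

overhead = attention_pairs(concat_len) / attention_pairs(text_only_len)
print(concat_len, text_only_len, overhead)
```

Under these assumed counts the concatenated sequence is ten times longer than the text-only one, so the pairwise attention cost grows by roughly the square of that factor. This is the overhead that in-layer fusion approaches aim to avoid.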
