FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression

Jianjian Li; Junquan Fan; Feng Tang; Gang Huang; Shitao Zhu; Songlin Liu; Nian Xie; Wulong Liu; Yong Liao

arXiv:2502.18512 · cs.CV · February 27, 2025

Open Access

TL;DR

This paper introduces FCoT-VL, an efficient visual token compression framework for high-resolution, text-oriented vision-language models, significantly reducing computation while maintaining or improving performance.

Contribution

It proposes a novel self-distillation pre-training and post-training framework for visual token compression in text-oriented VLLMs, addressing performance degradation issues.

Findings

01. Reduces computational overhead in high-resolution VLLMs

02. Outperforms baseline models on text-oriented benchmarks

03. Requires limited image-text pairs for training

Abstract

The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a lightweight self-distillation pre-training stage to compress the visual tokens, requiring a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of token-compressed models, we construct a high-quality post-training stage. To validate the effectiveness of our method, we apply it to an advanced VLLM,…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques