GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models

Mukai Li; Lei Li; Shansan Gong; Qi Liu

arXiv:2412.12735·cs.CV·December 18, 2024

TL;DR

GIRAFFE is a systematic approach to extending the context length of Visual Language Models. It outperforms comparable open-source long VLMs on long-context benchmarks while maintaining short-context performance.

Contribution

The paper presents GIRAFFE, a method that extends VLM context length to 128K, with new data curation, improved position extension, and hybrid-resolution training, surpassing existing open-source models.

Findings

01. GIRAFFE extends VLM context length to 128K.

02. It outperforms similar open-source long VLMs on long-context benchmarks.

03. It is competitive with GPT-4V on long-context tasks.

Abstract

Visual Language Models (VLMs) demonstrate impressive capabilities in processing multimodal inputs, yet applications such as visual agents, which require handling multiple images and high-resolution videos, demand enhanced long-range modeling. Moreover, existing open-source VLMs lack systematic exploration into extending their context length, and commercial models often provide limited details. To tackle this, we aim to establish an effective solution that enhances long context performance of VLMs while preserving their capacities in short context scenarios. Towards this goal, we make the best design choice through extensive experiment settings from data curation to context window extending and utilizing: (1) we analyze data sources and length distributions to construct ETVLM - a data recipe to balance the performance across scenarios; (2) we examine existing position extending methods,…

Code & Models

Repositories

kiaia/GIRAFFE (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques