GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models
Mukai Li, Lei Li, Shansan Gong, Qi Liu

TL;DR
GIRAFFE introduces a systematic approach to extending the context length of Visual Language Models, achieving state-of-the-art results among open-source long VLMs on long-range benchmarks while maintaining short-context performance.
Contribution
The paper presents GIRAFFE, a method that extends VLM context length to 128K, with new data curation, improved position extension, and hybrid-resolution training, surpassing existing open-source models.
Findings
GIRAFFE extends the VLM context length to 128K.
It outperforms comparable open-source long VLMs on long-context benchmarks.
It is competitive with GPT-4V on long-context tasks.
Abstract
Visual Language Models (VLMs) demonstrate impressive capabilities in processing multimodal inputs, yet applications such as visual agents, which require handling multiple images and high-resolution videos, demand enhanced long-range modeling. Moreover, existing open-source VLMs lack systematic exploration into extending their context length, and commercial models often provide limited details. To tackle this, we aim to establish an effective solution that enhances long context performance of VLMs while preserving their capacities in short context scenarios. Towards this goal, we make the best design choice through extensive experiment settings from data curation to context window extending and utilizing: (1) we analyze data sources and length distributions to construct ETVLM - a data recipe to balance the performance across scenarios; (2) we examine existing position extending methods,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
