Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

Alex Jinpeng Wang; Linjie Li; Yiqi Lin; Min Li; Lijuan Wang; Mike Zheng Shou

arXiv:2406.02547·cs.CV·June 5, 2024·2 cites

TL;DR

This paper introduces VisInContext, a method that uses visual tokens to efficiently extend in-context text length in multi-modal models, reducing resource costs and improving downstream task performance.

Contribution

The paper presents a novel technique that uses visual tokens to significantly increase in-context text length in multi-modal models at minimal additional computational cost.

Findings

01. Expands in-context text length from 256 to 2048 tokens.

02. Achieves superior downstream benchmark performance.

03. Enhances document understanding and retrieval capabilities.

Abstract

Training models with longer in-context lengths is a significant challenge for multimodal models due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to efficiently increase the in-context text length of multi-modality large language models (MLLMs). We present Visualized In-Context Text Processing (VisInContext), which processes long in-context text using visual tokens. This technique significantly reduces GPU memory usage and floating point operations (FLOPs) in both the training and inference stages. For instance, our method expands the pre-training in-context text length from 256 to 2048 tokens with nearly the same FLOPs for a 56 billion parameter MoE model. Experimental results demonstrate that a model trained with VisInContext delivers superior performance on common downstream…
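To give a feel for why longer in-context text is expensive, and why a compact visual representation could help, here is a back-of-the-envelope sketch that counts query-key pairs in full self-attention. It is a toy cost model, not the paper's architecture or its FLOPs accounting: the function names are made up for illustration, `ASSUMED_VISUAL_TOKENS` is a hypothetical value rather than a figure from the paper, and the sketch simply assumes the visual tokens sit in the same sequence as the text.

```python
def attention_pair_count(seq_len: int) -> int:
    """Number of query-key pairs in full self-attention over seq_len tokens."""
    return seq_len * seq_len


def relative_attention_cost(base_len: int, new_len: int) -> float:
    """How many times more query-key pairs new_len needs than base_len."""
    return attention_pair_count(new_len) / attention_pair_count(base_len)


# In-context text lengths quoted in the abstract.
SHORT_CONTEXT = 256
LONG_CONTEXT = 2048

# Hypothetical: number of visual tokens standing in for the extra text.
# Illustration only; this is NOT a number taken from the paper.
ASSUMED_VISUAL_TOKENS = 64

# Cost of simply feeding 2048 text tokens instead of 256.
naive_ratio = relative_attention_cost(SHORT_CONTEXT, LONG_CONTEXT)

# Cost if the extra text were carried by a small, fixed set of visual tokens.
compressed_ratio = relative_attention_cost(
    SHORT_CONTEXT, SHORT_CONTEXT + ASSUMED_VISUAL_TOKENS
)

print(f"text tokens only, 256 -> 2048: {naive_ratio:.1f}x attention pairs")
print(f"256 text + {ASSUMED_VISUAL_TOKENS} assumed visual tokens: {compressed_ratio:.2f}x")
```

Under these assumptions the quadratic term grows 64-fold when the text is lengthened directly, while the compact representation grows it by a much smaller factor. The actual savings depend on how the model encodes and attends to the rendered text, which the abstract does not detail.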

Code & Models

Repositories

showlab/VisInContext (PyTorch, official)

Videos

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Open Education and E-Learning · Speech and dialogue systems

Methods: Mixture of Experts