From Text to Pixel: Advancing Long-Context Understanding in MLLMs

Yujie Lu; Xiujun Li; Tsu-Jui Fu; Miguel Eckstein; William Yang Wang

arXiv:2405.14213·cs.CV·August 27, 2024

TL;DR

SEEKER is a multimodal large language model that compresses long text into images, encoding it in visual pixel space so that long inputs fit within a fixed token budget. It improves long-context understanding and outperforms existing models on multimodal tasks.

Contribution

The paper introduces SEEKER, an approach that compresses long text into images so that MLLMs can process long contexts within a fixed token-length budget, addressing the limited capacity of current models to handle long input sequences efficiently.

Findings

01 SEEKER outperforms existing MLLMs in long-context tasks.

02 SEEKER uses fewer image tokens to encode the same textual information.

03 SEEKER demonstrates superior efficiency and accuracy in long-form multimodal understanding.

Abstract

The rapid progress in Multimodal Large Language Models (MLLMs) has significantly advanced their ability to process and understand complex visual and textual information. However, the integration of multiple images and extensive textual contexts remains a challenge due to the inherent limitation of the models' capacity to handle long input sequences efficiently. In this paper, we introduce SEEKER, a multimodal large language model designed to tackle this issue. SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images, enabling the model to handle long text within a fixed token-length budget efficiently. Our empirical experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach, and is…
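The fixed token budget described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only paginates text into pages that would each be rendered as one image, then counts image tokens for a ViT-style encoder that emits one token per patch. The function names, the 80-character line width, the 40 lines per image, the 336-pixel image size, and the 14-pixel patch size are all assumptions chosen for illustration, and the actual rendering of text to pixels is omitted.

```python
import textwrap


def paginate_text(text, chars_per_line=80, lines_per_image=40):
    """Wrap text into lines and group the lines into pages.

    Each page stands in for one rendered image. The layout values are
    illustrative assumptions, not settings taken from the paper.
    """
    lines = textwrap.wrap(text, width=chars_per_line)
    return [
        lines[i:i + lines_per_image]
        for i in range(0, len(lines), lines_per_image)
    ]


def image_token_count(num_images, image_size=336, patch_size=14):
    """Count tokens assuming one token per square patch per image.

    image_size and patch_size are hypothetical encoder settings.
    """
    patches_per_side = image_size // patch_size
    return num_images * patches_per_side ** 2


def fits_budget(text, token_budget, **layout):
    """Return (number of images, image tokens, whether the budget holds)."""
    pages = paginate_text(text, **layout)
    tokens = image_token_count(len(pages))
    return len(pages), tokens, tokens <= token_budget
```

For example, `fits_budget("word " * 1000, token_budget=2000)` reports how many images the text needs under these assumed settings and whether the resulting image tokens stay within the budget. The point of the sketch is that the image-token cost depends on the number of rendered pages, not directly on the number of text tokens.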

Code & Models

Repositories

yujielu10/seeker
none · Official

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies