From Text to Pixel: Advancing Long-Context Understanding in MLLMs
Yujie Lu, Xiujun Li, Tsu-Jui Fu, Miguel Eckstein, William Yang Wang

TL;DR
SEEKER is a multimodal large language model that renders long text as images, encoding it in visual pixel space so that it fits within a fixed token budget. The authors report improved long-context understanding and better results than existing MLLMs on long-context multimodal tasks.
Contribution
The paper introduces SEEKER, an approach that compresses long text sequences into images so that an MLLM can process long contexts within a fixed token-length budget, addressing the limited capacity of current models to handle long input sequences efficiently.
Findings
SEEKER outperforms existing MLLMs in long-context tasks.
SEEKER uses fewer image tokens than the OCR-based approach to convey the same amount of textual information.
SEEKER demonstrates superior efficiency and accuracy in long-form multimodal understanding.
Abstract
The rapid progress in Multimodal Large Language Models (MLLMs) has significantly advanced their ability to process and understand complex visual and textual information. However, the integration of multiple images and extensive textual contexts remains a challenge due to the inherent limitation of the models' capacity to handle long input sequences efficiently. In this paper, we introduce SEEKER, a multimodal large language model designed to tackle this issue. SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images, enabling the model to handle long text within a fixed token-length budget efficiently. Our empirical experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach, and is…
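The trade-off behind a fixed token-length budget can be sketched with some back-of-the-envelope arithmetic. The snippet below is a minimal illustration, not the paper's method: the function names are hypothetical, and the image size, patch size, glyph size, and characters-per-text-token ratio are all assumed values chosen only for the example. It compares how many text tokens a passage would cost against how many patch tokens the same passage would cost if rendered onto images and passed through a ViT-style encoder.

```python
import math


def image_tokens(width: int, height: int, patch: int) -> int:
    """Patch tokens a ViT-style encoder would emit for one image (assumed design)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def chars_per_image(width: int, height: int, char_w: int, char_h: int) -> int:
    """Monospaced characters that fit on one rendered image."""
    return (width // char_w) * (height // char_h)


def compare(
    n_chars: int,
    chars_per_text_token: float = 4.0,  # assumption: rough tokenizer average
    width: int = 336,                   # assumption: encoder input resolution
    height: int = 336,
    patch: int = 14,                    # assumption: ViT patch size
    char_w: int = 6,                    # assumption: rendered glyph size in pixels
    char_h: int = 12,
) -> tuple[int, int]:
    """Return (text_tokens, image_tokens) needed to carry n_chars characters."""
    text_tokens = math.ceil(n_chars / chars_per_text_token)
    capacity = chars_per_image(width, height, char_w, char_h)
    n_images = math.ceil(n_chars / capacity)
    return text_tokens, n_images * image_tokens(width, height, patch)


if __name__ == "__main__":
    text_cost, image_cost = compare(10_000)
    print(f"text tokens: {text_cost}, image tokens: {image_cost}")
```

Under these assumptions the outcome depends entirely on the glyph size and on how many tokens the encoder keeps per image: smaller glyphs or a more compact visual encoding lower the image-token cost, which is the quantity the abstract says SEEKER aims to optimize.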
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
