Token-Efficient Multimodal Reasoning via Image Prompt Packaging

Joong Ho Choi; Jiayang Zhao; Avani Appalla; Himansh Mukesh; Dhwanil Vasani; Boyi Qian

arXiv:2604.02492 · cs.CV · April 6, 2026

TL;DR

This paper introduces Image Prompt Packaging (IPPg), a prompting method that embeds structured text directly into images. Across five datasets and three frontier models, it cuts inference cost by 35.8–91.0% while keeping accuracy competitive in many, though not all, settings.

Contribution

IPPg is a prompting paradigm that embeds structured text into images to reduce text token overhead, yielding substantial token and cost savings for multimodal reasoning models.

Findings

1. IPPg achieves 35.8–91.0% inference cost reductions.

2. Token compression reaches up to 96%, and accuracy remains competitive in many settings.

3. Outcomes are highly model- and task-dependent, and specific error vulnerabilities are identified.

Abstract

Deploying large multimodal language models at scale is constrained by token-based inference costs, yet the cost-performance behavior of visual prompting strategies remains poorly characterized. We introduce Image Prompt Packaging (IPPg), a prompting paradigm that embeds structured text directly into images to reduce text token overhead, and benchmark it across five datasets, three frontier models (GPT-4.1, GPT-4o, Claude 3.5 Sonnet), and two task families (VQA and code generation). We derive a cost formulation decomposing savings by token type and show IPPg achieves 35.8–91.0% inference cost reductions. Despite token compression of up to 96%, accuracy remains competitive in many settings, though outcomes are highly model- and task-dependent: GPT-4.1 achieves simultaneous accuracy and cost gains on CoSQL, while Claude 3.5 incurs cost increases on several VQA benchmarks. Systematic…
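The abstract mentions a cost formulation that decomposes savings by token type but does not reproduce it here. As a rough illustration only, the sketch below assumes a simple linear pricing model (cost = text-input tokens, image-input tokens, and output tokens, each multiplied by a per-token price). The class names, function names, token counts, and prices are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one request (hypothetical illustration values)."""
    text_in: int   # text prompt tokens
    image_in: int  # tokens billed for image input
    out: int       # generated output tokens


@dataclass(frozen=True)
class Prices:
    """Assumed per-token prices in dollars; not real provider rates."""
    text_in: float
    image_in: float
    out: float


def cost(usage: TokenUsage, prices: Prices) -> float:
    """Total cost under the assumed linear pricing model."""
    return (usage.text_in * prices.text_in
            + usage.image_in * prices.image_in
            + usage.out * prices.out)


def savings_by_type(baseline: TokenUsage, packed: TokenUsage,
                    prices: Prices) -> dict[str, float]:
    """Per-token-type savings; a negative entry means that type got costlier."""
    return {
        "text_in": (baseline.text_in - packed.text_in) * prices.text_in,
        "image_in": (baseline.image_in - packed.image_in) * prices.image_in,
        "out": (baseline.out - packed.out) * prices.out,
    }


def relative_reduction(baseline: TokenUsage, packed: TokenUsage,
                       prices: Prices) -> float:
    """Fraction of baseline cost saved; negative if packing costs more."""
    base = cost(baseline, prices)
    return (base - cost(packed, prices)) / base


if __name__ == "__main__":
    # Made-up numbers: packing moves most prompt text into a larger image.
    prices = Prices(text_in=2e-6, image_in=2e-6, out=8e-6)
    baseline = TokenUsage(text_in=2000, image_in=500, out=100)
    packed = TokenUsage(text_in=100, image_in=900, out=100)

    for kind, saved in savings_by_type(baseline, packed, prices).items():
        print(f"{kind}: {saved:+.6f} USD")
    print(f"relative reduction: {relative_reduction(baseline, packed, prices):.1%}")
```

Under this assumed model, packing pays off only when the text-token savings outweigh the added image-token cost. That is one plausible way to read the reported cost increases for Claude 3.5 on several VQA benchmarks, though the paper's own formulation may differ.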
