Token-Efficient Multimodal Reasoning via Image Prompt Packaging
Joong Ho Choi, Jiayang Zhao, Avani Appalla, Himansh Mukesh, Dhwanil Vasani, Boyi Qian

TL;DR
This paper introduces Image Prompt Packaging (IPPg), a method that embeds structured text into images to reduce inference costs in multimodal models, with reported reductions of 35.8% to 91.0%, while keeping accuracy competitive in many settings.
Contribution
IPPg is a novel prompting paradigm that embeds text into images, enabling substantial token and cost savings in multimodal reasoning models.
Findings
IPPg achieves inference cost reductions of 35.8% to 91.0%.
Token compression reaches up to 96%, with accuracy remaining competitive in many settings.
Outcomes are highly model- and task-dependent, and specific error vulnerabilities are identified.
Abstract
Deploying large multimodal language models at scale is constrained by token-based inference costs, yet the cost-performance behavior of visual prompting strategies remains poorly characterized. We introduce Image Prompt Packaging (IPPg), a prompting paradigm that embeds structured text directly into images to reduce text token overhead, and benchmark it across five datasets, three frontier models (GPT-4.1, GPT-4o, Claude 3.5 Sonnet), and two task families (VQA and code generation). We derive a cost formulation decomposing savings by token type and show IPPg achieves 35.8% to 91.0% inference cost reductions. Despite token compression of up to 96%, accuracy remains competitive in many settings, though outcomes are highly model- and task-dependent: GPT-4.1 achieves simultaneous accuracy and cost gains on CoSQL, while Claude 3.5 incurs cost increases on several VQA benchmarks. Systematic…
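The abstract mentions a cost formulation that decomposes savings by token type but does not reproduce it here. The following is a minimal sketch of what such a decomposition could look like, not the authors' formulation. It assumes that image tokens are billed at the same per-token rate as text input tokens and that prices are quoted per one million tokens. The names `TokenUsage`, `cost`, and `savings_by_type`, along with all token counts and prices in the usage note, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    text_in: int   # text prompt tokens sent to the model
    image_in: int  # tokens billed for image input
    text_out: int  # tokens generated by the model


def cost(usage, price_in, price_out):
    """Request cost, with prices given per one million tokens.

    Assumption: image tokens are billed at the input-token price.
    """
    input_cost = (usage.text_in + usage.image_in) * price_in
    output_cost = usage.text_out * price_out
    return (input_cost + output_cost) / 1_000_000


def savings_by_type(baseline, packed, price_in, price_out):
    """Split the total saving into text-input, image-input, and output parts.

    A negative part means that token type became more expensive after packing.
    """
    parts = {
        "text_in": (baseline.text_in - packed.text_in) * price_in / 1_000_000,
        "image_in": (baseline.image_in - packed.image_in) * price_in / 1_000_000,
        "text_out": (baseline.text_out - packed.text_out) * price_out / 1_000_000,
    }
    parts["total"] = sum(parts.values())
    return parts
```

As a usage illustration with made-up numbers, `savings_by_type(TokenUsage(10000, 0, 500), TokenUsage(400, 1500, 500), 2.0, 8.0)` gives a positive `text_in` part and a negative `image_in` part. Under this sketch, packing lowers cost only when the text tokens removed outweigh the image tokens added, which is one plausible reading of why the abstract reports cost increases for some model and benchmark combinations.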
