Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
Grant Wardle, Teo Susnjak

TL;DR
This study investigates how the order of images and text in multi-modal prompts affects the reasoning accuracy of large language models. Sequencing has a clearer effect on simple, single-image tasks than on complex, multi-image ones, and the structure of the prompt itself also shapes performance.
Contribution
It provides empirical evidence on the influence of modality sequencing in multi-modal prompts and highlights the significance of prompt structure in complex reasoning tasks.
Findings
Modality order has a clear effect on accuracy in simple, single-image tasks.
Sequencing impact diminishes in complex, multi-image reasoning tasks.
Prompt structure and logical flow are crucial for multi-modal reasoning.
Abstract
This paper examines how the sequencing of images and text within multi-modal prompts influences the reasoning performance of large language models (LLMs). We performed empirical evaluations using three commercial LLMs. Our results demonstrate that the order in which modalities are presented can significantly affect performance, particularly in tasks of varying complexity. For simpler tasks involving a single image, modality sequencing had a clear impact on accuracy. However, in more complex tasks involving multiple images and intricate reasoning steps, the effect of sequencing diminished, likely due to the increased cognitive demands of the task. Our findings also highlight the importance of question/prompt structure. In nested and multi-step reasoning tasks, modality sequencing played a key role in shaping model performance. While LLMs excelled in the initial stages of reasoning, they…
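To make the idea of modality sequencing concrete, the sketch below builds the same question in two orderings, image first and text first, so the variants can be sent to a model and compared. It is a minimal illustration only: the function names, the `order` labels, and the content-part dictionary layout are hypothetical and do not correspond to the authors' code or to any particular vendor's API.

```python
from typing import Dict, List

# Hypothetical content-part layout: {"type": "text", ...} or {"type": "image", ...}.
# Real multi-modal APIs use their own schemas; adapt the dictionaries as needed.
ORDERS = ("image_first", "text_first")


def build_prompt(question: str, image_refs: List[str], order: str) -> List[Dict[str, str]]:
    """Return the ordered content parts for one multi-modal prompt."""
    text_part = {"type": "text", "text": question}
    image_parts = [{"type": "image", "ref": ref} for ref in image_refs]
    if order == "image_first":
        return image_parts + [text_part]
    if order == "text_first":
        return [text_part] + image_parts
    raise ValueError(f"unknown order: {order!r}")


def prompt_variants(question: str, image_refs: List[str]) -> Dict[str, List[Dict[str, str]]]:
    """Build every ordering of the same question so only the sequencing differs."""
    return {order: build_prompt(question, image_refs, order) for order in ORDERS}
```

Because both variants carry identical text and images, any difference in a model's answers can be attributed to the sequencing rather than to the content.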
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
