MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting

Avinash Anand; Janak Kapuriya; Apoorv Singh; Jay Saraf; Naman Lal; Astha Verma; Rushali Gupta; Rajiv Shah

arXiv:2404.08704·cs.CL·April 16, 2024·2 cites

Open Access

TL;DR

This paper introduces MM-PhyQA, a high school-level multimodal physics question dataset, and proposes MI-CoT prompting, demonstrating improved multimodal reasoning performance with a maximum accuracy of 71.65%.

Contribution

The paper creates a new multimodal physics dataset and introduces MI-CoT prompting, enhancing LLMs' multi-image reasoning capabilities.

Findings

01. LLaVA-1.5 with MI-CoT achieved the highest accuracy of 71.65%.

02. Multimodal prompting significantly improves physics reasoning performance.

03. Fine-tuned models outperform zero-shot GPT-4 on the dataset.

Abstract

While Large Language Models (LLMs) can achieve human-level performance in various tasks, they continue to face challenges in multi-step physics reasoning tasks. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, which comprises well-constructed, high school-level multimodal physics problems. By evaluating the performance of publicly available contemporary LLMs, both with and without the multimodal elements of these problems, we aim to shed light on their capabilities. To generate answers for questions with multimodal input (in this case, images and text), we employed zero-shot prediction using GPT-4 and used LLaVA models (LLaVA and LLaVA-1.5), which were fine-tuned on our dataset. For evaluating the performance of LLMs…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Attention Is All You Need · Dropout · Adam · Position-Wise Feed-Forward Layer · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Label Smoothing