NEWTON: Are Large Language Models Capable of Physical Reasoning?

Yi Ru Wang; Jiafei Duan; Dieter Fox; Siddhartha Srinivasa

arXiv:2310.07018 · cs.CL · October 12, 2023 · 1 citation

Open Access

TL;DR

This paper introduces NEWTON, a comprehensive benchmark and repository to evaluate large language models' physical reasoning abilities, highlighting their strengths and limitations in understanding everyday objects and attributes.

Contribution

The paper presents a new benchmark and dataset for assessing physical reasoning in LLMs, along with a pipeline for domain-specific customization, filling a gap in existing evaluation methods.

Findings

01. GPT-4 shows strong scenario-based reasoning capabilities.

02. LLMs are less consistent than humans in object-attribute reasoning.

03. The benchmark enables targeted evaluation and improvement of physical reasoning in LLMs.

Abstract

Large Language Models (LLMs), through their contextualized representations, have been empirically proven to encapsulate syntactic, semantic, word sense, and common-sense knowledge. However, there has been limited exploration of their physical reasoning abilities, specifically concerning the crucial attributes for comprehending everyday objects. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physics reasoning skills of LLMs. Further, to enable domain-specific adaptation of this benchmark, we present a pipeline to enable researchers to generate a variant of this benchmark that has been customized to the objects and attributes relevant for their application. The NEWTON repository comprises a collection of 2800 object-attribute pairs, providing the foundation for generating infinite-scale assessment templates. The NEWTON benchmark consists of 160K QA…
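
The abstract describes object-attribute pairs serving as the foundation for generating assessment templates. The snippet below is a minimal sketch of that general idea, not the authors' pipeline: the pairs, labels, template wording, and the `generate_questions` helper are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical object-attribute annotations (NOT taken from the NEWTON data).
PAIRS = {
    ("glass cup", "brittleness"): "high",
    ("pillow", "brittleness"): "low",
    ("pillow", "softness"): "high",
}

# Hypothetical question templates with {object} and {attribute} placeholders.
TEMPLATES = [
    "Is the {attribute} of a {object} high or low?",
    "Would you describe a {object} as having high or low {attribute}?",
]


def generate_questions(pairs, templates):
    """Fill every template with every annotated object-attribute pair."""
    questions = []
    # Sort the pairs so the output order is deterministic.
    for ((obj, attr), label), template in product(sorted(pairs.items()), templates):
        questions.append({
            "question": template.format(object=obj, attribute=attr),
            "answer": label,
        })
    return questions


questions = generate_questions(PAIRS, TEMPLATES)
for q in questions:
    print(q["question"], "->", q["answer"])
```

Because each question is the product of a pair and a template, the number of generated items grows multiplicatively: adding objects, attributes, or phrasings enlarges the question set without any new manual annotation beyond the pair labels.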

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Byte Pair Encoding · Linear Layer · Label Smoothing · Residual Connection · Adam · Absolute Position Encodings · Layer Normalization