Can Language Models Understand Physical Concepts?
Lei Li, Jingjing Xu, Qingxiu Dong, Ce Zheng, Qi Liu, Lingpeng Kong, Xu Sun

TL;DR
This paper investigates whether language models (LMs) can understand physical concepts. It introduces the VEC benchmark, analyzes LM performance on it, and proposes a method for transferring embodied knowledge from vision-language models to LMs.
Contribution
The paper introduces the VEC benchmark for physical concept understanding, analyzes LM performance, and proposes a distillation method to transfer embodied knowledge from vision-language models.
Findings
Scaling up LMs improves understanding of some visual concepts.
Vision-augmented LMs like CLIP and BLIP achieve human-level understanding of embodied concepts.
A distillation method transfers embodied knowledge, boosting LM performance significantly.
Abstract
Language models (LMs) gradually become general-purpose interfaces in the interactive and embodied world, where the understanding of physical concepts is an essential prerequisite. However, it is not yet clear whether LMs can understand physical concepts in the human world. To investigate this, we design a benchmark VEC that covers the tasks of (i) Visual concepts, such as the shape and material of objects, and (ii) Embodied Concepts, learned from the interaction with the world such as the temperature of objects. Our zero (few)-shot prompting results show that the understanding of certain visual concepts emerges as scaling up LMs, but there are still basic concepts to which the scaling law does not apply. For example, OPT-175B performs close to humans with a zero-shot accuracy of 85% on the material concept, yet behaves like random guessing on the mass concept. Instead, vision-augmented…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · BLIP: Bootstrapping Language-Image Pre-training
