Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture

Hong Lu; Hengxu Li; Prithviraj Singh Shahani; Stephanie Herbers; Matthias Scheutz

arXiv:2502.04558·cs.RO·February 10, 2025

TL;DR

This paper investigates how a vision-language-action (VLA) model encodes symbolic information and how that information can be integrated into a cognitive architecture to improve interpretability and robustness in robotic manipulation.

Contribution

It uncovers symbolic representations within OpenVLA's layers and demonstrates their integration into a cognitive architecture for enhanced interpretability.
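Probing of this kind is commonly done by training a small classifier on the hidden activations of a single layer to predict whether a symbolic predicate holds. The sketch below assumes a linear (logistic-regression) probe trained by full-batch gradient descent; the probe architecture, the hyperparameters, and the synthetic activations (standing in for real OpenVLA hidden states) are all assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def train_linear_probe(features, labels, lr=0.1, epochs=500, l2=1e-3):
    """Fit a logistic-regression probe (weights w, bias b) by full-batch gradient descent.

    features: (n_samples, hidden_dim) activations from one layer (hypothetical input).
    labels:   (n_samples,) 0/1 truth values of one symbolic predicate.
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = features @ w + b
        # Numerically stable sigmoid: sigmoid(x) == 0.5 * (1 + tanh(x / 2)).
        probs = 0.5 * (1.0 + np.tanh(0.5 * logits))
        err = probs - labels
        # Gradient of mean binary cross-entropy, plus an L2 penalty on w.
        w -= lr * (features.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of samples where the probe's 0/1 prediction matches the label."""
    preds = (features @ w + b) > 0.0
    return float((preds == labels.astype(bool)).mean())


# Synthetic stand-in for one layer's activations: the predicate is linearly
# decodable along a hidden random direction (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
hidden_dim, n_train, n_test = 64, 800, 200
direction = rng.normal(size=hidden_dim)
X = rng.normal(size=(n_train + n_test, hidden_dim))
y = (X @ direction > 0).astype(float)
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

w, b = train_linear_probe(X_train, y_train)
acc = probe_accuracy(w, b, X_test, y_test)
print(f"held-out probe accuracy: {acc:.2f}")
```

Repeating this fit once per layer, with activations extracted from each layer in turn, yields a layer-wise accuracy curve for every predicate.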

Findings

01

High probing accuracy (> 0.90) for both object and action states across most layers.

02

No evidence that object states are encoded at earlier layers than action states.

03

Successful real-time state monitoring using symbolic representations.
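Findings 01 and 02 concern where along the layer stack each kind of state becomes decodable. A minimal sketch of how such an onset comparison could be computed from per-layer probe accuracies is shown here; the function names and the toy accuracy values are hypothetical and are not results from the paper. The 0.90 threshold mirrors the accuracy level quoted above.

```python
from typing import Optional, Sequence, Tuple


def first_layer_above(layer_accuracies: Sequence[float], threshold: float = 0.90) -> Optional[int]:
    """Index of the first layer whose probe accuracy exceeds `threshold`, or None."""
    for layer, acc in enumerate(layer_accuracies):
        if acc > threshold:
            return layer
    return None


def compare_onsets(
    object_accs: Sequence[float], action_accs: Sequence[float], threshold: float = 0.90
) -> Tuple[Optional[int], Optional[int], Optional[bool]]:
    """Return (object onset, action onset, whether object states appear strictly earlier)."""
    obj_layer = first_layer_above(object_accs, threshold)
    act_layer = first_layer_above(action_accs, threshold)
    if obj_layer is None or act_layer is None:
        # At least one state type never crosses the threshold; no comparison possible.
        return obj_layer, act_layer, None
    return obj_layer, act_layer, obj_layer < act_layer


# Toy per-layer accuracies for illustration only (NOT numbers from the paper).
toy_object = [0.62, 0.91, 0.94, 0.95]
toy_action = [0.58, 0.93, 0.95, 0.96]
print(compare_onsets(toy_object, toy_action))
```

With these toy curves both state types cross the threshold at the same layer, so the "object states earlier" check comes out False, which is the kind of outcome Finding 02 describes.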

Abstract

Vision-language-action (VLA) models hold promise as generalist robotics solutions by translating visual and linguistic inputs into robot actions, yet they lack reliability due to their black-box nature and sensitivity to environmental changes. In contrast, cognitive architectures (CA) excel in symbolic reasoning and state monitoring but are constrained by rigid predefined execution. This work bridges these approaches by probing OpenVLA's hidden layers to uncover symbolic representations of object properties, relations, and action states, enabling integration with a CA for enhanced interpretability and robustness. Through experiments on LIBERO-spatial pick-and-place tasks, we analyze the encoding of symbolic states across different layers of OpenVLA's Llama backbone. Our probing results show consistently high accuracies (> 0.90) for both object and action states across most layers,…
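The abstract describes feeding probed symbolic states into a CA for state monitoring. A minimal sketch of one way this could work: assume each probe reports a truth value for a named predicate, and each action carries symbolic preconditions and expected effects that the CA checks before and after execution. `ActionSpec`, `monitor`, and the predicate strings are hypothetical illustrations, not the authors' integration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ActionSpec:
    """Hypothetical symbolic action description used by the monitor."""
    name: str
    preconditions: Dict[str, bool]  # predicate -> required truth value before execution
    effects: Dict[str, bool]        # predicate -> expected truth value after execution


def violations(expected: Dict[str, bool], observed: Dict[str, bool]) -> List[str]:
    """Predicates whose probed value disagrees with, or is missing from, the expectation."""
    return [p for p, v in expected.items() if observed.get(p) != v]


def monitor(action: ActionSpec, before: Dict[str, bool], after: Dict[str, bool]) -> str:
    """Classify one execution step from probe outputs taken before and after the action."""
    if violations(action.preconditions, before):
        return "precondition_failed"
    if violations(action.effects, after):
        return "effect_not_achieved"
    return "ok"


# Hypothetical pick-and-place step; the dicts stand in for probe outputs.
pick_place = ActionSpec(
    name="pick_place(bowl, plate)",
    preconditions={"on(bowl, plate)": False},
    effects={"on(bowl, plate)": True},
)
print(monitor(pick_place, before={"on(bowl, plate)": False}, after={"on(bowl, plate)": False}))
```

In this sketch a failed effect check is where a CA could trigger replanning or a retry instead of letting the VLA continue blindly.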

Taxonomy

Topics: Semantic Web and Ontologies

Methods: LLaMA