Exploring State Tracking Capabilities of Large Language Models

Kiamehr Rezaee; Jose Camacho-Collados; Mohammad Taher Pilehvar

arXiv:2511.10457·cs.CL·November 14, 2025

Open Access

TL;DR

This paper evaluates how well large language models track the state of multiple entities. Recent models such as GPT-4 and Llama3 can track state, especially when combined with Chain of Thought, while earlier models solve the initial steps but often fail after a certain number of steps.

Contribution

It introduces a benchmark built on three well-defined state tracking tasks and analyses how LLMs perform on it in different scenarios, showing where recent and earlier models succeed and where they fail.

Findings

01

GPT-4 and Llama3 are capable of tracking state, especially when combined with Chain of Thought.

02

Earlier models understand the task and solve its initial stages, but often fail after a certain number of steps.

03

Recent models maintain performance longer than older ones.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.
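
To make the notion of state tracking concrete, here is a minimal sketch of a toy task in which several holders swap items and the model is asked who holds what at the end. It is an illustrative assumption, not one of the three tasks in the paper's benchmark, which this page does not name. The names `make_episode`, `render_prompt`, `accuracy_by_depth`, and `predict` are hypothetical; `predict` stands in for whatever model call one would use.

```python
import random


def make_episode(entities, holders, n_steps, seed=0):
    """Build a toy episode: each holder starts with one entity, then pairs swap.

    Returns the list of swaps and a snapshot of the state after each swap.
    """
    rng = random.Random(seed)
    state = dict(zip(holders, entities))
    steps, states = [], []
    for _ in range(n_steps):
        a, b = rng.sample(holders, 2)  # two distinct holders
        state[a], state[b] = state[b], state[a]
        steps.append((a, b))
        states.append(dict(state))  # copy, so later swaps do not alter it
    return steps, states


def render_prompt(entities, holders, steps):
    """Turn an episode into a plain-text question about the first holder."""
    lines = [f"{h} has the {e}." for h, e in zip(holders, entities)]
    lines += [f"{a} and {b} swap items." for a, b in steps]
    lines.append(f"What does {holders[0]} have now?")
    return "\n".join(lines)


def accuracy_by_depth(predict, entities, holders, max_steps, n_trials=20):
    """Score a hypothetical predict(prompt) -> str at each number of steps."""
    accuracy = {}
    for depth in range(1, max_steps + 1):
        correct = 0
        for trial in range(n_trials):
            steps, states = make_episode(entities, holders, depth, seed=trial)
            gold = states[-1][holders[0]]
            if predict(render_prompt(entities, holders, steps)) == gold:
                correct += 1
        accuracy[depth] = correct / n_trials
    return accuracy
```

Plotting the output of `accuracy_by_depth` against the number of steps is one way to look for the pattern the abstract describes, where a model answers correctly at the initial stages and degrades after a certain number of steps.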

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques