The Impact of Element Ordering on LM Agent Performance
Wayne Chi, Ameet Talwalkar, Chris Donahue

TL;DR
This paper investigates how the order of UI elements affects language model agents' performance in virtual environments, revealing that proper ordering significantly improves task completion, especially in pixel-based settings.
Contribution
It demonstrates the importance of element ordering for agent performance, introduces a dimensionality reduction method for ordering elements in pixel-only environments, and uses that method to improve task success rates on the OmniACT benchmark.
Findings
Randomizing element order degrades performance similarly to removing text.
Dimensionality reduction provides effective ordering in pixel-only environments.
The proposed method doubles task completion rates on the OmniACT benchmark.
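To make the idea of ordering by dimensionality reduction concrete, here is a minimal sketch that projects the 2-D centers of detected elements onto one dimension and sorts by that value. It is not the authors' implementation: the reviews indicate the paper uses t-SNE, whereas this sketch swaps in a closed-form principal-axis (PCA) projection so that it runs with only the Python standard library. The `(label, center_x, center_y)` element format and the function name are assumptions made for illustration.

```python
import math
from typing import List, Tuple

# Hypothetical element format: (label, center_x, center_y) in pixel coordinates.
Element = Tuple[str, float, float]


def order_by_principal_axis(elements: List[Element]) -> List[Element]:
    """Sort elements by their 1-D projection onto the first principal axis.

    PCA is used here as a simple stand-in for the t-SNE reduction
    mentioned in the reviews; it is not the paper's method.
    """
    n = len(elements)
    if n < 2:
        return list(elements)

    mean_x = sum(e[1] for e in elements) / n
    mean_y = sum(e[2] for e in elements) / n

    # Entries of the (unnormalized) 2x2 covariance matrix of the centers.
    sxx = sum((e[1] - mean_x) ** 2 for e in elements)
    syy = sum((e[2] - mean_y) ** 2 for e in elements)
    sxy = sum((e[1] - mean_x) * (e[2] - mean_y) for e in elements)

    # Closed-form angle of the dominant eigenvector of a 2x2 symmetric matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)

    # An eigenvector has no inherent sign; orient it toward the bottom-right
    # so that the ordering runs left-to-right or top-to-bottom.
    if ux + uy < 0:
        ux, uy = -ux, -uy

    def projection(e: Element) -> float:
        return (e[1] - mean_x) * ux + (e[2] - mean_y) * uy

    return sorted(elements, key=projection)


if __name__ == "__main__":
    toolbar = [("Save", 300.0, 20.0), ("File", 20.0, 22.0), ("Edit", 150.0, 18.0)]
    print([label for label, _, _ in order_by_principal_axis(toolbar)])
```

A linear projection like this only captures one dominant layout direction. A nonlinear method such as t-SNE can keep spatially clustered elements adjacent in the ordering, which is presumably why the paper prefers it.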
Abstract
There has been a surge of interest in language model agents that can navigate virtual environments such as the web or desktop. To navigate such environments, agents benefit from information on the various elements (e.g., buttons, text, or images) present. It remains unclear which element attributes have the greatest impact on agent performance, especially in environments that only provide a graphical representation (i.e., pixels). Here we find that the ordering in which elements are presented to the language model is surprisingly impactful--randomizing element ordering in a webpage degrades agent performance comparably to removing all visible text from an agent's state representation. While a webpage provides a hierarchical ordering of elements, there is no such ordering when parsing elements directly from pixels. Moreover, as tasks become more challenging and models more sophisticated,…
Peer Reviews
Decision: Submitted to ICLR 2025
1. The paper provides the first in-depth investigation of how element ordering affects agent performance and demonstrates its significance.
2. The proposed dimensionality reduction-based ordering method performs well in pixel-only scenarios and achieves new state-of-the-art performance on the OmniACT agent benchmark.
3. The paper introduces a UI element detection model which has been made publicly available for other researchers to use and improve upon.
1. While the paper focuses on the t-SNE dimensionality reduction ordering method, it lacks in-depth analysis and comparison with other ordering methods. Additionally, all ordering methods show significant performance gaps compared to Pre-ordering (Table 4).
2. The paper briefly introduces the training process of the UI element detection model but lacks more detailed specifics.
1. Reveals the Critical Impact of Element Order on Agent Performance: Through systematic experiments, the paper effectively demonstrates the significant influence of element ordering on language model agents operating in pixel-only environments. This finding presents a new perspective for developing efficient virtual environment navigation algorithms. Previous research has often concentrated on accuracy in image recognition and text analysis, overlooking the importance of element order in contex
1. While the baseline of random ordering is understandable as detrimental to large language models in interpreting UI, it is overly simplistic and is not a strong baseline. The study would benefit from incorporating more heuristic baselines. For instance, could a vision-language model, such as GPT-4V, assist in determining an optimal ordering when seeing the UI directly?
2. In Table 6, the proposed method consistently outperforms other ordering techniques only when elements are detected using F
- Originality: The paper tackles a relatively unexplored area—optimizing UI element ordering for LM agents in pixel-only environments. The use of t-SNE for ordering based on spatial relationships is a novel application, offering a fresh perspective on improving agent navigation performance. - Quality: The research employs ablation studies on VisualWebArena and OmniACT. The methodological depth and comparison across multiple ordering methods highlight the improvement of the approach. - Clarity: T
- The paper could clarify its discussion of the dimensionality reduction approach, specifically addressing the parameters used in t-SNE. As t-SNE can be sensitive to parameter tuning, a detailed analysis of how parameter choices affect ordering outcomes would strengthen the validity of the results. - Another weakness is the limited exploration of alternative ordering methods. While t-SNE provides performance gains, the study would benefit from a broader examination of ordering techniques, especi
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Robotic Path Planning Algorithms
