Hierarchical Instruction-aware Embodied Visual Tracking

Kui Wu; Hao Chen; Churan Wang; Fakhri Karray; Zhoujun Li; Yizhou Wang; Fangwei Zhong

arXiv:2505.20710·cs.CV·May 28, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces HIEVT, a hierarchical agent that improves embodied visual tracking by translating user instructions into spatial goals and using reinforcement learning for robust, generalizable target tracking across diverse environments.

Contribution

The paper proposes a novel hierarchical framework that bridges instruction understanding and action generation using spatial goals, enhancing generalization and robustness in embodied visual tracking.

Findings

01. HIEVT outperforms existing methods in unseen environments.

02. The model demonstrates robustness across diverse target dynamics.

03. Real-world deployment confirms effectiveness and generalizability.

Abstract

User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advances in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or generalizability (VLAs) for UC-EVT tasks. To address these challenges, we propose the Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT) agent, which bridges instruction comprehension and action generation using spatial goals as intermediaries. HIEVT first introduces an LLM-based Semantic-Spatial Goal Aligner to translate diverse human instructions into spatial goals that directly annotate the desired spatial position. Then the RL-based Adaptive…
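The two-level idea in the abstract (instruction, then spatial goal, then action) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `SpatialGoal`, `align_instruction`, and `low_level_action` are hypothetical names, the keyword lookup is only a stand-in for the LLM-based aligner, and the proportional controller is only a stand-in for the RL-based policy, which the truncated abstract does not describe. The goal representation (normalized image position plus apparent size) is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialGoal:
    """Hypothetical goal: where the target should appear in the image.

    All fields are normalized to [0, 1]. This layout is an assumption,
    not the representation used in the paper.
    """
    cx: float     # desired horizontal center of the target
    cy: float     # desired vertical center of the target
    scale: float  # desired apparent size, a rough proxy for distance


def align_instruction(instruction: str, current: SpatialGoal) -> SpatialGoal:
    """Stand-in for the LLM-based aligner: map an instruction to a spatial goal.

    A real aligner would query a language model; a keyword lookup is used
    here only so that the sketch runs without external services.
    """
    text = instruction.lower()
    cx, cy, scale = current.cx, current.cy, current.scale
    if "closer" in text:
        scale = min(1.0, scale + 0.1)
    if "farther" in text:
        scale = max(0.05, scale - 0.1)
    if "left" in text:
        cx = max(0.0, cx - 0.2)
    if "right" in text:
        cx = min(1.0, cx + 0.2)
    return SpatialGoal(cx, cy, scale)


def low_level_action(observed: SpatialGoal, goal: SpatialGoal,
                     gain: float = 1.0) -> tuple[float, float]:
    """Stand-in for the RL policy: a proportional controller toward the goal.

    Returns (turn, forward). A positive turn rotates the camera right, which
    shifts the target left in the image; a positive forward value moves the
    agent closer, which makes the target appear larger.
    """
    turn = gain * (observed.cx - goal.cx)
    forward = gain * (goal.scale - observed.scale)
    return turn, forward


# The slow step runs once per instruction; the fast step runs once per frame.
observed = SpatialGoal(cx=0.5, cy=0.5, scale=0.3)
goal = align_instruction("get closer and keep the target on the left", observed)
turn, forward = low_level_action(observed, goal)
```

The split mirrors the motivation stated in the abstract: the expensive instruction-understanding step is kept off the per-frame control path, and the controller only ever sees a compact spatial goal.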

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This paper studies a very important problem, bridging the gap between user-friendly goals and goals for Embodied Visual Tracking models.
2. The proposed method, Embodied Visual Tracking task, is very elegant and well-established.
3. The evaluation of the paper shows the proposed method is very effective.

Weaknesses

However, there are two severe issues with the writing in this paper:
1. The motivation and the method are not matched. In the Introduction (lines 62-74), the authors list three weaknesses of current models: 1) Limited Comprehension, 2) Limited Generalization on unseen data, and 3) Inference Latency. However, the proposed method seems to address only the weakness of Limited Comprehension, ignoring the other two.
2. The whole paper misuses the commands **\citet** and **\citep**. For exam

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The asynchronous planning strategy sounds reasonable and has the potential to replace the building components with better ones.
- The approach achieves strong performance over various baselines.
- The paper is generally well-structured and easy to follow.

Weaknesses

- The authors argue that user-centric EVT is one of their contributions, but its description is rather unclear and appears only in the Appendix. How is "user-centric" EVT defined, and how does it differ from previous EVT setups?
- Following targets has already been explored in prior work such as Puig et al. 2024. This paper aims to follow humans, but a simple extension from humans to animals can be done by modifying the object meshes. Can the authors clarify the difference from this?
- Puig et al., "Habi

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The combination of LLM-based semantic reasoning with RL-based low-level control is novel and well-motivated. The use of parallel modules (CoT for spatial reasoning and RAG for memory augmentation) enables interpretable goal generation and efficient decision-making.
2. Strong experimental results and real-time performance. The proposed method outperforms both traditional and state-of-the-art VLA-based baselines in multiple simulated and real-world environments, achieving near-perfect success

Weaknesses

1. The method is only tested on the authors' self-constructed dataset, without comparison on existing EVT benchmarks such as Gym-UnrealCV. This makes it difficult to assess how well the approach generalizes beyond their setup and gives the impression that the model may be overfit to the custom environment.
2. Unclear motivation and missing ablation study for design choices. The paper introduces several components but doesn't clearly explain their purpose or justify their inclusion through ablati

Taxonomy

Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Video Surveillance and Tracking Methods

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings