Universal Retrieval for Multimodal Trajectory Modeling
Xuan Zhang, Ziyan Jiang, Rui Meng, Yifei Leng, Zhenbang Xiao, Zora Zhiruo Wang, Yanyi Shang, Dehan Kong

TL;DR
This paper introduces a multimodal trajectory retrieval framework that combines vision-language models with contrastive learning and outperforms baselines in retrieval recall on trajectories drawn from diverse real-world scenarios.
Contribution
It presents GAE-Retriever, a novel multimodal retrieval framework, along with the UATD dataset and GAE-Bench benchmark, advancing trajectory modeling and retrieval in AI agents.
Findings
GAE-Retriever outperforms baselines in retrieval recall.
The authors construct UATD, a large-scale unified dataset of agent trajectories.
The authors establish GAE-Bench, a benchmark for evaluating trajectory retrieval methods.
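The recall metric named in the findings can be illustrated with a small sketch. This is not the paper's evaluation code: the function name `recall_at_k`, the toy embeddings, and the use of a plain dot product as the similarity score are all assumptions made for illustration. The sketch assumes each query has exactly one correct candidate.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as an assumed similarity score."""
    return sum(x * y for x, y in zip(a, b))


def recall_at_k(
    query_vecs: List[Sequence[float]],
    cand_vecs: List[Sequence[float]],
    gold: List[int],
    k: int,
) -> float:
    """Fraction of queries whose gold candidate appears in the top-k results."""
    hits = 0
    for q, gold_idx in zip(query_vecs, gold):
        scores = [dot(q, c) for c in cand_vecs]
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(cand_vecs)), key=lambda i: scores[i], reverse=True)
        if gold_idx in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy data (hypothetical): three queries, three candidates.
queries = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
gold = [0, 1, 1]  # index of the correct candidate for each query

print(recall_at_k(queries, candidates, gold, k=1))
print(recall_at_k(queries, candidates, gold, k=2))
```

In this toy case the third query ranks its gold candidate second, so it counts as a miss at k=1 and a hit at k=2.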
Abstract
Trajectory data, capturing human actions and environmental states across various modalities, holds significant potential for enhancing AI agent capabilities, particularly in GUI environments. However, how to model the representation of trajectory-level data presents a significant challenge that has not been systematically addressed amid explosive trajectory data growth. In this work, we introduce Multimodal Trajectory Retrieval, bridging the gap between universal retrieval and agent-centric trajectory modeling. We construct the Unified Agent Trajectory Dataset (UATD) from annotated demonstrations and states across diverse real-world scenarios. Based on this, we present GAE-Bench, a benchmark containing a large number of trajectory-based retrieval pairs. In addition, we propose GAE-Retriever, a multimodal retrieval framework that adopts vision-language models and incorporates optimized…
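The summary above says the framework relies on contrastive learning, but the truncated abstract does not state the exact objective. As a rough sketch only, the following shows a common contrastive formulation (InfoNCE with in-batch negatives), which is an assumption here and may differ from what GAE-Retriever uses. The function name `info_nce`, the `temperature` value, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as an assumed similarity score."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(
    query_vecs: List[Sequence[float]],
    target_vecs: List[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """Mean InfoNCE loss where target i is the positive for query i.

    All other targets in the batch act as negatives (an assumption).
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, t) / temperature for t in target_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): aligned pairs vs. deliberately mismatched pairs.
query_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_targets = [[1.0, 0.0], [0.0, 1.0]]
mismatched_targets = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(query_embeddings, aligned_targets)
mismatched_loss = info_nce(query_embeddings, mismatched_targets)
print(aligned_loss < mismatched_loss)
```

The loss is small when each query is most similar to its own target and large when the pairing is wrong, which is the behaviour a contrastive retriever is trained toward.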
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Management and Algorithms · Spatial Cognition and Navigation
