TriVLA: A Triple-System-Based Unified Vision-Language-Action Model with Episodic World Modeling for General Robot Control

Zhenyang Liu; Yongchong Gu; Sixiao Zheng; Yanwei Fu; Xiangyang Xue; Yu-Gang Jiang

arXiv:2507.01424·cs.RO·October 14, 2025

TL;DR

TriVLA introduces a novel triple-system architecture incorporating episodic world modeling, enabling robots to better understand, recall, and predict environmental dynamics for improved generalization and long-horizon planning in complex tasks.

Contribution

The paper presents one of the first formalized episodic world models in vision-language-action frameworks, integrating multimodal grounding, dynamic perception, and episodic memory for enhanced robot control.

Findings

1. Outperforms baseline models on standard benchmarks.
2. Operates efficiently at approximately 36 Hz.
3. Demonstrates strong long-horizon planning and open-ended understanding.
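To put the reported control rate in perspective, the following is a small arithmetic sketch of the time budget per control step at roughly 36 Hz. The helper names and the 10 ms per-pass latency are illustrative assumptions, not figures from the paper.

```python
def control_period_ms(rate_hz: float) -> float:
    """Time available for one control step, in milliseconds."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


def max_sequential_passes(rate_hz: float, pass_latency_ms: float) -> int:
    """Number of sequential network passes that fit in one control period.

    pass_latency_ms is a hypothetical per-pass latency, not a measured value.
    """
    if pass_latency_ms <= 0:
        raise ValueError("pass_latency_ms must be positive")
    return int(control_period_ms(rate_hz) // pass_latency_ms)


if __name__ == "__main__":
    period = control_period_ms(36.0)
    print(f"{period:.1f} ms per control step at 36 Hz")
    # Assumed 10 ms per pass, purely for illustration.
    print(max_sequential_passes(36.0, 10.0), "sequential passes fit in that budget")
```

At about 28 ms per step, only a handful of sequential network passes fit in one control period, which is why the number of passes per step matters for real-time operation.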

Abstract

Recent advances in vision-language models (VLMs) have enabled robots to follow open-ended instructions and demonstrate impressive commonsense reasoning. However, current vision-language-action (VLA) frameworks primarily rely on static representations and limited temporal context, restricting agents to short-horizon, reactive behaviors and hindering robust generalization in dynamic embodied environments. Inspired by cognitive neuroscience theories of episodic memory, we propose, to our knowledge, one of the first formalized episodic world models in VLA, enabling embodied robots to accumulate, recall, and predict sequential experiences. As an instantiation of this concept, our unified TriVLA realizes the episodic world model through a triple-system architecture: integrating multimodal grounding from a pretrained VLM (System 2) and temporally rich dynamics perception from a video diffusion…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 6·Confidence 4

Strengths

- The paper presents a coherent and well-motivated triple-system framework combining vision–language understanding and dynamic modeling, extending the traditional dual-system VLA paradigm.
- Results on multiple benchmarks (CALVIN, LIBERO, MetaWorld) are systematically compared against recent SOTA methods, showing consistent improvements.
- The inference design of System 3 (single forward pass instead of full denoising) enables real-time operation at 36 Hz, demonstrating good computational effi
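The contrast drawn in the last point, a single forward pass versus full denoising, can be illustrated with a toy sketch that only counts network calls. Everything here is hypothetical: `ToyVideoDenoiser`, its arithmetic, and the 50-step schedule are stand-ins chosen for illustration and do not reproduce the paper's System 3.

```python
import math
from typing import List, Tuple


class ToyVideoDenoiser:
    """Hypothetical stand-in for a video diffusion network (not the paper's model)."""

    def __init__(self) -> None:
        self.calls = 0  # number of forward passes executed so far

    def forward(self, latent: List[float], t: float) -> Tuple[List[float], List[float]]:
        self.calls += 1
        features = [math.tanh(v) for v in latent]   # pretend intermediate features
        noise_estimate = [t * v for v in latent]    # pretend noise prediction
        return noise_estimate, features


def full_denoising(model: ToyVideoDenoiser, latent: List[float], num_steps: int) -> List[float]:
    """Iterative sampling: one forward pass per denoising step."""
    for step in range(num_steps, 0, -1):
        t = step / num_steps
        noise_estimate, _ = model.forward(latent, t)
        latent = [v - n / num_steps for v, n in zip(latent, noise_estimate)]
    return latent


def single_pass_features(model: ToyVideoDenoiser, latent: List[float]) -> List[float]:
    """One forward pass; keep only the intermediate features."""
    _, features = model.forward(latent, 1.0)
    return features


if __name__ == "__main__":
    iterative, single = ToyVideoDenoiser(), ToyVideoDenoiser()
    full_denoising(iterative, [0.5, -1.0, 2.0], num_steps=50)  # assumed step count
    single_pass_features(single, [0.5, -1.0, 2.0])
    print(iterative.calls, "passes vs", single.calls, "pass")
```

Under these assumptions the per-step cost scales with the number of denoising steps in the iterative case and stays constant in the single-pass case.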

Weaknesses

- The introduction of the world model is the paper’s central claim, yet Table 4 does not include results for EMP + L-Policy without EDP. Without this comparison, the contribution of System 3 remains unconvincing. It would also be helpful to report CALVIN per-task scores (1–5) rather than only the average length, and to conduct similar ablations on LIBERO.
- The real-world experiments lack any quantitative metrics—only qualitative demonstrations are shown, which weakens the empirical evidence fo

Reviewer 02·Rating 4·Confidence 3

Strengths

The paper identifies a critical and widely recognized limitation of current robotic policies: their struggle with long-horizon reasoning due to a reliance on "static" representations. The analogy to cognitive neuroscience and "episodic memory" provides an intuitive motivation for an architecture that doesn't just perceive the present but also predicts the future. While the individual components (VLMs, VDMs, diffusion policies) are existing technologies, their explicit synthesis in this paralle

Weaknesses

The paper's most significant weakness is the ablation study in Table 4. The paper's entire argument is that its "Triple-System" (VLM+VDM+Policy) is superior to the "Dual-System" (VLM+Policy) it critiques in Figure 2. To prove this, the most crucial ablation baseline would be EMP + L-Policy (i.e., TriVLA without System 3 / EDP). This baseline is missing. Without it, the authors have not empirically demonstrated that adding the VDM (System 3) is superior to the "static" VLA model they aim to impro

Reviewer 03·Rating 4·Confidence 3

Strengths

The paper’s presentation of its method is clear overall. The paper obtains exceptional results when compared with baselines that incorporate future prediction (Tables 1 and 2).

Weaknesses

1. The paper lacks a problem statement/setup or an evaluation protocol. Section 3 introduces only the VLA model, but what is the problem? Is it supervised learning/imitation learning? If so, how are we evaluating it (train-and-test)? What exactly is “Zero-shot long-horizon evaluation” in Table 1’s caption?
2. System 3 is the major technical novelty, yet the paper lacks transparency on what data was used to fine-tune it. Section 4.2 mentions using “self-collected data” for fine-tuning but is uncl
