Implicit State Estimation via Video Replanning
Po-Chen Ko, Jiayuan Mao, Yu-Hsiang Fu, Hsien-Jeng Yeh, Chu-Rong Chen, Wei-Chiu Ma, Yilun Du, Shao-Hua Sun

TL;DR
This paper presents a novel video-based planning framework that dynamically updates and filters plans during interaction, enabling implicit state estimation and improved adaptability in complex manipulation tasks.
Contribution
It introduces an online updating method for video-based planning that filters failed plans, allowing implicit state estimation without explicit modeling of unknown variables.
Findings
Enhanced replanning performance in simulated manipulation tasks
Effective filtering of failed plans during interaction
Demonstrated adaptability in partially observed environments
Abstract
Video-based representations have gained prominence in planning and decision-making due to their ability to encode rich spatiotemporal dynamics and geometric relationships. These representations enable flexible and generalizable solutions for complex tasks such as object manipulation and navigation. However, existing video planning frameworks often struggle to adapt to failures at interaction time due to their inability to reason about uncertainties in partially observed environments. To overcome these limitations, we introduce a novel framework that integrates interaction-time data into the planning process. Our approach updates model parameters online and filters out previously failed plans during generation. This enables implicit state estimation, allowing the system to adapt dynamically without explicitly modeling unknown state variables. We evaluate our framework through extensive…
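The filtering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plans are short sequences of flattened frames, compares them with a mean squared pixel distance (one reviewer notes that the method relies on pixel-space distances for plan rejection), and uses a made-up threshold. The names `plan_distance`, `filter_failed`, and the toy push/pull plans are hypothetical, and the online update of model parameters is not shown.

```python
from typing import List, Sequence

# Hypothetical representation: a frame is a flattened image,
# and a video plan is a sequence of frames.
Frame = Sequence[float]
Plan = Sequence[Frame]


def plan_distance(a: Plan, b: Plan) -> float:
    """Mean squared pixel difference between two plans of equal shape."""
    if len(a) != len(b):
        raise ValueError("plans must have the same number of frames")
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same number of pixels")
        for pa, pb in zip(frame_a, frame_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count if count else 0.0


def filter_failed(candidates: List[Plan], failed: List[Plan],
                  threshold: float) -> List[Plan]:
    """Keep only candidates farther than `threshold` from every failed plan."""
    return [
        c for c in candidates
        if all(plan_distance(c, f) > threshold for f in failed)
    ]


# Toy door example: the "push" plan was tried and failed, so a freshly
# sampled push-like plan is rejected while the "pull" plan survives.
push = [[0.0, 0.0], [1.0, 1.0]]
pull = [[0.0, 0.0], [-1.0, -1.0]]
failed_push = [[0.0, 0.0], [0.9, 1.0]]  # earlier attempt that did not work

kept = filter_failed([push, pull], [failed_push], threshold=0.1)
```

In this toy setting, rejecting push-like plans leaves only the pull plan, which is the sense in which filtering failed plans narrows down the unknown state (push door versus pull door) without modeling it explicitly. The threshold value here is arbitrary and would need tuning in any real setting.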
Peer Reviews
Decision·Submitted to ICLR 2026
- The problem formulation is clear and well-motivated. The problem is important to the field. I especially liked the example given in the introduction of opening a door without knowing whether it should be pushed or pulled. This made it really clear what the conceptual challenge was.
- The approach is interesting.
- The experimental domains seem reasonably interesting/challenging.
- The paper proposes a new benchmark (Meta-World System Identification Benchmark) with 5 manipulation-style tasks.
There were many parts of the paper that I found lacked sufficient detail.
- In Sec. 4.1, in the Representing state embeddings section, it is stated that “we preprocess the dataset by grouping entries by object ID and selecting a successful trial as the canonical embedding e^o_j”. I don’t understand what these “canonical embeddings” are or how they are used, as the paper never explicitly revisits how these canonical embeddings are stored, updated, or surfaced at inference time, which made it h
Strengths
1: Novel implicit adaptation mechanism that integrates past failures into video-based planning without requiring explicit parameter estimation or belief models.
2: Comprehensive evaluation, including a new benchmark, ablations, and real-world experiments, showing strong performance and practical feasibility.
3: Modular and general framework that enhances existing video planning pipelines and leverages diffusion models for flexible plan generation.
Weaknesses
1: Computational overhead: Online embedding refinement and multiple video plan generations increase inference cost, limiting scalability in real-time or resource-constrained settings.
2: Limited theoretical analysis: While effective empirically, the method lacks formal guarantees on convergence, stability, or conditions under which implicit state estimation succeeds.
3: Dependence on visual similarity and pixel-space distances for plan rejection may struggle in cluttered scenes or
The paper addresses the interesting problem of the inability to adapt to failures and reason about uncertainty during interaction. It provides a framework that leverages past interaction videos and a simple rejection mechanism to guide future plans, mimicking human trial-and-error, while other approaches use complex explicit belief models. The paper provides a thorough experimental evaluation, but only in two scenarios: their new simulation benchmark (Meta-World System Identification) and the r
The framework attributes all failures to planning errors, assuming a near-perfect action module that can always execute the generated video plan. This omits real-world execution noise and control uncertainties, which could cause failures regardless of plan quality. The method relies solely on vision. It may struggle with tasks where the crucial state parameter is visually ambiguous (e.g., two objects with identical appearance but different masses) or requires other sensory modalities like touch
Taxonomy
Topics: AI-based Problem Solving and Planning · Robotic Path Planning Algorithms · Reinforcement Learning in Robotics
