Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding

Oriol Barbany; Adrià Colomé; Carme Torras

arXiv:2505.07600 · cs.RO · May 13, 2025

Open Access

TL;DR

This paper analyzes BiFold, a model that uses temporal context and end-to-end learning to improve garment state estimation and manipulation in cloth folding, especially in challenging scenarios such as crumpled garments or recovery from failed manipulations.

Contribution

It shows how incorporating temporal context enhances vision-language models for cloth manipulation, enabling better garment state understanding and action prediction.

Findings

01

Temporal context improves garment state estimation.

02

BiFold aligns text and image regions effectively.

03

The model maintains temporal consistency in its predictions.

Abstract

Manipulating clothes is challenging due to their complex dynamics, high deformability, and frequent self-occlusions. Garments exhibit a nearly infinite number of configurations, making explicit state representations difficult to define. In this paper, we analyze BiFold, a model that predicts language-conditioned pick-and-place actions from visual observations, while implicitly encoding garment state through end-to-end learning. To address scenarios such as crumpled garments or recovery from failed manipulations, BiFold leverages temporal context to improve state estimation. We examine the internal representations of the model and present evidence that its fine-tuning and temporal context enable effective alignment between text and image regions, as well as temporal consistency.
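The abstract does not say how temporal context is passed to the model. As a rough illustration only, the sketch below assumes a fixed-length window of previous observations and pick-and-place actions that is bundled with the current image and the language instruction. The names `Step`, `TemporalContext`, `max_steps`, and `model_input` are hypothetical and are not taken from BiFold.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Tuple

Pixel = Tuple[int, int]


@dataclass
class Step:
    """One past interaction (hypothetical structure, not BiFold's)."""
    image_id: str   # stand-in for an RGB observation
    pick: Pixel     # pixel where the garment was grasped
    place: Pixel    # pixel where it was released


class TemporalContext:
    """Keeps only the most recent `max_steps` interactions (an assumed design)."""

    def __init__(self, max_steps: int = 3) -> None:
        # A deque with maxlen silently discards the oldest entry when full.
        self.history: Deque[Step] = deque(maxlen=max_steps)

    def add(self, step: Step) -> None:
        self.history.append(step)

    def model_input(self, instruction: str, current_image_id: str) -> Dict:
        # Bundle the instruction, the current view, and the past steps
        # (oldest first) into a single input for a downstream policy.
        return {
            "instruction": instruction,
            "current": current_image_id,
            "past": list(self.history),
        }


ctx = TemporalContext(max_steps=2)
ctx.add(Step("img0", (10, 20), (30, 40)))
ctx.add(Step("img1", (15, 25), (35, 45)))
ctx.add(Step("img2", (12, 22), (32, 42)))
example = ctx.model_input("fold the left sleeve inward", "img3")
print([s.image_id for s in example["past"]])
```

With a window of two steps, the oldest observation (`img0`) is dropped once a third step is added. A policy that sees the remaining steps could, in principle, tell a failed fold from a fresh start, which is the kind of case the abstract highlights.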

Taxonomy

Topics: 3D Shape Modeling and Analysis · Advanced Materials and Mechanics · Robot Manipulation and Learning