ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning

Bangya Liu; Xinyu Gong; Zelin Zhao; Ziyang Song; Yulei Lu; Suhui Wu; Jun Zhang; Suman Banerjee; Hao Zhang

arXiv:2512.22854·cs.CV·March 27, 2026

TL;DR

ByteLoom is a Diffusion Transformer (DiT) framework for generating human-object interaction (HOI) videos in which the object stays geometrically consistent across views. It combines a Relative Coordinate Map (RCM) cache with progressive curriculum learning, which reduces reliance on fine-grained hand mesh annotations.

Contribution

The paper presents ByteLoom, a diffusion transformer-based approach with a Relative Coordinate Map cache and a progressive curriculum training strategy for improved HOI video generation.

Findings

01. Achieves geometrically consistent multi-view HOI videos.

02. Reduces dependency on detailed hand mesh annotations.

03. Maintains human identity and smooth motion in generated videos.

Abstract

Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms to inject multi-view information of the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object rendering, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain the object's geometric consistency and…

Code & Models

Datasets

byteloom-HOI/Mani4D_test
dataset · 5 dl

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Robot Manipulation and Learning