TL;DR
PoseGen is a framework for long human video generation that preserves subject identity and follows the motion of a driving video, using in-context LoRA finetuning and a segment-interleaved generation strategy. It is trained on only 33 hours of data.
Contribution
It introduces an in-context LoRA finetuning design that injects subject appearance at the token level and pose information at the channel level, and a segment-interleaved generation method that extends video length.
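The difference between token-level and channel-level conditioning can be illustrated with array shapes. The following is a minimal sketch, not the authors' implementation: it assumes that "token level" means appending the reference image as extra tokens in the sequence and that "channel level" means concatenating pose maps onto the video channels. The function name `build_inputs`, the zero-padding of the reference frame, and all tensor shapes are hypothetical.

```python
import numpy as np

def build_inputs(video_latent, ref_latent, pose_latent):
    """Combine video, reference, and pose latents into one token sequence.

    Assumed (hypothetical) shapes:
      video_latent: (T, C, H, W)   noisy video latents
      ref_latent:   (1, C, H, W)   latent of the single reference image
      pose_latent:  (T, Cp, H, W)  per-frame pose maps
    """
    t, c, h, w = video_latent.shape
    cp = pose_latent.shape[1]

    # Channel-level conditioning: stack pose maps onto each frame's channels.
    video_with_pose = np.concatenate([video_latent, pose_latent], axis=1)

    # The reference frame has no pose map, so pad its pose channels with
    # zeros to keep the channel count uniform (an assumption of this sketch).
    ref_pose_pad = np.zeros((1, cp, h, w), dtype=ref_latent.dtype)
    ref_padded = np.concatenate([ref_latent, ref_pose_pad], axis=1)

    # Token-level conditioning: prepend the reference frame along the frame
    # axis so its tokens share one attention sequence with the video tokens.
    frames = np.concatenate([ref_padded, video_with_pose], axis=0)

    # Flatten every spatial position of every frame into one token.
    tokens = frames.transpose(0, 2, 3, 1).reshape(-1, c + cp)
    return tokens
```

In this layout the reference image lengthens the sequence (more tokens), while pose widens each token (more channels), so a model attending over the sequence can read appearance from the reference tokens and motion from the pose channels.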
Findings
Outperforms state-of-the-art methods in identity fidelity, pose accuracy, and temporal consistency.
Operates effectively with only 33 hours of training data.
Achieves long-duration, pose-controllable human video generation.
Abstract
Generating temporally coherent, long-duration videos with precise control over subject identity and movement remains a fundamental challenge for contemporary diffusion-based models, which often suffer from identity drift and are limited to short video length. We present PoseGen, a novel framework that generates human videos of extended duration from a single reference image and a driving video. Our contributions include an in-context LoRA finetuning design that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, we introduce a segment-interleaved generation strategy, where non-overlapping segments are first generated with improved background consistency through a shared KV-cache mechanism, and then stitched into a continuous…
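The first step of the segment strategy, splitting a long target video into non-overlapping segments, can be sketched as below. This is only an illustration under stated assumptions: the function name `plan_segments` and the segment length used in the example are hypothetical, and the sketch deliberately stops at marking the seams, because the visible text does not say how segments are stitched or how the shared KV-cache is applied.

```python
def plan_segments(total_frames, segment_len):
    """Partition frame indices into contiguous, non-overlapping segments.

    Returns (segments, seams), where segments is a list of half-open
    (start, end) frame ranges and seams lists the frame indices at which
    two neighbouring segments meet and would later need stitching.
    """
    if total_frames <= 0 or segment_len <= 0:
        raise ValueError("total_frames and segment_len must be positive")

    segments = [
        (start, min(start + segment_len, total_frames))
        for start in range(0, total_frames, segment_len)
    ]
    # Every segment boundary except the end of the video is a seam.
    seams = [end for _, end in segments[:-1]]
    return segments, seams
```

For example, `plan_segments(200, 81)` yields three segments covering frames 0 to 199 with seams at frames 81 and 162. The last segment is shorter than the others, which a real pipeline would need to handle.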
