PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Xiaofeng Mao; Shaohao Rui; Kaining Ying; Bo Zheng; Chuanhao Li; Mingmin Chi; Kaipeng Zhang

arXiv:2603.25730 · cs.CV · March 27, 2026

TL;DR

PackForcing introduces a hierarchical KV-cache strategy that enables efficient, long-duration video generation from short video training, achieving high temporal coherence and scalability on limited hardware.

Contribution

The paper proposes a novel three-partition KV-cache framework with dynamic context selection and temporal alignment, allowing high-quality long video synthesis from short video data.

Findings

01. Generates 2-minute videos at 16 FPS on a single GPU.

02. Achieves a 24x temporal extrapolation from 5s to 120s.

03. Outperforms previous methods in temporal consistency and dynamic degree.
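
The headline numbers in the findings can be checked with simple arithmetic. The short sketch below uses only the figures quoted above (5 s, 120 s, 16 FPS); the variable names are illustrative, and the frame count refers to output frames at the stated frame rate, not to latent frames inside the model.

```python
# Figures quoted in the findings above.
train_seconds = 5      # short clip length the extrapolation starts from
sample_seconds = 120   # 2-minute generated video
fps = 16               # stated output frame rate

# Ratio of generated duration to the short clip duration.
extrapolation = sample_seconds // train_seconds

# Total output frames in one 2-minute video at 16 FPS.
frames = sample_seconds * fps

print(extrapolation, frames)  # 24 1920
```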

Abstract

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing…
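
To make the three-partition idea concrete, here is a minimal sketch of a cache that keeps sink, mid, and recent tokens separately. It is an illustration under loud assumptions, not the authors' implementation: the class name `ThreePartitionCache`, the parameters `sink_size` and `recent_size`, and the `pending` buffer are all hypothetical, and simple mean pooling stands in for the paper's dual-branch compression network. Only the 32x reduction ratio comes from the abstract. Because the abstract is cut off before it explains how the memory footprint is bounded, the sketch makes no attempt to cap the mid partition.

```python
from dataclasses import dataclass, field
from typing import List

Token = List[float]  # stand-in for one cached key/value vector


def mean_pool(tokens: List[Token]) -> Token:
    """Average a group of tokens element-wise.

    Placeholder compressor only; the paper describes a dual-branch network.
    """
    n = len(tokens)
    return [sum(vals) / n for vals in zip(*tokens)]


@dataclass
class ThreePartitionCache:
    sink_size: int    # hypothetical: number of earliest tokens kept verbatim
    recent_size: int  # hypothetical: size of the full-resolution recent window
    ratio: int = 32   # 32x token reduction, as quoted in the abstract
    sink: List[Token] = field(default_factory=list)
    mid: List[Token] = field(default_factory=list)
    recent: List[Token] = field(default_factory=list)
    pending: List[Token] = field(default_factory=list)  # evicted, not yet compressed

    def append(self, token: Token) -> None:
        # The earliest tokens fill the sink partition and are never evicted.
        if len(self.sink) < self.sink_size:
            self.sink.append(token)
            return
        # Everything else enters the recent window at full resolution.
        self.recent.append(token)
        # When the window overflows, its oldest token waits for compression.
        if len(self.recent) > self.recent_size:
            self.pending.append(self.recent.pop(0))
        # Every `ratio` evicted tokens collapse into one mid token.
        if len(self.pending) == self.ratio:
            self.mid.append(mean_pool(self.pending))
            self.pending.clear()

    def context(self) -> List[Token]:
        # Simplification: tokens still in `pending` are left out of the context.
        return self.sink + self.mid + self.recent


cache = ThreePartitionCache(sink_size=2, recent_size=4)
for i in range(70):
    cache.append([float(i)])

print(len(cache.context()))  # 8: 2 sink + 2 mid + 4 recent
```

In this toy run, 70 incoming tokens are held as 8 context entries: the first two stay in the sink, the last four stay in the recent window, and the 64 tokens in between are pooled into two mid tokens.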

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Human Pose and Action Recognition