Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation

Binyuan Huang; Yuning Lu; Weinan Jia; Hualiang Wang; Mu Liu; Daiqing Yang

arXiv:2604.03738 · cs.CV · April 7, 2026

TL;DR

This paper introduces PoCo (Position Embedding as a Context Controller), a method that uses position encoding as an extra control signal in multi-reference and multi-shot video generation, reducing reference confusion and improving character consistency across shots.

Contribution

PoCo incorporates position encoding as an additional context control, enabling precise token matching and better handling of similar reference images in video generation.
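As a rough illustration of why position information can disambiguate look-alike references, the toy sketch below compares attention weights with and without a rotary position embedding (RoPE). It is not the authors' implementation: the use of standard 1-D RoPE, the position indices, the vector size, and all function names (`rope`, `attention_weights`) are assumptions made for this example only. Two reference tokens carry identical content, so content-only attention cannot tell them apart; tagging the query with the same position index as the intended reference shifts the weight toward it.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply standard 1-D rotary position embedding.

    x:   (n, d) array of token vectors, d even.
    pos: (n,) array of integer position indices.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per rotated pair
    angles = pos[:, None] * freqs[None, :]      # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention_weights(q, k):
    """Softmax of scaled dot-product scores, shape (n_q, n_k)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# Two reference tokens with identical content (hypothetical look-alike characters).
v = np.full(8, 1.5)
keys = np.stack([v, v])      # reference A, reference B
query = v[None, :]           # a video token that matches both by content alone

# Content-only attention: the two references are indistinguishable.
w_plain = attention_weights(query, keys)

# Assumed position assignment: reference A at index 0, reference B at index 50,
# and the query tagged with index 0 because it should read from reference A.
key_pos = np.array([0, 50])
query_pos = np.array([0])
w_rope = attention_weights(rope(query, query_pos), rope(keys, key_pos))

print("content only :", w_plain[0])
print("with position:", w_rope[0])
```

In this toy setting the content-only weights are split evenly, while the position-tagged query puts more weight on reference A. How PoCo actually assigns positions to reference and video tokens is not described here and may differ substantially from this sketch.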

Findings

01 PoCo improves cross-shot consistency in generated videos.

02 PoCo enhances reference fidelity compared to baseline models.

03 The method effectively reduces reference confusion in multi-reference scenarios.

Abstract

Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model's ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of reliably controlling…

Code & Models

Repositories

byhuang123/PoCo (GitHub)
