TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets

Zhixuan Liu; Peter Schaldenbrand; Yijun Li; Long Mai; Aniruddha Mahapatra; Cusuh Ham; Jean Oh; Jui-Hsien Wang

arXiv:2603.27520 · cs.CV · March 31, 2026

TL;DR

TokenDial introduces a method for continuous, attribute-specific control in text-to-video models by learning semantic offsets in token space, enabling predictable and high-quality edits without retraining.

Contribution

TokenDial proposes learned, attribute-specific token offsets as a mechanism for attribute control in text-to-video generation, improving controllability without retraining the backbone model.

Findings

01. Achieves stronger controllability than existing methods.

02. Produces higher-quality, predictable edits across diverse attributes.

03. Validated by extensive quantitative and human evaluations.

Abstract

We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drift in identity, background, or temporal coherence. TokenDial builds on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, where adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts,…
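
The core idea in the abstract, an additive offset in patch-token space whose magnitude acts as a slider, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function name `apply_offset`, the list-of-vectors token layout, and the choice of one offset vector per token are hypothetical. The abstract does not say at which layer the offset is applied or how it is parameterized, and learning the offset is not shown.

```python
# Hypothetical sketch: names, shapes, and values are illustrative assumptions,
# not the paper's implementation.
from typing import List

# One feature vector per spatiotemporal patch token (assumed layout).
Tokens = List[List[float]]


def apply_offset(tokens: Tokens, offset: Tokens, strength: float) -> Tokens:
    """Return tokens + strength * offset, computed element-wise.

    `strength` plays the role of the slider: 0.0 leaves the tokens unchanged,
    and larger magnitudes move them further along the offset direction.
    """
    if len(tokens) != len(offset):
        raise ValueError("tokens and offset must contain the same number of tokens")
    edited = []
    for tok, off in zip(tokens, offset):
        if len(tok) != len(off):
            raise ValueError("token and offset vectors must have the same width")
        edited.append([t + strength * o for t, o in zip(tok, off)])
    return edited


# Toy data: two tokens with two features each.
tokens = [[1.0, 2.0], [3.0, 4.0]]
offset = [[0.5, -1.0], [0.0, 2.0]]

for strength in (0.0, 0.5, 1.0):
    print(strength, apply_offset(tokens, offset, strength))
```

In this toy setting the edit is linear in `strength`, which is what makes the slider predictable. Whether the same holds after the remaining layers of a real video backbone is the empirical claim the paper evaluates.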
